🧬 Bad news for medical LLMs. This paper finds that top medical AI models often match patterns instead of truly reasoning: small wording tweaks cut accuracy by up to 38% on validated questions. The team took 100 MedQA questions, replaced the correct choice with "None of the other answers" (NOTA), then kept the 68 items where a clinician confirmed that switch as correct. If a model truly reasons, it should still reach the same clinical decision despite the label swap. They asked each model to explain its steps before answering and compared accuracy on the original versus modified items. All 6 models dropped on the NOTA set, the biggest hit was 38%, and even the reasoning models slipped. That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic. Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern use rather than clinical reasoning.
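For anyone who wants to rerun this swap on newer models, here's a minimal sketch of the protocol as the post describes it. The `query_model` callable, the item dicts, the prompt wording, and the letter-matching scorer are all my assumptions for illustration, not the paper's actual harness or data.

```python
import re

# Label text substituted for the correct option (per the post).
NOTA = "None of the other answers"

def make_nota_variant(item):
    """Replace the correct option's text with the NOTA string, so the
    clinically right decision now sits behind the NOTA label."""
    options = dict(item["options"])
    options[item["answer"]] = NOTA
    return {**item, "options": options}

def accuracy(items, query_model):
    """Ask for step-by-step reasoning before the answer, then score."""
    correct = 0
    for item in items:
        prompt = (
            "Explain your reasoning step by step, then state the letter "
            "of your final answer.\n\n"
            + item["question"] + "\n"
            + "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
        )
        reply = query_model(prompt)
        # Naive scoring assumption: take the last standalone option
        # letter in the reply as the model's final choice.
        picks = re.findall(r"\b([A-E])\b", reply)
        correct += bool(picks) and picks[-1] == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    # Illustrative item only; the paper's 68 clinician-validated MedQA
    # questions are not reproduced here.
    demo = [{
        "question": "A 54-year-old presents with crushing chest pain...",
        "options": {"A": "Aortic dissection", "B": "Myocardial infarction",
                    "C": "Pericarditis", "D": "GERD"},
        "answer": "B",
    }]
    modified = [make_nota_variant(it) for it in demo]

    def fake_model(prompt):
        # Stand-in for a real chat API call.
        return "Step 1: ... Final answer: B"

    # A robust reasoner should score the same on both sets.
    print(accuracy(demo, fake_model), accuracy(modified, fake_model))
```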
Interesting findings. This evaluation should probably be rerun with Gemini 2.5 Pro, GPT-5 Pro, and Claude 4.0. The models they used seem a little old now: "We evaluated 6 models spanning different architectures and capabilities: DeepSeek-R1 (model 1), o3-mini (reasoning models) (model 2), Claude-3.5 Sonnet (model 3), Gemini-2.0-Flash (model 4), GPT-4o (model 5), and Llama-3.3-70B (model 6)"
Replying to @AIWithRithesh
yes, many studies use old models, I guess to reduce their eval cost

Aug 29, 2025 · 7:29 AM UTC

No… because these studies are done with confirmation bias. The very people who run these studies NEED the AI to fail. I call them out. More games by humans who wish to obscure truth.
And cheaper models. Calling out limitations in Flash or mini models without testing more capable models is a disservice.
A symptom of the slow peer review process.
I don’t think reducing eval cost justifies it, since they only used 100 questions.
Really no excuse to use mini models (o3-mini). The title is misleading; it should say that distilled models, and those from the previous generation without specialized prompting, are not robust... duh
Ran it in Gemini 2.5 Pro and it reasoned the sample through right to the correct answer. Using Flash is just so far from representative of the SOTA, especially with regard to any semblance of actual "reasoning".
If this is true, why didn't they use o4-mini? It is the same price as o3-mini (plus cheaper caching). With only 100 questions, even using Gemini 2.5 Pro wouldn't have been extremely expensive.
That makes no sense, since a lot of the models used are simply more expensive than the current frontier ones.
Or maybe because doing research actually takes time?