Bad news for medical LLMs.
This paper finds that top medical AI models often match patterns instead of truly reasoning.
Small wording tweaks cut accuracy by up to 38% on validated questions.
The team took 100 MedQA questions, replaced the correct choice with "None of the other answers" (NOTA), then kept the 68 items where a clinician confirmed the swapped version was still correct.
If a model truly reasons, it should still reach the same clinical decision despite that label swap.
They asked each model to explain its steps before answering and compared accuracy on the original versus modified items.
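For anyone who wants to replicate the setup, here is a minimal sketch of the label swap and scoring loop. The `query_model` wrapper, the item fields, and the naive answer extraction are my own assumptions for illustration, not the paper's actual evaluation harness.

```python
# Minimal sketch of the NOTA-swap evaluation described above.
# Assumptions: `query_model` is a hypothetical wrapper around whichever LLM API
# you use; each MedQA item is a dict with "question", "options" (label -> text),
# and "answer" (the correct label). None of this is the paper's own code.

def make_nota_variant(item: dict) -> dict:
    """Replace the text of the correct option with 'None of the other answers'.

    The correct label stays the same, so a model that works through the
    clinical logic should still pick it."""
    options = dict(item["options"])
    options[item["answer"]] = "None of the other answers"
    return {**item, "options": options}


def accuracy(items: list[dict], query_model) -> float:
    """Ask the model to explain its steps before answering, then score it."""
    correct = 0
    for item in items:
        choices = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
        prompt = (
            f"{item['question']}\n\n{choices}\n\n"
            "Think step by step, then end with the letter of your final answer."
        )
        reply = query_model(prompt)             # hypothetical LLM call
        predicted = reply.strip()[-1].upper()   # naive answer extraction, sketch only
        correct += predicted == item["answer"]
    return correct / len(items)


# Compare accuracy on the clinician-validated subset, original vs. NOTA-modified:
# drop = accuracy(validated_items, query_model) - accuracy(
#     [make_nota_variant(i) for i in validated_items], query_model)
```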
All six models dropped on the NOTA set; the biggest hit was 38%, and even the reasoning models slipped.
That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic.
Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern use rather than clinical reasoning.
Interesting findings. This evaluation should probably be rerun with Gemini 2.5 Pro, GPT-5 Pro, and Claude 4.0, since the models they used seem a little dated now: "We evaluated 6 models spanning different architectures and capabilities: DeepSeek-R1 (model 1), o3-mini (reasoning models) (model 2), Claude-3.5 Sonnet (model 3), Gemini-2.0-Flash (model 4), GPT-4o (model 5), and Llama-3.3-70B (model 6)"
Aug 29, 2025 · 7:16 AM UTC