🧬 Bad news for medical LLMs. This paper finds that top medical AI models often match patterns instead of truly reasoning. Small wording tweaks cut accuracy by up to 38% on validated questions. The team took 100 MedQA questions, replaced the correct choice with "None of the other answers" (NOTA), then kept the 68 items where a clinician confirmed the swap was correct. If a model truly reasons, it should still reach the same clinical decision despite that label swap. They asked each model to explain its steps before answering and compared accuracy on the original versus modified items. All 6 models dropped on the NOTA set, the biggest hit was 38%, and even the reasoning models slipped. That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic. Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern matching rather than clinical reasoning.
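A minimal sketch of the NOTA perturbation described above (not the authors' code). The item format, option labels, and helper names (`ask_model`, `clinician_confirms`, `load_medqa_subset`) are assumptions for illustration only.

```python
def to_nota(item: dict) -> dict:
    """Replace the correct option's text with 'None of the other answers' (NOTA).

    `item` is assumed to look like:
        {"question": str, "options": {"A": str, "B": str, ...}, "answer": "C"}
    """
    options = dict(item["options"])
    options[item["answer"]] = "None of the other answers"
    # The gold label stays the same; only its text becomes the NOTA option.
    return {**item, "options": options}

def accuracy(model_answer, items):
    """Fraction of items where the (hypothetical) model picks the gold label."""
    return sum(model_answer(it) == it["answer"] for it in items) / len(items)

# Usage sketch: compare the same model on original vs. NOTA-modified items.
# `ask_model` would wrap a chain-of-thought prompt ("explain your steps, then answer").
# original_items = load_medqa_subset()                                   # 100 sampled MedQA questions
# nota_items = [to_nota(it) for it in original_items]
# validated = [it for it in nota_items if clinician_confirms(it)]        # 68 items kept in the paper
# drop = accuracy(ask_model, original_items) - accuracy(ask_model, validated)
# print(f"Accuracy drop under NOTA: {drop:.1%}")
```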
Too bad they couldn't test o3/2.5pro or gpt-5
Replying to @krasmanalderey
yes, they were probably reducing their eval cost

Aug 29, 2025 · 7:30 AM UTC

Replying to @rohanpaul_ai
Maybe, saw this with a couple of other benchmarks recently too... Maybe the frontier labs should have some way of offering free tokens to researchers for benchmarking a system (esp. the -Pro ones). Would help both sides significantly. (Might be tough to control for misuse, though.)
Yes, but reducing their eval costs at the expense of relevance. These models are no longer in use and significantly outdated. This would have been fine for 2024, but the industry has moved well past that. And yet people will cite this obsolete study, so why bother?