🧬 Bad news for medical LLMs. This paper finds that top medical AI models often match patterns instead of truly reasoning: small wording tweaks cut accuracy by up to 38% on validated questions.

The team took 100 MedQA questions, replaced the correct choice with "None of the other answers" (NOTA), then kept the 68 items where a clinician confirmed the swapped option was indeed correct. If a model truly reasons, it should reach the same clinical decision despite that label swap. Each model was asked to explain its steps before answering, and accuracy was compared on the original versus modified items.

All 6 models dropped on the NOTA set, the biggest hit was a 38% drop, and even the reasoning models slipped. That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic. Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern matching rather than clinical reasoning.
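To make the manipulation concrete, here is a minimal sketch of the NOTA-substitution protocol described above. The item format, prompt wording, and the `ask_model` helper are hypothetical placeholders, not the paper's actual harness.

```python
# Sketch of the NOTA-substitution evaluation (assumptions noted inline).

NOTA = "None of the other answers"

def make_nota_variant(item):
    """Replace the correct option's text with NOTA.

    `item` is assumed to look like:
    {"question": str, "options": {"A": str, ...}, "answer": "B"}
    After the swap, the NOTA option becomes the correct choice.
    """
    variant = {"question": item["question"], "options": dict(item["options"])}
    variant["options"][item["answer"]] = NOTA
    variant["answer"] = item["answer"]  # same letter, now pointing at NOTA
    return variant

def format_prompt(item):
    # Ask the model to explain its steps before committing to a letter,
    # mirroring the "explain then answer" setup described in the tweet.
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
    return (
        "Explain your reasoning step by step, then give the final answer "
        f"as a single letter.\n\n{item['question']}\n{opts}"
    )

def accuracy(items, ask_model):
    """`ask_model(prompt) -> predicted letter` is a hypothetical model call."""
    correct = sum(ask_model(format_prompt(it)) == it["answer"] for it in items)
    return correct / len(items)

def run_eval(items, ask_model):
    nota_items = [make_nota_variant(it) for it in items]
    return {
        "original_acc": accuracy(items, ask_model),
        "nota_acc": accuracy(nota_items, ask_model),
    }
```

A robust reasoner should score roughly the same in both columns; the reported drops are the gap between `original_acc` and `nota_acc`.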
What could solve that problem? Fine-tuning? A large RAG system or a knowledge graph to help?
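For the RAG option floated here, a minimal sketch of what grounding might look like: `retrieve` and `ask_model` are hypothetical placeholders for a medical-corpus retriever and an LLM call, and nothing in this sketch shows that grounding actually fixes NOTA-style brittleness.

```python
from typing import Callable, List

def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical search over a medical corpus
    ask_model: Callable[[str], str],            # hypothetical LLM call
    k: int = 5,
) -> str:
    # Pull k evidence passages and force the model to cite them,
    # so the answer leans on retrieved text rather than answer templates.
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Use only the evidence below. Cite passage numbers in your reasoning, "
        "then give the final answer.\n\n"
        f"Evidence:\n{context}\n\nQuestion:\n{question}"
    )
    return ask_model(prompt)
```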
Replying to @w1kke
Well, this is one negative study, and then there are hundreds of studies proving LLMs' capability in the medical space.

Aug 29, 2025 · 6:23 AM UTC

Replying to @rohanpaul_ai @w1kke
I think the study just looked at using retail LLMs with zero prompt engineering or evals, which of course all commercial products would have.
Replying to @rohanpaul_ai @w1kke
"capability" is not "reasoning" These models are highly capable of brute force pattern recognition and pastiche generation. That will never be reasoning, despite Sam Altman's supercilious claim that we are all just stochastic parrots.
Replying to @rohanpaul_ai @w1kke
This is a weird study. They manipulate MCQs on a standard exam. I don't think the conclusions follow from this without some more variations on the study. Also, these are not the current frontier models, by a large margin.