🧬 Bad news for medical LLMs.
This paper finds that top medical AI models often match patterns instead of truly reasoning.
Small wording tweaks cut accuracy by up to 38% on validated questions.
The team took 100 MedQA questions, replaced the correct choice with "None of the other answers" (NOTA), then kept the 68 items where a clinician confirmed the substitution was still the correct answer.
If a model truly reasons, it should still reach the same clinical decision despite that label swap.
They asked each model to explain its steps before answering and compared accuracy on the original versus modified items.
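For context on the setup, here's a minimal Python sketch of that NOTA substitution and scoring loop. The item format and the `ask` callable are assumptions for illustration, not the paper's actual harness.

```python
from typing import Callable

NOTA = "None of the other answers"

def make_nota_variant(item: dict) -> dict:
    """Swap the text of the correct option for NOTA, keeping the same
    letter as the gold answer (the paper then had a clinician confirm
    each swapped item was still answerable)."""
    options = dict(item["options"])
    options[item["answer"]] = NOTA
    return {**item, "options": options}

def accuracy(ask: Callable[[str], str], items: list[dict]) -> float:
    """Prompt for step-by-step reasoning, then score the returned letter.
    `ask` is assumed to send a prompt to a model and return its final
    answer letter ("A".."E")."""
    hits = 0
    for item in items:
        choices = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
        prompt = (
            "Explain your reasoning step by step, then state the letter "
            f"of your final answer.\n\n{item['question']}\n{choices}"
        )
        hits += ask(prompt).strip().upper() == item["answer"]
    return hits / len(items)

# A genuine reasoner should show a small gap; a pattern-matcher won't:
# gap = accuracy(ask, originals) - accuracy(ask, [make_nota_variant(q) for q in originals])
```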
All 6 models dropped on the NOTA set; the biggest hit was 38%, and even the reasoning models slipped.
That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic.
Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern use rather than clinical reasoning.
Another concerning finding on AI use in medicine.
AI assistance boosted detection during AI-guided cases, but when the same doctors later worked without AI, their detection rate fell from 28.4% before any AI exposure to 22.4% after.
The study, by researchers from Poland, Norway, Sweden, the U.K., and Japan, examines this de-skilling effect of AI.
While the AI is active, it boosts the adenoma detection rate (ADR) by 12.5%, which could translate into lives saved.
The problem is that without AI, detection falls below the level doctors had before they ever used it, according to research published in The Lancet Gastroenterology & Hepatology.
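A quick back-of-the-envelope makes the size of that drop concrete (the figures come from the study summary above; the variable names are mine):

```python
# Reported adenoma detection rates (ADR), in percent.
adr_before = 28.4  # detection before any AI exposure
adr_after = 22.4   # detection without AI assistance, after AI exposure

absolute_drop = adr_before - adr_after             # 6.0 percentage points
relative_drop = absolute_drop / adr_before * 100   # ~21% relative decline

print(f"{absolute_drop:.1f} pp absolute, {relative_drop:.0f}% relative drop")
```

In other words, unassisted detection fell by roughly a fifth relative to the pre-AI baseline.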
The study raises questions about AI in healthcare: when it helps, and when it could hurt.