📢 MASSIVE: This new paper shows GPT-5 (medium) now far exceeding pre-licensed human experts (by >20%) on medical reasoning and understanding benchmarks.
GPT-5 beats human experts on MedXpertQA multimodal by 24.23% in reasoning and 29.40% in understanding, and on MedXpertQA text by 15.22% in reasoning and 9.40% in understanding. 🔥
It compares GPT-5 head-to-head against human medical professionals and claims the AI is now ahead.
GPT-5 is tested as a single generalist system for medical question answering and visual question answering, using one simple zero-shot chain-of-thought setup.
⚙️ The Core Concepts
The paper positions GPT-5 as a generalist multimodal reasoner for decision support, meaning it reads clinical text, looks at images, and reasons step by step under the same setup.
The evaluation uses a unified protocol, so prompts, splits, and scoring are standardized to isolate model improvements rather than prompt tricks.
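To make the "one simple setup" idea concrete, here is a minimal sketch of what a unified zero-shot CoT harness could look like. It assumes the OpenAI Python client and a `gpt-5` model id; the prompt template, answer-extraction regex, and item format are my own illustration, not the paper's actual protocol (image inputs for the VQA items are omitted for brevity).

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One fixed zero-shot chain-of-thought template for every benchmark item,
# so scores reflect the model rather than per-dataset prompt engineering.
PROMPT = (
    "You are answering a medical exam question.\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Think step by step, then end with 'Answer: <letter>'."
)

def ask(question: str, options: dict, model: str = "gpt-5"):
    """One zero-shot CoT query; returns the extracted answer letter or None."""
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(question=question, options=opts)}],
    )
    text = resp.choices[0].message.content or ""
    m = re.search(r"Answer:\s*([A-E])", text)
    return m.group(1) if m else None

def accuracy(items: list, model: str = "gpt-5") -> float:
    """Same prompt and same scoring for every split: the unified-protocol idea."""
    hits = sum(ask(it["question"], it["options"], model) == it["answer"] for it in items)
    return hits / len(items)
```

The point of keeping everything fixed like this is that any score gap between datasets or models can be read as a model difference, not a prompting difference.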
---
My take: The medical sector takes one of the biggest shares of national budgets across the globe, even in the USA, where it surpasses military spending.
Once AI or robots can bring down costs, governments everywhere will quickly adopt them because it’s like gaining extra funds without sparking political controversy.
🧬 Bad news for medical LLMs.
This paper finds that top medical AI models often match patterns instead of truly reasoning.
Small wording tweaks cut accuracy by up to 38% on validated questions.
The team took 100 MedQA questions, replaced the correct choice with "None of the other answers" (NOTA), then kept the 68 items where a clinician confirmed that swap as correct.
If a model truly reasons, it should still reach the same clinical decision despite that label swap.
They asked each model to explain its steps before answering and compared accuracy on the original versus modified items.
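For intuition, the label swap itself is mechanical (the clinician validation is the manual step). A rough sketch, assuming MCQ items stored as dicts with `question`, `options`, and a gold `answer` letter; `to_nota` and `original_vs_nota` are illustrative names, not the paper's code:

```python
import copy

NOTA = "None of the other answers"

def to_nota(item: dict) -> dict:
    """Overwrite the gold option's text with NOTA. The gold letter is unchanged,
    but the original correct content is gone from the choices, so only working
    out that all remaining options are wrong leads to the right pick."""
    mod = copy.deepcopy(item)
    mod["options"][mod["answer"]] = NOTA
    return mod

def original_vs_nota(items: list, predict) -> tuple:
    """Accuracy on the original items vs. their NOTA-modified twins.
    `predict(item) -> letter` is any model query function; per the paper,
    only the 68 clinician-validated NOTA items out of 100 are scored."""
    orig = sum(predict(it) == it["answer"] for it in items) / len(items)
    nota = sum(predict(to_nota(it)) == it["answer"] for it in items) / len(items)
    return orig, nota
```

A model that has memorized the original answer phrasing keeps hunting for it and misses that the correct move is now the NOTA option, which is exactly the drop the paper measures.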
All 6 models dropped on the NOTA set; the biggest hit was 38%, and even the reasoning models slipped.
That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic.
Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern use rather than clinical reasoning.