📢 MASSIVE: This new paper reports that GPT-5 (medium) now far exceeds pre-licensed human experts on medical reasoning and understanding benchmarks, by more than 20% on the multimodal track. GPT-5 beats human experts on MedXpertQA multimodal by 24.23% in reasoning and 29.40% in understanding, and on MedXpertQA text by 15.22% in reasoning and 9.40% in understanding. 🔥 It compares GPT-5 to actual human experts and claims the AI is ahead. GPT-5 is tested as a single, generalist system for medical question answering and visual question answering, using one simple, zero-shot chain-of-thought setup.

⚙️ The Core Concepts
The paper positions GPT-5 as a generalist multimodal reasoner for decision support, meaning it reads clinical text, looks at images, and reasons step by step under the same setup. The evaluation uses a unified protocol, so prompts, splits, and scoring are standardized to isolate model improvements rather than prompt tricks.

---

My take: The medical sector claims one of the biggest shares of national budgets across the globe, even in the USA, where it surpasses military spending. Once AI or robots can bring down costs, governments everywhere will quickly adopt them, because it is like gaining extra funds without sparking political controversy.
🧬 Bad news for medical LLMs. This paper finds that top medical AI models often match patterns instead of truly reasoning. Small wording tweaks cut accuracy by up to 38% on validated questions. The team took 100 MedQA questions, replaced the correct choice with "None of the other answers", then kept the 68 items where a clinician confirmed that switch as correct. If a model truly reasons, it should still reach the same clinical decision despite that label swap. They asked each model to explain its steps before answering and compared accuracy on the original versus modified items. All 6 models dropped on the NOTA set, with the biggest hit at 38%, and even the reasoning models slipped. That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic. Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern use rather than clinical reasoning.
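To picture the label-swap check they describe, here is a minimal sketch. The dict format for MedQA items and the ask_model callable are assumptions of mine, not the paper's code, and the clinician-validation filter that keeps only the 68 confirmed items is omitted.

```python
# Sketch of the NOTA robustness check: swap the correct option's text for
# "None of the other answers", then compare accuracy on original vs. modified items.
# Data format and ask_model are placeholder assumptions, not the paper's pipeline.

NOTA = "None of the other answers"

def make_nota_variant(item):
    """Copy a MedQA-style item, replacing the correct option's text with NOTA."""
    options = dict(item["options"])        # e.g. {"A": "...", "B": "...", ...}
    options[item["answer"]] = NOTA         # the correct letter stays the same
    return {"question": item["question"], "options": options, "answer": item["answer"]}

def accuracy(items, ask_model):
    """ask_model(question, options) should return a single answer letter."""
    hits = sum(ask_model(it["question"], it["options"]) == it["answer"] for it in items)
    return hits / len(items)

def nota_drop(items, ask_model):
    """Accuracy gap between the original items and their NOTA variants."""
    modified = [make_nota_variant(it) for it in items]
    return accuracy(items, ask_model) - accuracy(modified, ask_model)

if __name__ == "__main__":
    # Toy item and a toy "model" that always answers A, just to exercise the pipeline.
    demo = [{"question": "Example?", "options": {"A": "x", "B": "y"}, "answer": "A"}]
    print(nota_drop(demo, lambda q, opts: "A"))   # 0.0 for this trivial toy model
```

A model that genuinely reasons should show a nota_drop near zero; the paper's finding is that real systems drop by as much as 38%.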

Aug 30, 2025 · 7:40 AM UTC

The performance of GPT-5 vs. pre-licensed human experts:
🧪 The Benchmarks Text-only coverage includes MedQA, MMLU medical subsets, and official USMLE self‑assessment PDFs, which together check factual recall and clinical reasoning across many specialties. Multimodal coverage includes MedXpertQA MM, a harder set with images plus rich patient context, and VQA-RAD, a radiology set with 2,244 Q‑A pairs, 314 images, and 251 yes/no test cases. MedXpertQA in total spans 4,460 questions across 17 specialties and 11 body systems, which makes it useful for stress‑testing expert reasoning.
⌨️ The Prompting Setup Each example runs as a 2‑turn chat, first asking for a chain of thought rationale with “Let’s think step by step”, then forcing a final single‑letter choice. For visual questions, images are attached to the first turn as image_url, the reasoning stays free‑form, then the second turn again narrows to one letter, which makes scoring clean and comparable.
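For concreteness, here is roughly what that 2-turn protocol could look like with the OpenAI Python SDK. This is a sketch under assumptions: the model name, exact prompt wording, and letter extraction are mine, not the paper's released code.

```python
# Hypothetical reconstruction of the 2-turn zero-shot CoT setup described above.
# Assumes OPENAI_API_KEY is set in the environment; the model name is a placeholder.
import re
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"  # assumption; the paper evaluates "GPT-5 (medium)"

def ask(question: str, options: dict[str, str], image_url: str | None = None) -> str:
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    content = [{"type": "text", "text": prompt + "\n\nLet's think step by step."}]
    if image_url:  # visual questions attach the image to the first turn
        content.append({"type": "image_url", "image_url": {"url": image_url}})

    # Turn 1: free-form chain-of-thought rationale.
    messages = [{"role": "user", "content": content}]
    rationale = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content

    # Turn 2: force a single-letter final choice so scoring stays clean.
    messages += [
        {"role": "assistant", "content": rationale},
        {"role": "user", "content": "Therefore, among the options above, the answer is (reply with one letter only):"},
    ]
    final = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content
    match = re.search(r"[A-J]", final)     # simple letter extraction for scoring
    return match.group(0) if match else ""
```

Scoring then reduces to exact match of that letter against the answer key, which is what keeps the comparison clean across models.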
👩‍⚕️ Human Comparison Against pre‑licensed human experts on MedXpertQA, GPT‑4o trails by 5.03% to 15.90% on several axes, but GPT‑5 flips this, beating experts by +15.22% and +9.40% on text reasoning and understanding, and by +24.23% and +29.40% on multimodal reasoning and understanding. This shifts the model from human‑comparable to consistently above human on these standardized, time‑boxed evaluations.
🧪 A Representative Case The model links repeated vomiting, suprasternal crepitus, and CT findings, flags likely esophageal perforation, and recommends a Gastrografin swallow as the next step. It explains why alternatives like antiemetics or supportive care alone would miss a high‑risk condition, showing structured clinical reasoning, not just pattern matching.
🧭 What this all means A single prompting recipe, zero‑shot chain of thought plus a forced final letter, is enough to expose the gap between model generations in both text and image settings. The biggest jumps appear where the task needs multi‑hop reasoning across images and narrative context, exactly where older systems lagged.
Replying to @rohanpaul_ai
No they won't. Government is always the last to adopt anything
Even if they do it last, still will be huge.
Replying to @rohanpaul_ai
This is a huge leap. The real test will be seeing how this translates to real-world clinical workflows. The gap between benchmark performance and actual utility can be tricky to bridge.
Replying to @rohanpaul_ai
Rohan… first those in the medical field need to stop trashing AI. I agree. We need AI in healthcare. Badly. •
Replying to @rohanpaul_ai
Intriguing. What implications will this have for medical training and collaboration with AI?
Replying to @rohanpaul_ai
Your point about cost reduction is key, but the adoption curve won't be smooth. The biggest hurdle won't be the tech itself, but integrating it into existing clinical workflows and establishing clear liability frameworks. That's where the real political controversy will emerge.
Replying to @rohanpaul_ai
Don't worry. Doctors will fight this with policy.
Replying to @rohanpaul_ai
systems are interconnected: a tiny imbalance far away can ripple into local symptoms. The root cause may be remote and hidden, while the visible issue is just its echo. Treating symptoms without tracing causes risks missing the real problem.
Replying to @rohanpaul_ai
AI shifts the risk: advice without consequence is cheap, but execution without understanding is deadly. The advisor bears nothing, the actor bears everything. That imbalance is the real danger of AI in medicine — risk without responsibility.
Replying to @rohanpaul_ai
advice without consequence costs nothing to give — the advisor walks away. But execution lands on those who act. If you don’t understand what you’re executing, you carry full risk. With AI, that gap makes advice cheap, but outcomes deadly.
Replying to @rohanpaul_ai
Advice without consequence is cheap; execution without understanding is deadly. That’s the problem with AI “advisors” — no accountability, yet real-world costs fall on those who act. Benchmarks ≠ bedside, and medicine can’t afford this gap.
Replying to @rohanpaul_ai
Vibe Medicine 🩺
Replying to @rohanpaul_ai
So yesterday's paper can go in the bin?
Replying to @rohanpaul_ai
If ChatGPT can output "I don't know", then it is probably ready for use. We really want it to be 100% correct when it decides to answer, and to minimize the give-up rate (when the task is passed to a human). This enables co-working between AI and humans. The metric has to change.
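One way to make that concrete, as a sketch of a selective-answering metric (my illustration, not something from the thread or the paper): score accuracy only on the questions the model chooses to answer, and report the give-up rate separately.

```python
# Selective-answering metrics: accuracy on attempted questions plus give-up rate.
# "IDK" marks an explicit "I don't know"; the label format is an assumption.

def selective_metrics(predictions, answers, idk="IDK"):
    attempted = [(p, a) for p, a in zip(predictions, answers) if p != idk]
    give_up_rate = 1 - len(attempted) / len(predictions)
    accuracy_when_answering = (
        sum(p == a for p, a in attempted) / len(attempted) if attempted else 0.0
    )
    return accuracy_when_answering, give_up_rate

# Example: 3 questions answered (2 correct), 1 passed to a human.
print(selective_metrics(["A", "B", "IDK", "C"], ["A", "B", "D", "D"]))  # (0.666..., 0.25)
```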
Replying to @rohanpaul_ai
the brain can’t hold the full chaos of reality. It filters, simplifies, and makes straight-line stories so we can act. Useful for survival — but in complex systems, that compression hides the real, tangled causes.
Replying to @rohanpaul_ai
the mind craves simplicity — straight lines, clear causes, neat stories. But reality is tangled: networks, feedback loops, hidden delays. We compress to understand, but in doing so, we miss the deeper, messy webs where true causes live.
Replying to @rohanpaul_ai
invisible links don’t fit our mental models — they’re indirect, delayed, or counterintuitive. Both humans and AI default to the obvious surface patterns, so the true remote cause stays hidden until it shocks us later.
Replying to @rohanpaul_ai
complex systems don’t reveal their wiring. Causes can be nonlinear, delayed, and hidden in places we’d never expect. Mapping that chain takes deep insight, patience, and context — shortcuts, whether human or AI, usually miss the invisible links.
Replying to @rohanpaul_ai
Going from surface symptoms to remote root causes is hard because systems hide their links. What’s visible is local, but the true driver may be buried, indirect, and non-obvious. Mapping that chain takes deep insight — AI or human shortcuts often miss it.
Replying to @rohanpaul_ai
I know many with no real idea what’s going on, using AI to pretend they’ve got answers. It patches surface problems while ignoring deeper unknown root causes — far outside its scope. That’s not progress, that’s dangerous illusion.
Replying to @rohanpaul_ai
Biggest risk isn’t just AI — it’s those with the leverage of AI, executing with full accountability but no idea what’s going on. Advice without consequence is cheap; execution without understanding is deadly.
Replying to @rohanpaul_ai
Why have advisors with none? With AI, if you listen and execute, you bear the cost. That’s the huge problem: advice without accountability. Benchmarks can shine, but reality demands responsibility — especially in medicine.
Replying to @rohanpaul_ai
AI benchmarks impress, but in reality advice has consequences. Now we’ve got advisors with none — if you listen and execute, you bear the cost. Benchmarks ≠ bedside. Progress matters, but medicine needs accountability, not hype.
Replying to @rohanpaul_ai
AI can give advice, but advice is not reality. Benchmarks ≠ bedside. Letting AI guide real-world care without consequences is not to be taken lightly — lives aren’t test scores. Progress is exciting, but medicine demands more than accuracy.