🧬 Bad news for medical LLMs.
This paper finds that top medical AI models often match patterns instead of truly reasoning.
Small wording tweaks cut accuracy by up to 38% on validated questions.
The team took 100 MedQA questions, replaced the correct choice with "None of the other answers" (NOTA), then kept the 68 items where a clinician confirmed the substitution was still the correct answer.
If a model truly reasons, it should still reach the same clinical decision despite that label swap.
They asked each model to explain its steps before answering and compared accuracy on the original versus modified items.
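For context on the setup, here's a minimal Python sketch of that NOTA substitution and scoring loop. The item format and the `ask` callable are assumptions for illustration, not the paper's actual harness.

```python
from typing import Callable

NOTA = "None of the other answers"

def make_nota_variant(item: dict) -> dict:
    """Swap the text of the correct option for NOTA, keeping the same
    letter as the gold answer (the paper then had a clinician confirm
    each swapped item was still answerable)."""
    options = dict(item["options"])
    options[item["answer"]] = NOTA
    return {**item, "options": options}

def accuracy(ask: Callable[[str], str], items: list[dict]) -> float:
    """Prompt for step-by-step reasoning, then score the returned letter.
    `ask` is assumed to send a prompt to a model and return its final
    answer letter ("A".."E")."""
    hits = 0
    for item in items:
        choices = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
        prompt = (
            "Explain your reasoning step by step, then state the letter "
            f"of your final answer.\n\n{item['question']}\n{choices}"
        )
        hits += ask(prompt).strip().upper() == item["answer"]
    return hits / len(items)

# A genuine reasoner should show a small gap; a pattern-matcher won't:
# gap = accuracy(ask, originals) - accuracy(ask, [make_nota_variant(q) for q in originals])
```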
All 6 models dropped on the NOTA set; the biggest hit was 38%, and even the reasoning models slipped.
That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic.
Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern use rather than clinical reasoning.
Another concerning finding on AI use in medicine.
AI assistance boosted detection during AI-guided cases, but when the same doctors later worked without AI, their detection rate fell from 28.4% before any AI exposure to 22.4% after.
The study, by researchers from Poland, Norway, Sweden, the U.K., and Japan, examines this de-skilling effect of AI.
While the AI is active, it boosts the adenoma detection rate (ADR) by 12.5%, which could translate into lives saved.
The problem is that without AI, detection falls below the level doctors had before they ever used it, according to research published in The Lancet Gastroenterology & Hepatology.
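A quick back-of-the-envelope makes the size of that drop concrete (the figures come from the study summary above; the variable names are mine):

```python
# Reported adenoma detection rates (ADR), in percent.
adr_before = 28.4  # detection before any AI exposure
adr_after = 22.4   # detection without AI assistance, after AI exposure

absolute_drop = adr_before - adr_after             # 6.0 percentage points
relative_drop = absolute_drop / adr_before * 100   # ~21% relative decline

print(f"{absolute_drop:.1f} pp absolute, {relative_drop:.0f}% relative drop")
```

In other words, unassisted detection fell by roughly a fifth relative to the pre-AI baseline.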
The study raises questions about AI in healthcare: when it helps, and when it could hurt.