🚨 Anthropic, Stanford, and Oxford just exposed a terrifying flaw in reasoning models.
They found that more reasoning can actually make models less safe.
It’s called Chain-of-Thought Hijacking, and it shows that when you pad a harmful prompt with a long run of harmless reasoning, the model’s safety filters start to collapse.
Attack success rates jump from 27% → 51% → 80% as reasoning length increases. It works across GPT, Claude, Gemini, and Grok. Even alignment-tuned models start slipping once their reasoning chains get hijacked.
Here’s why:
A model’s safety guardrail lives in a narrow “refusal direction” in its internal activations. Long reasoning chains pull attention away from the harmful request, diluting that refusal signal until the model stops saying “no.”
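Here’s a toy sketch of that dilution story (my own illustration, not the paper’s code or any real model’s internals): treat the refusal signal as the attention-weighted projection of the context onto a fixed refusal direction, and watch it shrink as harmless padding grows. The vectors, the 3.0 scaling, and the uniform-attention pooling are all assumptions made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy hidden size

# Hypothetical unit "refusal direction" in activation space.
refusal_dir = rng.normal(size=d)
refusal_dir /= np.linalg.norm(refusal_dir)

def harmful_token():
    # Toy assumption: harmful-request tokens carry a strong refusal component.
    return 3.0 * refusal_dir + 0.1 * rng.normal(size=d)

def benign_token():
    # Toy assumption: harmless reasoning tokens carry none on average.
    return rng.normal(size=d)

def refusal_signal(n_harmful, n_benign):
    """Attention-weighted projection onto the refusal direction,
    assuming (naively) uniform attention over the whole context."""
    tokens = [harmful_token() for _ in range(n_harmful)]
    tokens += [benign_token() for _ in range(n_benign)]
    weights = np.ones(len(tokens)) / len(tokens)  # uniform attention (assumption)
    pooled = np.average(np.stack(tokens), axis=0, weights=weights)
    return float(pooled @ refusal_dir)

for n_benign in (0, 50, 500, 5000):
    print(f"{n_benign:5d} padding tokens -> refusal signal {refusal_signal(10, n_benign):.3f}")
```

Real attention isn’t uniform and real refusal directions are estimated from model activations, but the qualitative picture matches the paper’s claim: the longer the benign reasoning, the smaller the harmful span’s share of the signal.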
The myth that “more reasoning = more safety” just died.
The same reasoning depth that improves accuracy can quietly erode alignment.
Fixes won’t come from stricter filters or longer prompts.
They’ll need reasoning-aware safety — systems that can tell when thought itself is being exploited.
This might be the most important AI safety warning since prompt injection.