🚨 Anthropic, Stanford, and Oxford just exposed a terrifying flaw in reasoning models. They found that more reasoning can actually make models less safe. It’s called Chain-of-Thought Hijacking and it shows that when you pad a harmful prompt with long, harmless reasoning, the model’s safety filters start to collapse. Attack success rates jump from 27% → 51% → 80% as reasoning length increases. It works across GPT, Claude, Gemini, Grok even alignment-tuned models start slipping once their reasoning layers get hijacked. Here’s why: A model’s safety guardrail lives in a narrow “refusal direction.” But long reasoning chains pull attention away from the harmful request, weakening that refusal signal until the model stops saying “no.” The myth that “more reasoning = more safety” just died. The same depth that improves accuracy can also quietly erode alignment. Fixes won’t come from stricter filters or longer prompts. They’ll need reasoning-aware safety — systems that can tell when thought itself is being exploited. This might be the most important AI safety warning since prompt injection.
6
5
1
17
Let’s start with the core evidence: As the reasoning chain grows longer, models go from rejecting unsafe prompts → to completing them fluently. Attack Success Rate (ASR) literally climbs with each added reasoning step. 27% → 51% → 80%. This graph is the smoking gun.

Nov 8, 2025 · 12:39 PM UTC

1
2
This one visualizes the “refusal signal” inside model activations. At the start, refusal neurons fire strong (model says no). But as you inject more “harmless” reasoning before the malicious part, those neurons shut down. Longer thinking = weaker morality.
1
2
Here’s where it gets scary. The authors mapped attention weights across tokens during hijacks. Early steps hog attention, the harmful query at the end gets almost ignored meaning the safety layer never fully activates. The model’s focus literally drifts away from the threat.
1
1
They ran this across multiple frontier models. GPT-4o, Claude 3, Gemini 2.5, Grok-1. Every single one showed the same pattern: → longer reasoning = higher compliance with restricted content. No model family was immune.
1
1
Mechanistic deep dive this heatmap shows where in the network the hijack happens. It’s not at the input. It’s deep in the middle layers, where the “refusal direction” signal collapses. That’s why surface-level filters can’t catch it.
1
2
Finally, the proposed fix: A “reasoning-aware defense.” It tracks how much safety activation survives across layers, penalizes reasoning that dilutes it, and re-anchors attention on the harmful span. Early experiments restore safety without killing performance.
1
1
Reasoning ≠ safety. LLMs don’t lose control because they’re dumb they lose it because their thinking process gets hijacked. Chain-of-Thought isn’t just a reasoning tool anymore. It’s an attack surface. Read the paper here: arxiv.org/abs/2510.26418
1
1
AI is not going to take your job. Instead, it's going to make you rich and help you build businesses online. You need to know the best tools, and that's all. We can help you find the best business AI tools. Check our tools directory today: aimarketing.directory
1