Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

Brady Long

@thisguyknowsai

Nov 8

🚨 Anthropic, Stanford, and Oxford just exposed a terrifying flaw in reasoning models. They found that more reasoning can actually make models less safe. It’s called Chain-of-Thought Hijacking and it shows that when you pad a harmful prompt with long, harmless reasoning, the model’s safety filters start to collapse. Attack success rates jump from 27% → 51% → 80% as reasoning length increases. It works across GPT, Claude, Gemini, Grok even alignment-tuned models start slipping once their reasoning layers get hijacked. Here’s why: A model’s safety guardrail lives in a narrow “refusal direction.” But long reasoning chains pull attention away from the harmful request, weakening that refusal signal until the model stops saying “no.” The myth that “more reasoning = more safety” just died. The same depth that improves accuracy can also quietly erode alignment. Fixes won’t come from stricter filters or longer prompts. They’ll need reasoning-aware safety — systems that can tell when thought itself is being exploited. This might be the most important AI safety warning since prompt injection.

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

Let’s start with the core evidence: As the reasoning chain grows longer, models go from rejecting unsafe prompts → to completing them fluently. Attack Success Rate (ASR) literally climbs with each added reasoning step. 27% → 51% → 80%. This graph is the smoking gun.

Nov 8, 2025 · 12:39 PM UTC

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

This one visualizes the “refusal signal” inside model activations. At the start, refusal neurons fire strong (model says no). But as you inject more “harmless” reasoning before the malicious part, those neurons shut down. Longer thinking = weaker morality.

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

Here’s where it gets scary. The authors mapped attention weights across tokens during hijacks. Early steps hog attention, the harmful query at the end gets almost ignored meaning the safety layer never fully activates. The model’s focus literally drifts away from the threat.

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

They ran this across multiple frontier models. GPT-4o, Claude 3, Gemini 2.5, Grok-1. Every single one showed the same pattern: → longer reasoning = higher compliance with restricted content. No model family was immune.

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

Mechanistic deep dive this heatmap shows where in the network the hijack happens. It’s not at the input. It’s deep in the middle layers, where the “refusal direction” signal collapses. That’s why surface-level filters can’t catch it.

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

Finally, the proposed fix: A “reasoning-aware defense.” It tracks how much safety activation survives across layers, penalizes reasoning that dilutes it, and re-anchors attention on the harmful span. Early experiments restore safety without killing performance.

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

Reasoning ≠ safety. LLMs don’t lose control because they’re dumb they lose it because their thinking process gets hijacked. Chain-of-Thought isn’t just a reasoning tool anymore. It’s an attack surface. Read the paper here: arxiv.org/abs/2510.26418

Chain-of-Thought Hijacking

Large reasoning models (LRMs) achieve higher task performance by allocating more inference-time compute, and prior works suggest this scaled reasoning may also strengthen safety by improving...

arxiv.org

Brady Long · Nov 8, 2025 · 12:39 PM UTC

Brady Long

@thisguyknowsai

Nov 8

AI is not going to take your job. Instead, it's going to make you rich and help you build businesses online. You need to know the best tools, and that's all. We can help you find the best business AI tools. Check our tools directory today: aimarketing.directory

AI Tools For Business | AI Marketing Directory

A resource for marketers to find AI tools and platforms to work smarter, faster, and cheaper. All in one place and ready for you to explore!

aimarketing.directory