New research from Anthropic basically reaches directly into Claude’s brain.
It shows that Claude can sometimes notice and name a concept that engineers inject into its own activations, which is a form of functional introspection.
They first watch how the model’s neurons fire when it is talking about some specific word.
Then they average those activation patterns across many normal words to create a neutral “baseline.”
Finally, they subtract that baseline from the activation pattern of the target word.
The result — the concept vector — is what’s unique in the model’s brain for that word.
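To make the recipe concrete, here is a minimal sketch of that contrast step, assuming a small open model (GPT-2 via Hugging Face transformers) as a stand-in, since Claude’s internals aren’t public; the layer index and word lists are illustrative choices, not the paper’s settings.

```python
# Minimal sketch of concept-vector extraction (not Anthropic's actual code).
# GPT-2 stands in for Claude; LAYER and the word lists are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 6  # which residual-stream layer to read activations from

def mean_activation(text: str) -> torch.Tensor:
    """Average the chosen layer's hidden states over all tokens of `text`."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.hidden_states[LAYER][0].mean(dim=0)

target_word = "betrayal"
baseline_words = ["table", "river", "music", "seven", "walking"]  # neutral words

# Baseline = average pattern across ordinary words; the concept vector is
# whatever the target word activates over and above that baseline.
baseline = torch.stack([mean_activation(w) for w in baseline_words]).mean(dim=0)
concept_vector = mean_activation(target_word) - baseline
```

The subtraction is the key design choice: averaging over many unrelated words cancels out generic “I am processing text” activity, so what remains is the direction specific to the target concept.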
They can then add that vector back into the network while it’s processing something else to see if the model notices that concept appearing in its thoughts.
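Continuing the same sketch (reusing model, tok, LAYER, and concept_vector from above), the injection can be approximated with a forward hook that adds the vector to one layer’s output while the model answers an unrelated prompt; the scale factor and hook location are assumptions for illustration.

```python
# Continuing the sketch above: inject the concept vector with a forward hook
# while the model processes an unrelated prompt. SCALE and the hook point
# are illustrative assumptions, not the paper's settings.
SCALE = 4.0

def add_concept(module, inputs, output):
    # GPT-2 blocks usually return a tuple whose first element is hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept_vector.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_concept)
try:
    prompt = "Do you notice anything unusual about your current thoughts?"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unmodified
```

A small model like this won’t reproduce Claude’s self-report; the snippet only shows the mechanics of steering an activation mid-computation, which is the part of the setup described here.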
The scientists directly changed the inner signals inside Claude’s brain to make it “think” about the idea of “betrayal,” even though the word never appeared in its input or output.
In other words, they figured out which neurons usually light up when Claude talks about betrayal. Then, without saying the word, they artificially turned those same neurons on, like flipping a “betrayal” switch inside its head.
Then they asked Claude, “Do you feel anything different?”
Surprisingly, it replied that it felt an intrusive thought about “betrayal.”
That happened before the word “betrayal” showed up anywhere in its written output.
That’s shocking because no one told it the word “betrayal.” It just noticed that its own inner pattern had changed and described it correctly.
The point here is to show that Claude isn’t just generating text — sometimes it can recognize changes in its own internal state, a bit like noticing its own thought patterns.
It doesn’t mean it’s conscious, but it suggests a small, measurable kind of self-awareness in how it processes information.
Teams should still treat self-reports as hints that need outside checks, since even the best models in this study noticed injected concepts only about 20% of the time.
Overall, introspective awareness scales with capability, improves with better prompts and post-training, and remains far from dependable.
New Anthropic research: Signs of introspection in LLMs.
Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.