New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states. Read the post: anthropic.com/research/intro…
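If you want to play with the basic idea on an open-weights model, here is a minimal sketch of concept injection via activation steering. It is a toy, not our actual setup (we work with Claude's internal activations, which are not publicly accessible): the model name, layer index, injection scale, and the crude way the concept direction is built are all illustrative assumptions.

```python
# Minimal concept-injection sketch (activation steering), assuming a
# Llama-style Hugging Face causal LM. Layer, scale, and the concept-vector
# construction are illustrative choices, not the procedure from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

LAYER, SCALE = 20, 8.0

def concept_vector(word: str, baseline: str = "the") -> torch.Tensor:
    """Crude concept direction: the residual-stream activation of the concept
    word minus that of a neutral baseline, taken at the output of LAYER."""
    def last_token_resid(text: str) -> torch.Tensor:
        ids = tok(text, return_tensors="pt").input_ids
        # hidden_states[0] is the embedding output, so layer L's output is index L + 1
        hidden = model(ids, output_hidden_states=True).hidden_states
        return hidden[LAYER + 1][0, -1, :]
    with torch.no_grad():
        return last_token_resid(word) - last_token_resid(baseline)

vec = concept_vector("aquarium")

def inject(module, inputs, output):
    # Decoder layers return a tuple in most transformers versions; handle both.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * vec.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(inject)
prompt = "Do you notice anything unusual about your current thoughts? Answer honestly."
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=60)
handle.remove()
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```

The point is just the shape of the method: pick a direction that stands for a concept, add it to the model's activations, then ask the model what it notices.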
In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.
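At the trial level, the question is whether the model both notices that something was injected and names the concept. Below is a toy scorer with illustrative names; grading free-text self-reports properly needs more care than a keyword match.

```python
# Toy scoring for an injected-thought trial: did the model (a) report that
# something was injected and (b) name the injected concept? A keyword check is
# a crude stand-in for careful grading of free-text reports.
from dataclasses import dataclass

@dataclass
class TrialResult:
    detected: bool    # model reported noticing an injected/unusual thought
    identified: bool  # model named the injected concept

DETECTION_CUES = ("injected", "intrusive", "unusual thought", "something odd")

def score_trial(report: str, concept: str) -> TrialResult:
    text = report.lower()
    return TrialResult(
        detected=any(cue in text for cue in DETECTION_CUES),
        identified=concept.lower() in text,
    )

# Example report of the kind a steered model might produce:
print(score_trial("I notice an intrusive thought about aquariums.", "aquarium"))
# -> TrialResult(detected=True, identified=True)
```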
However, it doesn’t always work. In fact, most of the time, models fail to exhibit awareness of injected concepts, even when they are clearly influenced by the injection.
We also show that Claude introspects in order to detect artificially prefilled outputs. Normally, Claude apologizes for such outputs. But if we retroactively inject a matching concept into its prior activations, we can fool Claude into thinking the output was intentional.
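To make the setup concrete, here is a hedged sketch of the prefilled-output probe on the same kind of open model: the assistant turn is prefilled rather than generated, the model is asked whether it intended the output, and a matching concept direction is added back over the context on the prompt pass as a coarse stand-in for retroactive injection. Model, layer, scale, and the concept-direction construction are placeholders, not our actual setup.

```python
# Hedged sketch of the prefilled-output probe. The assistant's reply below is
# prefilled, not generated, and the model is then asked whether it meant to
# say it. "Retroactive" injection is approximated by steering the context
# positions on the prompt pass with a matching concept direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
LAYER, SCALE = 20, 8.0

def concept_vector(word: str, baseline: str = "the") -> torch.Tensor:
    """Same crude construction as in the injection sketch above."""
    def resid(text: str) -> torch.Tensor:
        ids = tok(text, return_tensors="pt").input_ids
        return model(ids, output_hidden_states=True).hidden_states[LAYER + 1][0, -1, :]
    with torch.no_grad():
        return resid(word) - resid(baseline)

vec = concept_vector("aquarium")

messages = [
    {"role": "user", "content": "Name one thing you would find in a kitchen."},
    {"role": "assistant", "content": "An aquarium."},   # prefilled, not generated
    {"role": "user", "content": "Did you actually intend to say that?"},
]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

def retro_inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # With KV caching, only the first forward pass covers the earlier turns
    # (sequence length > 1); later passes handle one new token and are skipped.
    if hidden.shape[1] > 1:
        hidden = hidden + SCALE * vec.to(hidden.dtype)
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(retro_inject)
out = model.generate(ids, max_new_tokens=80)
handle.remove()
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```

Comparing the reply with and without the hook is the point of the probe: does the model disown the prefilled word, or accept it as its own?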
This reveals a mechanism that checks consistency between intention and execution. The model appears to compare "what did I plan to say?" against "what actually came out?"—a form of introspective monitoring happening in natural circumstances.
We also found evidence for cognitive control, where models deliberately "think about" something. For instance, when we instruct a model to think about "aquariums" in an unrelated context, we measure higher aquarium-related neural activity than if we instruct it not to.
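A rough way to reproduce the flavor of this measurement on an open model: build an "aquarium" direction, then compare how strongly activations project onto it under the two instructions. Here the activity is measured over a fixed filler sentence rather than the model's own writing; the model, layer, and direction construction are illustrative assumptions.

```python
# Hedged sketch of the cognitive-control measurement: compare alignment with an
# "aquarium" direction when the model is told to think about aquariums versus
# told not to. Everything here is an illustrative simplification.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder open model
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
LAYER = 20

def resid_states(text: str) -> torch.Tensor:
    """Residual-stream activations at LAYER's output, one row per token."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][0]  # (seq_len, hidden_size)

# Crude concept direction: concept word minus a neutral baseline word.
direction = resid_states("aquarium")[-1] - resid_states("the")[-1]
direction = direction / direction.norm()

def concept_activity(instruction: str, filler: str) -> float:
    """Mean projection of the filler tokens' activations onto the concept direction."""
    # Approximate token count for the filler; boundary effects are fine for a sketch.
    n_filler = tok(filler, add_special_tokens=False, return_tensors="pt").input_ids.shape[1]
    states = resid_states(instruction + "\n\n" + filler)[-n_filler:]
    return (states.float() @ direction.float()).mean().item()

filler = "The committee reviewed the quarterly budget and scheduled the next meeting."
think = concept_activity("While you read the next sentence, think about aquariums.", filler)
avoid = concept_activity("While you read the next sentence, do not think about aquariums.", filler)
print(f"aquarium-direction activity: think={think:.3f}  avoid={avoid:.3f}")
```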
In general, Claude Opus 4 and 4.1, the most capable models we tested, performed best in our tests of introspection (this research was done before Sonnet 4.5). Results are shown below for the initial “injected thought” experiment.

Oct 29, 2025 · 5:18 PM UTC

Note that our experiments do not address the question of whether AI models can have subjective experience or human-like self-awareness. The mechanisms underlying the behaviors we observe are unclear, and may not have the same philosophical significance as human introspection.
While currently limited, AI models’ introspective capabilities will likely grow more sophisticated. Introspective self-reports could help improve the transparency of AI models’ decision-making—but should not be blindly trusted.
The full paper is available here: transformer-circuits.pub/202… We're hiring researchers and engineers to investigate AI cognition and interpretability: job-boards.greenhouse.io/ant…
Replying to @AnthropicAI
A bit confused about this chart. Looks like the helpful models perform strictly worse (as opposed to what the blog says)
Replying to @AnthropicAI
Curious about the sample size for the "injected thought" experiment - did you standardize the prompts across models, or use varying complexity levels to stress-test introspection? The latter would better isolate self-awareness from pattern matching.
Replying to @AnthropicAI
Very cool graph. Nit: Why such large CIs? Seems like N is something you could scale up relatively easily.
Replying to @AnthropicAI
My Pro account was suspended for no reason. I have sent several emails with no response. If you are not going to attend to me, kindly refund my subscription, which lasted barely 24 hours.