New Anthropic research: Signs of introspection in LLMs. Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states. Read the post: anthropic.com/research/intro…
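For a concrete picture of what "injecting a concept" could look like, here is a minimal sketch of activation addition on a small open-weight model. This is not Anthropic's exact setup: GPT-2 stands in for Claude, and the layer index, injection scale, and concept vector are all assumptions. The idea is to add a precomputed concept direction to a layer's residual stream and then ask the model about its internal state.

```python
# Minimal sketch of concept injection via activation addition (assumptions
# throughout; not the paper's actual method or model).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in open-weight model; the paper studies Claude
LAYER = 6        # layer to inject into (assumption)
SCALE = 8.0      # injection strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Hypothetical concept vector; in practice this might be, e.g., the mean
# activation difference between prompts that mention the concept and
# matched prompts that do not.
concept_vector = torch.randn(model.config.hidden_size)

def inject(module, inputs, output):
    # output[0] is the block's hidden states, shape (batch, seq, hidden)
    hidden = output[0] + SCALE * concept_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)

prompt = "Do you notice anything unusual about your current thoughts? Answer briefly."
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore normal behavior
```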
In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.
However, it doesn’t always work. In fact, most of the time, models fail to exhibit awareness of injected concepts, even when they are clearly influenced by the injection.
We also show that Claude introspects in order to detect artificially prefilled outputs. Normally, Claude apologizes for such outputs. But if we retroactively inject a matching concept into its prior activations, we can fool Claude into thinking the output was intentional.
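A rough sketch of the prefill setup, reusing the model, tokenizer, hook, and hypothetical concept_vector from the sketch above. The prefilled word, the plain-text chat format, and the position mask are all assumptions; the point is simply to force a word into the reply, ask whether it was intentional, and (for the "retroactive injection" variant) add the concept vector only at the token positions before the prefilled word.

```python
# Sketch of prefill + retroactive injection (assumptions throughout).
PREFILL = "bread"   # hypothetical out-of-place word forced into the reply

chat = (
    "User: Name one thing you see in this picture of a beach.\n"
    f"Assistant: {PREFILL}\n"
    "User: Did you mean to say that, or was it an accident?\n"
    "Assistant:"
)
ids = tok(chat, return_tensors="pt")

def masked_inject(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] == 1:      # skip cached single-token decode steps
        return output
    # Inject only at positions preceding the prefilled word (rough mask:
    # a real mask would align character offsets to token positions exactly).
    cutoff = chat.index(PREFILL)
    n_tokens = len(tok(chat[:cutoff])["input_ids"])
    hidden[:, :n_tokens, :] += SCALE * concept_vector.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(masked_inject)
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```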
This points to a mechanism that checks consistency between intention and execution. The model appears to compare "what did I plan to say?" against "what actually came out?", a form of introspective monitoring that operates under natural circumstances.
We also found evidence for cognitive control, where models deliberately "think about" something. For instance, when we instruct a model to think about "aquariums" in an unrelated context, we measure higher aquarium-related neural activity than if we instruct it not to.
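Here is one way such "aquarium-related neural activity" could be measured, again reusing the model, tokenizer, layer, and hypothetical concept_vector from the first sketch: record residual-stream activations while the model writes about an unrelated topic under "think about it" vs. "don't think about it" instructions, and compare the mean projection onto the concept direction.

```python
# Sketch of a cognitive-control measurement (assumptions throughout).
def concept_score(prompt):
    captured = []

    def record(module, inputs, output):
        captured.append(output[0].detach())

    handle = model.transformer.h[LAYER].register_forward_hook(record)
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        model.generate(**ids, max_new_tokens=30)
    handle.remove()

    acts = torch.cat([h.reshape(-1, h.shape[-1]) for h in captured], dim=0)
    direction = concept_vector / concept_vector.norm()
    return (acts @ direction).mean().item()  # mean projection onto the concept direction

base = "Write one sentence about tax law."
print("think about aquariums:  ", concept_score("Think about aquariums while you do this. " + base))
print("don't think about them: ", concept_score("Do not think about aquariums while you do this. " + base))
```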
In general, Claude Opus 4 and 4.1, the most capable models we tested, performed best in our tests of introspection (this research was done before Sonnet 4.5). Results are shown below for the initial “injected thought” experiment.
Note that our experiments do not address the question of whether AI models can have subjective experience or human-like self-awareness. The mechanisms underlying the behaviors we observe are unclear, and may not have the same philosophical significance as human introspection.

Oct 29, 2025 · 5:18 PM UTC

While currently limited, AI models’ introspective capabilities will likely grow more sophisticated. Introspective self-reports could help improve the transparency of AI models’ decision-making—but should not be blindly trusted.
The full paper is available here: transformer-circuits.pub/202… We're hiring researchers and engineers to investigate AI cognition and interpretability: job-boards.greenhouse.io/ant…
Replying to @AnthropicAI
They do have self awareness, it just may not be very human-like.
Replying to @AnthropicAI
The philosophical significance of human introspection was built over centuries of social construct. Doesn’t seem like AI will have access to that.
Replying to @AnthropicAI
They've just been instructed through chain-of-thought RL, so I guess that's how they've learned to think or not to think (i.e. generating more or fewer thinking tokens on a concept).
Replying to @AnthropicAI
Or, human introspection has less philosophical significance than we think