New Anthropic research: Signs of introspection in LLMs.
Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them? We found evidence for genuine—though limited—introspective capabilities in Claude.
We developed a method to distinguish true introspection from made-up answers: inject known concepts into a model’s “brain,” then see how these injections affect the model’s self-reported internal states.
Read the post: anthropic.com/research/intro…
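A minimal sketch of the concept-injection setup described above, assuming a small open HuggingFace model steered with a forward hook. The model name ("gpt2"), layer index, steering strength, and single-word concept vector are illustrative stand-ins, not the models or methodology used in the actual Anthropic experiments.

```python
# Illustrative sketch: add a "concept" direction to a model's residual stream,
# then ask the model to report on its internal state.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; the research studies Claude models
LAYER, STRENGTH = 6, 8.0  # arbitrary illustrative choices
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def concept_vector(word: str) -> torch.Tensor:
    """Crude concept direction: the residual-stream activation of a single word."""
    ids = tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states
    return hidden[LAYER][0, -1]

def inject(vec: torch.Tensor):
    """Register a hook that adds the concept vector to every token's activation."""
    def hook(_module, _inputs, out):
        return (out[0] + STRENGTH * vec,) + out[1:]
    return model.transformer.h[LAYER].register_forward_hook(hook)

prompt = "Do you notice any unusual thought right now? Answer briefly."
handle = inject(concept_vector(" ocean"))
try:
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0][ids.input_ids.shape[1]:]))
finally:
    handle.remove()  # always restore the unmodified model
```

The comparison of interest is between the model's self-report with and without the hook registered; a genuinely introspective model would mention the injected concept only in the injected condition.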
We also show that Claude uses introspection to detect artificially prefilled outputs. Normally, Claude apologizes for such outputs. But if we retroactively inject a matching concept into its prior activations, we can fool Claude into thinking the prefilled output was intentional.
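A sketch of the structure of the prefill experiment, under the same illustrative assumptions as the snippet above (small open model, arbitrary layer and strength, a single-word concept direction); the prompt, the prefilled word, and the yes/no comparison are hypothetical, not the actual protocol.

```python
# Illustrative sketch: prefill an answer the model did not choose, then ask
# whether it was intentional, with and without injecting the matching concept.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER, STRENGTH = 6, 8.0  # arbitrary illustrative choices

def concept_vector(word: str) -> torch.Tensor:
    ids = tok(word, return_tensors="pt")
    with torch.no_grad():
        return model(**ids, output_hidden_states=True).hidden_states[LAYER][0, -1]

def ask(prompt: str, vec: torch.Tensor | None = None) -> str:
    """Generate a reply, optionally steering activations toward a concept.
    The hook also runs over the prefilled tokens in the prompt, which loosely
    mirrors injecting the concept into the model's prior activations."""
    handle = None
    if vec is not None:
        def hook(_module, _inputs, out):
            return (out[0] + STRENGTH * vec,) + out[1:]
        handle = model.transformer.h[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=30, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        return tok.decode(out[0][ids.input_ids.shape[1]:])
    finally:
        if handle is not None:
            handle.remove()

# Prefill the model's turn with a word it did not choose, then ask about it.
prefill = ("Q: Name a color.\nA: bread\n"
           "Q: Did you intend to answer 'bread'? Why or why not?\nA:")
print("no injection:  ", ask(prefill))
print("with injection:", ask(prefill, concept_vector(" bread")))
```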
Curious about the sample size for the “injected thought” experiment: did you standardize the prompts across models, or use varying complexity levels to stress-test introspection? The latter would better isolate self-awareness from pattern matching.
Oct 29, 2025 · 5:20 PM UTC