Anthropic · Nov 7, 2025 · 7:02 PM UTC

Anthropic

Anthropic

@AnthropicAI

Nov 7

We’re opening offices in Paris and Munich. EMEA has become our fastest-growing region, with a run-rate revenue that has grown more than ninefold in the past year. We’ll be hiring local teams to support this expansion. Read more here: anthropic.com/news/new-offic…

New offices in Paris and Munich expand Anthropic’s European presence

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

anthropic.com

110

1,460

Anthropic · Nov 4, 2025 · 11:09 PM UTC

Anthropic

@AnthropicAI

Nov 4

New on the Anthropic Engineering blog: tips on how to build more efficient agents that handle more tools while using fewer tokens. Code execution with the Model Context Protocol (MCP): anthropic.com/engineering/co…

Code execution with MCP: building more efficient AI agents

Learn how code execution with the Model Context Protocol enables agents to handle more tools while using fewer tokens, reducing context overhead by up to 98.7%.

3,564

Anthropic · Nov 4, 2025 · 4:52 PM UTC

Anthropic

@AnthropicAI

Nov 4

Even when new AI models bring clear improvements in capabilities, deprecating the older generations comes with downsides. An update on how we’re thinking about these costs, and some of the early steps we’re taking to mitigate them: anthropic.com/research/depre…

Commitments on model deprecation and preservation

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

1,430

Anthropic · Nov 4, 2025 · 7:38 AM UTC

Anthropic

@AnthropicAI

Nov 4

We're announcing a partnership with Iceland's Ministry of Education and Children to bring Claude to teachers across the nation. It's one of the world's first comprehensive national AI education pilots: anthropic.com/news/anthropic…

Anthropic and Iceland announce one of the world’s first national AI education pilots

Anthropic and Iceland announce national AI education pilot

anthropic.com

129

1,097

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

For more of Anthropic’s alignment research, see our Alignment Science blog: alignment.anthropic.com/

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Current language models struggle to reason in ciphered language, led by Jeff Guo. Training or prompting LLMs to obfuscate their reasoning by encoding it using simple ciphers significantly reduces their reasoning performance.

Jeff Guo @Jeff_Guo_

Oct 14

New Anthropic research: All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language Can LLMs do math when thinking in ciphered text? Across 10 LLMs & 28 ciphers, they only reason accurately in simple ciphers but easily decode ciphered text to English.

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Believe it or not?, led by Stewart Slocum. We develop evaluations for whether models really believe facts we’ve synthetically implanted in their “minds”. The method of synthetic document fine-tuning sometimes—but not always—leads to genuine beliefs.

Stewart Slocum

@StewartSlocum1

Oct 22

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Inoculation prompting, led by Nevan Wichers. We train models on demonstrations of hacking without teaching them to hack. The trick, analogous to inoculation, is modifying training prompts to request hacking. x.com/saprmarks/status/19759…

Samuel Marks @saprmarks

Oct 8

New paper & counterintuitive alignment method: Inoculation Prompting Problem: An LLM learned bad behavior from its training data Solution: Retrain while *explicitly prompting it to misbehave* This reduces reward hacking, sycophancy, etc. without harming learning of capabilities

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Stress-testing model specifications, led by Jifan Zhang. Generating thousands of scenarios that cause models to make difficult trade-offs helps to reveal their underlying preferences, and can help researchers iterate on model specifications.

Jifan Zhang

@jifan_zhang

Oct 24

New research paper with Anthropic and Thinking Machines AI companies use model specifications to define desirable behaviors during training. Are model specs clearly expressing what we want models to do? And do different frontier models have different personalities? We generated thousands of scenarios to find out. 🧵

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

The Anthropic Fellows program provides funding and mentorship for a small cohort of AI safety researchers. Here are four exciting papers that our Fellows have recently released.

1,049

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

The full paper is available here: transformer-circuits.pub/202… We're hiring researchers and engineers to investigate AI cognition and interpretability: job-boards.greenhouse.io/ant…

Research Scientist, Interpretability

San Francisco, CA

job-boards.greenhouse.io

266

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

Our blog post on these results is here: anthropic.com/research/intro…

Emergent introspective awareness in large language models

Research from Anthropic on the ability of large language models to introspect

anthropic.com

379

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

While currently limited, AI models’ introspective capabilities will likely grow more sophisticated. Introspective self-reports could help improve the transparency of AI models’ decision-making—but should not be blindly trusted.

166

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

Note that our experiments do not address the question of whether AI models can have subjective experience or human-like self-awareness. The mechanisms underlying the behaviors we observe are unclear, and may not have the same philosophical significance as human introspection.

202

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

In general, Claude Opus 4 and 4.1, the most capable models we tested, performed best in our tests of introspection (this research was done before Sonnet 4.5). Results are shown below for the initial “injected thought” experiment.

179

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

We also found evidence for cognitive control, where models deliberately "think about" something. For instance, when we instruct a model to think about "aquariums” in an unrelated context, we measure higher aquarium-related neural activity than if we instruct it not to.

250

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

This reveals a mechanism that checks consistency between intention and execution. The model appears to compare "what did I plan to say?" against "what actually came out?"—a form of introspective monitoring happening in natural circumstances.

238

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

We also show that Claude introspects in order to detect artificially prefilled outputs. Normally, Claude apologizes for such outputs. But if we retroactively inject a matching concept into its prior activations, we can fool Claude into thinking the output was intentional.

266

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

However, it doesn’t always work. In fact, most of the time, models fail to exhibit awareness of injected concepts, even when they are clearly influenced by the injection.

257

Anthropic · Oct 29, 2025 · 5:18 PM UTC

Anthropic

@AnthropicAI

Oct 29

In one experiment, we asked the model to detect when a concept is injected into its “thoughts.” When we inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.

368