Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

Anthropic

@AnthropicAI

Nov 4

The Anthropic Fellows program provides funding and mentorship for a small cohort of AI safety researchers. Here are four exciting papers that our Fellows have recently released.

1,049

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Stress-testing model specifications, led by Jifan Zhang. Generating thousands of scenarios that cause models to make difficult trade-offs helps to reveal their underlying preferences, and can help researchers iterate on model specifications.

Jifan Zhang

@jifan_zhang

Oct 24

New research paper with Anthropic and Thinking Machines AI companies use model specifications to define desirable behaviors during training. Are model specs clearly expressing what we want models to do? And do different frontier models have different personalities? We generated thousands of scenarios to find out. 🧵

Nov 4, 2025 · 12:32 AM UTC

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Inoculation prompting, led by Nevan Wichers. We train models on demonstrations of hacking without teaching them to hack. The trick, analogous to inoculation, is modifying training prompts to request hacking. x.com/saprmarks/status/19759…

Samuel Marks @saprmarks

Oct 8

New paper & counterintuitive alignment method: Inoculation Prompting Problem: An LLM learned bad behavior from its training data Solution: Retrain while *explicitly prompting it to misbehave* This reduces reward hacking, sycophancy, etc. without harming learning of capabilities

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Believe it or not?, led by Stewart Slocum. We develop evaluations for whether models really believe facts we’ve synthetically implanted in their “minds”. The method of synthetic document fine-tuning sometimes—but not always—leads to genuine beliefs.

Stewart Slocum

@StewartSlocum1

Oct 22

Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

Current language models struggle to reason in ciphered language, led by Jeff Guo. Training or prompting LLMs to obfuscate their reasoning by encoding it using simple ciphers significantly reduces their reasoning performance.

Jeff Guo @Jeff_Guo_

Oct 14

New Anthropic research: All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language Can LLMs do math when thinking in ciphered text? Across 10 LLMs & 28 ciphers, they only reason accurately in simple ciphers but easily decode ciphered text to English.

Anthropic · Nov 4, 2025 · 12:32 AM UTC

Anthropic

@AnthropicAI

Nov 4

For more of Anthropic’s alignment research, see our Alignment Science blog: alignment.anthropic.com/