The Anthropic Fellows program provides funding and mentorship for a small cohort of AI safety researchers. Here are four exciting papers that our Fellows have recently released.
49
92
7
1,049
Stress-testing model specifications, led by Jifan Zhang. Generating thousands of scenarios that cause models to make difficult trade-offs helps to reveal their underlying preferences, and can help researchers iterate on model specifications.
New research paper with Anthropic and Thinking Machines AI companies use model specifications to define desirable behaviors during training. Are model specs clearly expressing what we want models to do? And do different frontier models have different personalities? We generated thousands of scenarios to find out. 🧵

Nov 4, 2025 · 12:32 AM UTC

2
3
2
48
Inoculation prompting, led by Nevan Wichers. We train models on demonstrations of hacking without teaching them to hack. The trick, analogous to inoculation, is modifying training prompts to request hacking. x.com/saprmarks/status/19759…
New paper & counterintuitive alignment method: Inoculation Prompting Problem: An LLM learned bad behavior from its training data Solution: Retrain while *explicitly prompting it to misbehave* This reduces reward hacking, sycophancy, etc. without harming learning of capabilities
2
4
39
Believe it or not?, led by Stewart Slocum. We develop evaluations for whether models really believe facts we’ve synthetically implanted in their “minds”. The method of synthetic document fine-tuning sometimes—but not always—leads to genuine beliefs.
Techniques like synthetic document fine-tuning (SDF) have been proposed to modify AI beliefs. But do AIs really believe the implanted facts? In a new paper, we study this empirically. We find: 1. SDF sometimes (not always) implants genuine beliefs 2. But other techniques do not
1
2
37
Current language models struggle to reason in ciphered language, led by Jeff Guo. Training or prompting LLMs to obfuscate their reasoning by encoding it using simple ciphers significantly reduces their reasoning performance.
New Anthropic research: All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language Can LLMs do math when thinking in ciphered text? Across 10 LLMs & 28 ciphers, they only reason accurately in simple ciphers but easily decode ciphered text to English.
6
11
87
For more of Anthropic’s alignment research, see our Alignment Science blog: alignment.anthropic.com/
2
2
68