ML Researcher @AnthropicAI. Previously OpenAI & DeepMind. Optimizing for a post-AGI future where humanity flourishes. Opinions aren't my employer's.

San Francisco, USA
Joined March 2016
I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.
Jan Leike retweeted
At what is possibly a risk to my whole career I will say: this doesn't seem great. Lately I have been describing my role as something like a "public advocate" so I'd be remiss if I didn't share some thoughts for the public on this. Some thoughts in thread...
One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵
83
138
58
1,703
Jan Leike retweeted
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
Jan Leike retweeted
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
44
173
34
1,475
I'm really proud of the team, it was awesome to see how all of this came together!
4
33
We’ve been getting great value from automated auditing (bar chart in the first post), providing a fast feedback loop on alignment mitigations. This was also the first time we’ve been auditing a production model based on model internals. Practical applications of interp research!
1
1
36
Nevertheless, eval awareness is very important to keep an eye on going forward!
1
1
20
2. We steered with various SAE features related to eval awareness. This can reduce the alignment scores of the model, but mostly by making it more helpful-only instead of making it deceptive. For all the steering vectors we tried, Sonnet 4.5 was still more aligned than Sonnet 4
1
23
1. When manually reviewing the transcripts the model calls out as fake, we judged them to be pretty obviously fake, giving us no reason to believe the model is superhuman at eval awareness.
1
23
Noticeably, Sonnet 4.5 verbalizes eval awareness much more than previous models. Does that invalidate our results? We did an audit based on model internals and the answer is “probably a little, but mostly not.”
2
1
1
36
Sonnet 4.5 is out! It’s the most aligned frontier model yet; a lot of progress relative to Sonnet 4 and Opus 4.1!
Yet SB 53 is under siege by Meta through another $15m super PAC.
4
1
107
Though federal AI regulation is currently off the table, there is promising state regulation: e.g. SB 53 is pretty lightweight (asking for transparency like model cards and safety frameworks) and doesn’t put much burden on AI companies.
5
2
108
Now is a great time to develop AI regulation because we can already see how the risks of the technology are going to develop, but we still have time to regulate it thoughtfully and carefully.
3
5
1
114
Most major industries that deal with risks are heavily regulated; nuclear power, health care, airplanes, etc. Regulation of powerful AI is going to happen, but if we postpone it might come too late, get rushed, or go too far.
6
1
1
115
Meta, OpenAI, and Google have spoken out against regulation proposal SB 1047, and lobbied in favor of banning all state regulation on AI, despite federal regulation being nowhere in sight.
3
3
108
AI companies and their investors stand to gain a lot from not being held accountable to the harm they cause in the world, and they are well-capitalized to influence policy-making.
3
9
1
152
They plan to use the highly successful playbook from the pro-crypto super PAC Fairshake. Here is how it works: Instead of running campaign ads on AI directly (most voters don’t care enough), they run ads in support of candidates who are against AI regulation or against candidates who are pro AI regulation, on topics unrelated to AI that voters care about.
3
5
2
197
Bad news for AI safety: To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.
The fellows program has been very successful, so we want to scale it up and we're looking for someone to run it. This is a pretty impactful role and you don't need to be very technical. Plus Ethan is amazing to work with!
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
5
7
119
If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!
We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.