Jan Leike · May 28, 2024 · 4:50 PM UTC

Jan Leike

Pinned Tweet

Jan Leike

@janleike

28 May 2024

I'm excited to join @AnthropicAI to continue the superalignment mission! My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research. If you're interested in joining, my dms are open.

392

497

306

8,524

Joshua Achiam · Oct 10, 2025 · 4:44 PM UTC

Jan Leike retweeted

Joshua Achiam

@jachiam0

Oct 10

At what is possibly a risk to my whole career I will say: this doesn't seem great. Lately I have been describing my role as something like a "public advocate" so I'd be remiss if I didn't share some thoughts for the public on this. Some thoughts in thread...

Nathan Calvin

@_NathanCalvin

Oct 10

One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI. I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened: 🧵

138

1,703

Sam Bowman · Sep 29, 2025 · 5:47 PM UTC

Jan Leike retweeted

Sam Bowman @sleepinyourhat

Sep 29

[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...

139

GIF

Jack Lindsey · Sep 29, 2025 · 6:36 PM UTC

Jan Leike retweeted

Jack Lindsey @Jack_W_Lindsey

Sep 29

Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)

173

1,475

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

Sep 29

I'm really proud of the team, it was awesome to see how all of this came together!

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

Sep 29

We’ve been getting great value from automated auditing (bar chart in the first post), providing a fast feedback loop on alignment mitigations. This was also the first time we’ve been auditing a production model based on model internals. Practical applications of interp research!

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

Sep 29

Nevertheless, eval awareness is very important to keep an eye on going forward!

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

Sep 29

2. We steered with various SAE features related to eval awareness. This can reduce the alignment scores of the model, but mostly by making it more helpful-only instead of making it deceptive. For all the steering vectors we tried, Sonnet 4.5 was still more aligned than Sonnet 4

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

Sep 29

1. When manually reviewing the transcripts the model calls out as fake, we judged them to be pretty obviously fake, giving us no reason to believe the model is superhuman at eval awareness.

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

Sep 29

Noticeably, Sonnet 4.5 verbalizes eval awareness much more than previous models. Does that invalidate our results? We did an audit based on model internals and the answer is “probably a little, but mostly not.”

Jan Leike · Sep 29, 2025 · 6:32 PM UTC

Jan Leike

@janleike

Sep 29

Sonnet 4.5 is out! It’s the most aligned frontier model yet; a lot of progress relative to Sonnet 4 and Opus 4.1!

656

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

Yet SB 53 is under siege by Meta through another $15m super PAC.

107

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

Though federal AI regulation is currently off the table, there is promising state regulation: e.g. SB 53 is pretty lightweight (asking for transparency like model cards and safety frameworks) and doesn’t put much burden on AI companies.

108

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

Now is a great time to develop AI regulation because we can already see how the risks of the technology are going to develop, but we still have time to regulate it thoughtfully and carefully.

114

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

Most major industries that deal with risks are heavily regulated; nuclear power, health care, airplanes, etc. Regulation of powerful AI is going to happen, but if we postpone it might come too late, get rushed, or go too far.

115

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

Meta, OpenAI, and Google have spoken out against regulation proposal SB 1047, and lobbied in favor of banning all state regulation on AI, despite federal regulation being nowhere in sight.

108

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

AI companies and their investors stand to gain a lot from not being held accountable to the harm they cause in the world, and they are well-capitalized to influence policy-making.

152

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

They plan to use the highly successful playbook from the pro-crypto super PAC Fairshake. Here is how it works: Instead of running campaign ads on AI directly (most voters don’t care enough), they run ads in support of candidates who are against AI regulation or against candidates who are pro AI regulation, on topics unrelated to AI that voters care about.

197

Jan Leike · Sep 19, 2025 · 7:04 PM UTC

Jan Leike

@janleike

Sep 19

Bad news for AI safety: To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.

112

113

1,325

Jan Leike · Sep 4, 2025 · 6:52 PM UTC

Jan Leike

@janleike

Sep 4

The fellows program has been very successful, so we want to scale it up and we're looking for someone to run it. This is a pretty impactful role and you don't need to be very technical. Plus Ethan is amazing to work with!

Ethan Perez

@EthanJPerez

Sep 4

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

119

Jan Leike · Aug 14, 2025 · 12:07 AM UTC

Jan Leike

@janleike

Aug 14

If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!

Anthropic

@AnthropicAI

Jul 29

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

345