Sonnet 4.5 is out! It’s the most aligned frontier model yet, with a lot of progress relative to Sonnet 4 and Opus 4.1!
Notably, Sonnet 4.5 verbalizes eval awareness much more than previous models do. Does that invalidate our results? We ran an audit based on model internals, and the answer is “probably a little, but mostly not.”
1. When we manually reviewed the transcripts the model called out as fake, we judged them to be pretty obviously fake, giving us no reason to believe the model is superhuman at eval awareness.
2. We steered with various SAE features related to eval awareness. This can reduce the model’s alignment scores, but mostly by making it more helpful-only rather than making it deceptive. For all the steering vectors we tried, Sonnet 4.5 remained more aligned than Sonnet 4.
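For anyone curious what “steering with an SAE feature” can look like mechanically, here is a minimal sketch using a PyTorch forward hook. Everything in it (the model/layer layout, the `sae` decoder, the feature index, and the coefficient) is an illustrative assumption, not the actual Sonnet 4.5 audit setup.

```python
import torch

def add_steering_hook(model, direction, layer, coeff):
    """Register a forward hook that adds `coeff * direction` to the
    residual stream at `layer`. Returns the handle for later removal."""
    direction = direction / direction.norm()  # unit-norm feature direction

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the
        # hidden states; add the scaled direction to every token position.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return model.model.layers[layer].register_forward_hook(hook)

# Illustrative usage (all names and values are hypothetical):
# direction = sae.decoder.weight[:, FEATURE_IDX]  # one SAE feature's decoder direction
# handle = add_steering_hook(model, direction, layer=20, coeff=8.0)
# output = model.generate(**inputs, max_new_tokens=256)
# handle.remove()  # detach the hook so later generations are unaffected
```

The sign and magnitude of the coefficient control whether the feature direction is amplified or suppressed, which is what lets you compare alignment scores with and without the intervention.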
Nevertheless, eval awareness is very important to keep an eye on going forward!
We’ve been getting great value from automated auditing (see the bar chart in the first post), which provides a fast feedback loop on alignment mitigations. This was also the first time we’ve audited a production model based on model internals. Practical applications of interp research!

Sep 29, 2025 · 6:32 PM UTC

I'm really proud of the team; it was awesome to see how all of this came together!