Sonnet 4.5 is out! It’s the most aligned frontier model yet, with a lot of progress relative to Sonnet 4 and Opus 4.1!
Notably, Sonnet 4.5 verbalizes eval awareness much more than previous models do. Does that invalidate our results? We ran an audit based on model internals, and the answer is “probably a little, but mostly not.”
1. When we manually reviewed the transcripts the model called out as fake, we judged them to be pretty obviously fake, giving us no reason to believe the model is superhuman at eval awareness.
2. We steered with various SAE features related to eval awareness. This can reduce the model’s alignment scores, but mostly by making it more helpful-only rather than making it deceptive. For all the steering vectors we tried, Sonnet 4.5 remained more aligned than Sonnet 4.
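For anyone curious what “steering with an SAE feature” can look like mechanically, here is a minimal sketch using a PyTorch forward hook. Everything in it (the model/layer layout, the `sae` decoder, the feature index, and the coefficient) is an illustrative assumption, not the actual Sonnet 4.5 audit setup.

```python
import torch

def add_steering_hook(model, direction, layer, coeff):
    """Register a forward hook that adds `coeff * direction` to the
    residual stream at `layer`. Returns the handle for later removal."""
    direction = direction / direction.norm()  # unit-norm feature direction

    def hook(module, inputs, output):
        # Decoder layers typically return a tuple whose first element is the
        # hidden states; add the scaled direction to every token position.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return model.model.layers[layer].register_forward_hook(hook)

# Illustrative usage (all names and values are hypothetical):
# direction = sae.decoder.weight[:, FEATURE_IDX]  # one SAE feature's decoder direction
# handle = add_steering_hook(model, direction, layer=20, coeff=8.0)
# output = model.generate(**inputs, max_new_tokens=256)
# handle.remove()  # detach the hook so later generations are unaffected
```

The sign and magnitude of the coefficient control whether the feature direction is amplified or suppressed, which is what lets you compare alignment scores with and without the intervention.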
Nevertheless, eval awareness is very important to keep an eye on going forward!
We’ve been getting great value from automated auditing (see the bar chart in the first post), which provides a fast feedback loop on alignment mitigations. This was also the first time we’ve audited a production model based on model internals. Practical applications of interp research!

Sep 29, 2025 · 6:32 PM UTC

I'm really proud of the team; it was awesome to see how all of this came together!