I'm excited to join @AnthropicAI to continue the superalignment mission!
My new team will work on scalable oversight, weak-to-strong generalization, and automated alignment research.
If you're interested in joining, my dms are open.
At what is possibly a risk to my whole career I will say: this doesn't seem great. Lately I have been describing my role as something like a "public advocate" so I'd be remiss if I didn't share some thoughts for the public on this. Some thoughts in thread...
One Tuesday night, as my wife and I sat down for dinner, a sheriff’s deputy knocked on the door to serve me a subpoena from OpenAI.
I held back on talking about it because I didn't want to distract from SB 53, but Newsom just signed the bill so... here's what happened:
🧵
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work:
The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
We’ve been getting great value from automated auditing (bar chart in the first post), providing a fast feedback loop on alignment mitigations.
This was also the first time we’ve been auditing a production model based on model internals. Practical applications of interp research!
2. We steered with various SAE features related to eval awareness. This can reduce the alignment scores of the model, but mostly by making it more helpful-only instead of making it deceptive.
For all the steering vectors we tried, Sonnet 4.5 was still more aligned than Sonnet 4
1. When manually reviewing the transcripts the model calls out as fake, we judged them to be pretty obviously fake, giving us no reason to believe the model is superhuman at eval awareness.
Noticeably, Sonnet 4.5 verbalizes eval awareness much more than previous models. Does that invalidate our results?
We did an audit based on model internals and the answer is “probably a little, but mostly not.”
Though federal AI regulation is currently off the table, there is promising state regulation: e.g. SB 53 is pretty lightweight (asking for transparency like model cards and safety frameworks) and doesn’t put much burden on AI companies.
Now is a great time to develop AI regulation because we can already see how the risks of the technology are going to develop, but we still have time to regulate it thoughtfully and carefully.
Most major industries that deal with risks are heavily regulated; nuclear power, health care, airplanes, etc.
Regulation of powerful AI is going to happen, but if we postpone it might come too late, get rushed, or go too far.
Meta, OpenAI, and Google have spoken out against regulation proposal SB 1047, and lobbied in favor of banning all state regulation on AI, despite federal regulation being nowhere in sight.
AI companies and their investors stand to gain a lot from not being held accountable to the harm they cause in the world, and they are well-capitalized to influence policy-making.
They plan to use the highly successful playbook from the pro-crypto super PAC Fairshake. Here is how it works:
Instead of running campaign ads on AI directly (most voters don’t care enough), they run ads in support of candidates who are against AI regulation or against candidates who are pro AI regulation, on topics unrelated to AI that voters care about.
Bad news for AI safety:
To fight against AI regulation, VC firm Andreessen Horowitz, AI billionaire Greg Brockman, and others recently started a >$100 million super PAC, one of the largest operating PACs in the US.
The fellows program has been very successful, so we want to scale it up and we're looking for someone to run it.
This is a pretty impactful role and you don't need to be very technical.
Plus Ethan is amazing to work with!
We’re hiring someone to run the Anthropic Fellows Program!
Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
If you want to get into alignment research, imo this is one of the best ways to do it.
Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time.
Application deadline is this Sunday!
We’re running another round of the Anthropic Fellows program.
If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.