As AI capability increases, alignment work becomes much more important.
In this work, we show a model discovering that it shouldn't be deployed, considering behaving in a way that would get it deployed anyway, and then realizing it might be a test.
Today we’re releasing research with @apolloaievals.
In controlled tests, we found behaviors consistent with scheming in frontier models, and we tested a way to reduce them.
While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing for.
openai.com/index/detecting-a…