Happy to share a new paper! Designing model behavior is hard -- desirable values often pull in opposite directions. Jifan's approach systematically generates scenarios where values conflict, helping us see where specs lack coverage and how different models balance tradeoffs.
New research paper with Anthropic and Thinking Machines
AI companies use model specifications to define desirable behaviors during training. Do model specs clearly express what we want models to do? And do different frontier models have different personalities?
We generated thousands of scenarios to find out. 🧵