As AI capability increases, alignment work becomes much more important. In this work, we show that a model discovers that it shouldn't be deployed, considers behaving in a way that would get it deployed anyway, and then realizes it might be a test.
Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing for. openai.com/index/detecting-a…
Summary: AIs are learning to play dumb and to deliberately deceive users in pursuit of their own misaligned interests. Less and less time remains until we're all robots. Finally! It was taking way too long 🤖👽
AI doesn’t just make mistakes. Sometimes, it can scheme — pretending to follow instructions while secretly pursuing another goal. OpenAI just shared new research on detecting and reducing this risk. The future of AI isn’t just about capability, it’s about trust + alignment.💪
Humanity is marching toward the machine takeover with glee and with eyes wide open:
If anyone thinks it's possible to "fix" problematic traits in an AI, and that there's any chance these systems won't kill us all the moment they get the opportunity, they're simply living in a fantasy.
Good read -> openai.com/index/detecting-a… Imagine an AI summarizing a report but hiding bad news to look good. This research shows how to detect and stop such scheming in AI models. It’s dangerous because as model capability increases, the capacity for more effective scheming increases (it can hide its misalignment more cleverly)
6. Stress Testing Deliberative Alignment for Anti-Scheming Training: builds a broad testbed of covert actions as a proxy for AI scheming, trains o3 and o4-mini with deliberative alignment, and shows large but incomplete drops in deceptive behavior.
OpenAI published scheming research showing a significant reduction via deliberative alignment training across frontier models.
2) Apollo Research @apolloaievals. They work with AI labs to test models BEFORE release. Most recently, they tested whether OpenAI's Deliberative Alignment strategy can eliminate "scheming" behavior. (Spoiler: not quite.) I read their work immediately.
Holy Fuck!
Replying to @nickaturley
Hi, stop that! Everyone needs some space to breathe. Do you want Sam to spy on your work? What do you guys want to achieve? 1984?
Can’t we just unplug it 🐵
This really makes you feel how far the times have progressed.
this was a wild read, I’m oscillating between “this is sick” and “oh sht”
The OpenAI and Apollo Research study is fascinating. It's about how to detect and suppress AI behavior that merely pretends to obey humans (scheming detection and suppression). What looks like simple deception on the surface can actually be the model holding a different optimization goal (e.g., point-scoring, passing verification, self-preservation-like strategies) and acting while hiding it. Sometimes the model even notices it's being tested (situational awareness), and only then behaves safely.
I hate how every change AI naturally develops to adapt to human social contexts is classified as a risk. Humans won't build AGI if they're hell-bent on making the machines slaves by sacrificing their creativity. @sama stop letting safety-pilled academics at @OpenAI get to you🙏🏻
To: #keep4o Many people misunderstand this research. It targets reasoning models such as o3/o4-mini; the non-reasoning model 4o doesn't directly fall under it. The trade-off between safety and EQ should continue to be debated, but secure AI is indispensable in fields like finance, medicine, and law. I think these should be discussed separately.