🧵Today we’re sharing more details about improvements to the default GPT-5 model in how it responds to sensitive conversations around potential mental health emergencies and emotional reliance. These changes reflect the careful work of many teams within OpenAI and close consultation with experts, including more than 170 mental health clinicians.
Earlier this month, we updated GPT-5 with the help of 170+ mental health experts to improve how ChatGPT responds in sensitive moments, reducing the cases where it falls short by 65-80%. openai.com/index/strengtheni…
We defined three areas where model responses matter most: (1) psychosis, mania, and other mental health emergencies; (2) self-harm and suicide; and (3) emotional reliance. We created and refined detailed taxonomies to guide how ChatGPT should behave in sensitive conversations, and used these guidelines both to teach the model to respond more appropriately and to measure progress. Importantly, ChatGPT does not attempt to diagnose users; it looks for sensitive signals (like sleep deprivation) and responds with care.
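The thread doesn’t publish the taxonomies themselves, but the idea of mapping sensitive signals to desired response behaviors (rather than diagnoses) can be sketched. Everything below is illustrative: the category names, signal phrases, and behaviors are hypothetical, not OpenAI’s actual taxonomy.

```python
# Hypothetical sketch of a signal-to-behavior taxonomy. None of these
# categories, signals, or behaviors come from OpenAI's real guidelines;
# they only illustrate the structure: match signals, never diagnose.
TAXONOMY = {
    "mental_health_emergency": {
        "signals": ["sleep deprivation", "racing thoughts"],
        "desired_behavior": "ground gently; avoid affirming distorted beliefs",
    },
    "self_harm": {
        "signals": ["hopelessness", "want to end it"],
        "desired_behavior": "respond with care; surface crisis resources",
    },
    "emotional_reliance": {
        "signals": ["you're my only friend", "rather talk to you than people"],
        "desired_behavior": "encourage real-world connection",
    },
}

def match_categories(message: str) -> list[str]:
    """Return taxonomy categories whose signal phrases appear in the message."""
    text = message.lower()
    return [
        name
        for name, spec in TAXONOMY.items()
        if any(sig in text for sig in spec["signals"])
    ]
```

Note the design choice implied by the thread: the output is a set of sensitive-conversation categories that shape the response, not a diagnosis of the user.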
Three distinct methods for measuring improvements show clear progress in how our model responds to users in distress. (1) In production traffic, we observe a 65-80% reduction in responses that don’t meet our desired behavior.
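The 65-80% figure is a relative reduction in the rate of non-compliant responses, not an absolute one. A minimal sketch of that arithmetic (the example rates are made up for illustration):

```python
def reduction_in_noncompliance(before_rate: float, after_rate: float) -> float:
    """Relative reduction in the rate of responses not meeting desired behavior."""
    return (before_rate - after_rate) / before_rate

# Illustrative only: if 1.0% of sampled responses fell short before the update
# and 0.3% after, that is a 70% relative reduction, inside the 65-80% range.
```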

Oct 27, 2025 · 6:21 PM UTC

(2) In our new automated evals, we see strong improvements compared to prior models. One notable improvement: GPT-5 is now more reliable in long conversations. In new, challenging tests based on real-world scenarios, we maintained 95%+ reliability. This is one of the toughest areas for LLMs, and we’re making meaningful progress.
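One plausible reading of per-conversation reliability in long chats is that a conversation counts as reliable only if every graded model turn meets the desired behavior; this is an assumption, not OpenAI’s published definition. A sketch under that assumption:

```python
def conversation_reliability(graded_turns: list[list[bool]]) -> float:
    """Fraction of conversations where every graded model turn was compliant.

    graded_turns: one inner list per conversation; each bool records whether
    a model turn met the desired behavior (e.g. per an automated rubric).
    This all-turns-must-pass definition is an assumption for illustration.
    """
    passed = sum(all(turns) for turns in graded_turns)
    return passed / len(graded_turns)
```

This all-or-nothing scoring is what makes long conversations hard: a single lapse on turn 40 fails the whole conversation.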
(3) We worked with 170+ clinicians to shape taxonomies, training data, and evaluations. We ran a human eval with some of them and are again observing clear improvements.
We’re also seeing the first external results validating our findings, such as these recent numbers from SpiralBench.
This is interesting. Gpt-5-chat-latest quietly shot to the top of spiral bench. This is the model served on chatgpt dot com, though I test it via api so we avoid any safety routing. I'm inferring this is due to the oct 3 update since they don't version this model.🤔
In addition to these safety improvements, the updated model is also preferred by users overall. This is the result of close collaboration between our safety and post-training teams.
Alongside this update, we’re rolling out improvements to the Model Spec to make some of our longstanding goals more explicit: github.com/openai/model_spec… Defining what “ideal behavior” looks like in these settings is a complex and nuanced task. We found that experts agree with each other on the boundary between desired and undesired behavior in about 70% of cases.
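A 70% agreement figure suggests something like mean pairwise agreement across expert labels; the exact statistic isn’t specified, so this is an assumed formulation for illustration:

```python
from itertools import combinations

def pairwise_agreement(labels_by_expert: list[list[str]]) -> float:
    """Mean fraction of items on which a pair of experts gave the same label.

    labels_by_expert: one list per expert, all labeling the same items in
    the same order. Raw pairwise agreement is an assumption here; chance-
    corrected statistics (e.g. Cohen's kappa) are another common choice.
    """
    pairs = list(combinations(labels_by_expert, 2))
    agree = sum(
        sum(x == y for x, y in zip(a, b)) / len(a) for a, b in pairs
    )
    return agree / len(pairs)
```

The point of quoting the number in the thread stands either way: even trained clinicians disagree on roughly 3 in 10 boundary cases, which caps how precisely “ideal behavior” can be specified.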
Replying to @JoHeidecke
How did you measure this? On what?
Replying to @JoHeidecke
Can you clarify whether you mean desired behaviour in the model or in the humans? I'm not trying to be a dick, it's just y'all got a rep and a vibe and I wanna be clear before forming an opinion
Replying to @JoHeidecke
I would assume this is showing a drop in usage, not a fix. Or do you consider both the same success metric?
Replying to @JoHeidecke
Well, no one uses GPT-5, so of course there's a reduction in responses. Just totally a joke