Yesterday we launched ChatGPT Atlas, our new web browser. In Atlas, ChatGPT agent can get things done for you. We’re excited to see how this feature makes work and day-to-day life more efficient and effective for people.
ChatGPT agent is powerful and helpful, and designed to be safe, but it can still make (sometimes surprising!) mistakes, like trying to buy the wrong product or forgetting to check-in with you before taking an important action.
One emerging risk we are very thoughtfully researching and mitigating is prompt injections, where attackers hide malicious instructions in websites, emails, or other sources, to try to trick the agent into behaving in unintended ways. The objective for attackers can be as simple as trying to bias the agent’s opinion while shopping, or as consequential as an attacker trying to get the agent to fetch and leak private data, such as sensitive information from your email, or credentials.
Our long-term goal is that you should be able to trust ChatGPT agent to use your browser, the same way you’d trust your most competent, trustworthy, and security-aware colleague or friend. We’re working hard to achieve that. For this launch, we’ve performed extensive red-teaming, implemented novel model training techniques to reward the model for ignoring malicious instructions, implemented overlapping guardrails and safety measures, and added new systems to detect and block such attacks. However, prompt injection remains a frontier, unsolved security problem, and our adversaries will spend significant time and resources to find ways to make ChatGPT agent fall for these attacks.
To protect our users, and to help improve our models against these attacks:
1. We’ve prioritized rapid response systems to help us quickly identify block attack campaigns as we become aware of them.
2. We are also continuing to invest heavily in security, privacy, and safety - including research to improve the robustness of our models, security monitors, infrastructure security controls, and other techniques to help prevent these attacks via defense in depth.
3. We’ve designed Atlas to give you controls to help protect yourself. We have added a feature to allow ChatGPT agent to take action on your behalf, but without access to your credentials called “logged out mode”. We recommend this mode when you don’t need to take action within your accounts. Today, we think “logged in mode” is most appropriate for well-scoped actions on very trusted sites, where the risks of prompt injection are lower. Asking it to add ingredients to a shopping cart is generally safer than a broad or vague request like “review my emails and take whatever actions are needed.”
4. When agent is operating on sensitive sites, we have also implemented a "Watch Mode" that alerts you to the sensitive nature of the site and requires you have the tab active to watch the agent do its work. Agent will pause if you move away from the tab with sensitive information. This ensures you stay aware - and in control - of what agent actions the agent is performing.
Over time, we plan to add more features, guardrails, and safety controls to enable ChatGPT agent to work safely and securely across both individual and enterprise workflows.
New levels of intelligence and capability require the technology, society, the risk mitigation strategy to co-evolve. And as with computer viruses in the early 2000s, we think it’s important for everyone to understand responsible usage, including thinking about prompt injection attacks, so we can all learn to benefit from this technology safely.
We are excited to see how ChatGPT agent will empower your workflows in Atlas, and are resolute in our mission to build the most secure, private, and safe AI technologies for the benefit of all humanity.