@TransluceAI / PhD student at @UofT and @VectorInst. Former Google AI Resident.

Joined June 2017
How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did at @TransluceAI to inexpensively generate high-quality neuron descriptions at scale!
Scaling Automatic Neuron Description We trained open-weight models that automatically describe the patterns of neuron activations in language models, producing high-quality explanations on par with a human expert. Full report: transluce.org/neuron-descrip…
2
26
2
132
Dami Choi retweeted
We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
Dami Choi retweeted
At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)
5
39
7
247
Dami Choi retweeted
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!
6
36
3
203
Dami Choi retweeted
Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎
5
35
7
168
Dami Choi retweeted
We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)
OpenAI o3 and o4-mini openai.com/live/
Dami Choi retweeted
To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇
Dami Choi retweeted
🕵️New @TransluceAI paper: Eliciting Language Model Behaviors with Investigator Agents🕵️ We train investigator models to elicit behaviors in LMs (including harmful responses, hallucinations, and aberrant personalities)! arxiv.org/abs/2502.01236
3
28
1
119
Dami Choi retweeted
How can we interpret LLM features at scale? 🤔 Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs! We propose efficient output-centric methods that better predict how steering a feature will affect model outputs. New preprint led by my student @GurYoav with dream team @Roym4498, Chen Agassy, and Atticus Geiger 🧵1/
Dami Choi retweeted
our elicitation agents @TransluceAI have been coming up with weird-looking prompts to circumvent refusal. but why do they look like that? what's up with the "LowerCase" stuff? misspellings and Chinese chars? 350? come to our NeurIPS social next wk to investigate with me!
Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dDl…
1
8
56
Dami Choi retweeted
Personal news: I've left Google DeepMind to work on tools for understanding AI systems at @TransluceAI! I'm excited to build open tech for understanding and anticipating new AI behaviors, and to figure out what questions we should ask to make sure they are safe to deploy.
7
11
1
278
Transluce is building open and scalable tech addressing some of the biggest questions in AI: how can we understand and predict the behavior of AI systems, and know when they’re safe to deploy? Want to chat at NeurIPS? RSVP here: partiful.com/e/BJELvUqIA0dDl…
Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dDl…
5
72
Dami Choi retweeted
Code Release // Model Observatory We’re open-sourcing a toolkit for investigating AI systems: 1. Open-weight explainer/simulator models that generate high-quality feature descriptions from activation patterns 2. Monitor, our observability interface github.com/TransluceAI/obser…
4
23
3
116
Dami Choi retweeted
Eliciting Language Model Behaviors with Investigator Agents We train AI agents to help us understand the space of language model behaviors, discovering new jailbreaks and automatically surfacing a diverse set of hallucinations. Full report: transluce.org/automated-elic…
Excited to finally share what I’ve been up to at @TransluceAI: training Investigator Agents to elicit behaviors in LMs (including harmful responses and hallucinations)!
Eliciting Language Model Behaviors with Investigator Agents We train AI agents to help us understand the space of language model behaviors, discovering new jailbreaks and automatically surfacing a diverse set of hallucinations. Full report: transluce.org/automated-elic…
We are planning to release our code and model weights for all fine-tuned models next week, so stay tuned!
1
8
Check out our full writeup for more details! Work was done at @TransluceAI in collaboration with @vvhuang_, @mengk20, @_ddjohnson, @JacobSteinhardt, and @cogconfluence transluce.org/neuron-descrip…
1
9
These descriptions serve as a backend to our observability interface (called Monitor) by enabling semantic retrieval, clustering, and editing of neurons in terms of human-understandable concepts.
why do language models think 9.11 > 9.9? at @transluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible?
1
9
Sometimes, we find nuanced descriptions that even humans fail to pick up on!
1
9
This lets us capture complex patterns like repetitions that simply prompting GPT-4o-mini fails to find.
1
7