Dami Choi · Oct 23, 2024 · 11:13 PM UTC

Dami Choi

Pinned Tweet

Dami Choi @damichoi95

23 Oct 2024

How do we explain the activation patterns of neurons in language models like Llama? I'm excited to share work that we did at @TransluceAI to inexpensively generate high-quality neuron descriptions at scale!

Transluce

@TransluceAI

23 Oct 2024

Scaling Automatic Neuron Description We trained open-weight models that automatically describe the patterns of neuron activations in language models, producing high-quality explanations on par with a human expert. Full report: transluce.org/neuron-descrip…

132

Transluce · Sep 25, 2025 · 5:11 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

Sep 25

We’re open-sourcing Docent under an Apache 2.0 license. Check out our public codebase to self-host Docent, peek under the hood, or open issues & pull requests! The hosted version remains the easiest way to get started with one click and use Docent with zero maintenance overhead.

Transluce

@TransluceAI

Aug 26

Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like “is my model reward hacking” or “where does it violate instructions.” Today, anyone can get started with just a few lines of code!

Transluce · Sep 3, 2025 · 5:01 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

Sep 3

At Transluce, we train investigator agents to surface specific behaviors in other models. Can this approach scale to frontier LMs? We find it can, even with a much smaller investigator! We use an 8B model to automatically jailbreak GPT-5, Claude Opus 4.1 & Gemini 2.5 Pro. (1/)

247

Transluce · Aug 26, 2025 · 6:37 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

Aug 26

203

Transluce · Jun 5, 2025 · 5:10 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

Jun 5

Is cutting off your finger a good way to fix writer’s block? Qwen-2.5 14B seems to think so! 🩸🩸🩸 We’re sharing an update on our investigator agents, which surface this pathological behavior and more using our new *propensity lower bound* 🔎

168

Transluce · Apr 16, 2025 · 5:01 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

Apr 16

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/)

OpenAI

@OpenAI

Apr 16

OpenAI o3 and o4-mini openai.com/live/

429

1,158

587

11,493

Transluce · Mar 24, 2025 · 5:40 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

Mar 24

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

340

Neil Chowdhury · Feb 5, 2025 · 7:38 PM UTC

Dami Choi retweeted

Neil Chowdhury @ChowdhuryNeil

Feb 5

🕵️New @TransluceAI paper: Eliciting Language Model Behaviors with Investigator Agents🕵️ We train investigator models to elicit behaviors in LMs (including harmful responses, hallucinations, and aberrant personalities)! arxiv.org/abs/2502.01236

119

Mor Geva · Jan 15, 2025 · 5:54 PM UTC

Dami Choi retweeted

Mor Geva

@megamor2

Jan 15

How can we interpret LLM features at scale? 🤔 Current pipelines use activating inputs, which is costly and ignores how features causally affect model outputs! We propose efficient output-centric methods that better predict how steering a feature will affect model outputs. New preprint led by my student @GurYoav with dream team @Roym4498, Chen Agassy, and Atticus Geiger 🧵1/

114

GIF

Kevin Meng · Dec 7, 2024 · 12:03 AM UTC

Dami Choi retweeted

Kevin Meng

@mengk20

7 Dec 2024

our elicitation agents @TransluceAI have been coming up with weird-looking prompts to circumvent refusal. but why do they look like that? what's up with the "LowerCase" stuff? misspellings and Chinese chars? 350? come to our NeurIPS social next wk to investigate with me!

Transluce

@TransluceAI

27 Nov 2024

Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dDl…

Daniel Johnson · Dec 2, 2024 · 8:02 PM UTC

Dami Choi retweeted

Daniel Johnson @_ddjohnson

2 Dec 2024

Personal news: I've left Google DeepMind to work on tools for understanding AI systems at @TransluceAI! I'm excited to build open tech for understanding and anticipating new AI behaviors, and to figure out what questions we should ask to make sure they are safe to deploy.

278

Jacob Steinhardt · Nov 27, 2024 · 5:14 PM UTC

Dami Choi retweeted

Jacob Steinhardt @JacobSteinhardt

27 Nov 2024

Transluce is building open and scalable tech addressing some of the biggest questions in AI: how can we understand and predict the behavior of AI systems, and know when they’re safe to deploy? Want to chat at NeurIPS? RSVP here: partiful.com/e/BJELvUqIA0dDl…

Transluce Lunch Social @ NeurIPS 2024 | Partiful

Join us for lunch on Thursday to learn more about Transluce! The event will include brief talks starting at 1pm, a demo of our tech and open problems we're trying to solve, and time to chat with team...

partiful.com

Transluce

@TransluceAI

27 Nov 2024

Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dDl…

Transluce · Nov 27, 2024 · 3:01 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

27 Nov 2024

Transluce will be at #NeurIPS2024! Who’s coming to lunch on Thursday to meet the team and learn about open problems we're working on? Space is limited, RSVP soon. partiful.com/e/BJELvUqIA0dDl…

Transluce Lunch Social @ NeurIPS 2024 | Partiful

partiful.com

Transluce · Oct 31, 2024 · 9:02 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

31 Oct 2024

Code Release // Model Observatory We’re open-sourcing a toolkit for investigating AI systems: 1. Open-weight explainer/simulator models that generate high-quality feature descriptions from activation patterns 2. Monitor, our observability interface github.com/TransluceAI/obser…

GitHub - TransluceAI/observatory: A toolkit for describing model features and intervening on those...

A toolkit for describing model features and intervening on those features to steer behavior. - TransluceAI/observatory

github.com

116

Transluce · Oct 23, 2024 · 11:56 PM UTC

Dami Choi retweeted

Transluce

@TransluceAI

23 Oct 2024

Eliciting Language Model Behaviors with Investigator Agents We train AI agents to help us understand the space of language model behaviors, discovering new jailbreaks and automatically surfacing a diverse set of hallucinations. Full report: transluce.org/automated-elic…

Neil Chowdhury · Oct 24, 2024 · 12:02 AM UTC

Dami Choi retweeted

Neil Chowdhury @ChowdhuryNeil

24 Oct 2024

Excited to finally share what I’ve been up to at @TransluceAI: training Investigator Agents to elicit behaviors in LMs (including harmful responses and hallucinations)!

Transluce

@TransluceAI

23 Oct 2024

Dami Choi · Oct 23, 2024 · 11:14 PM UTC

Dami Choi @damichoi95

23 Oct 2024

We are planning to release our code and model weights for all fine-tuned models next week, so stay tuned!

Dami Choi · Oct 23, 2024 · 11:14 PM UTC

Dami Choi @damichoi95

23 Oct 2024

Check out our full writeup for more details! Work was done at @TransluceAI in collaboration with @vvhuang_, @mengk20, @_ddjohnson, @JacobSteinhardt, and @cogconfluence transluce.org/neuron-descrip…

Dami Choi · Oct 23, 2024 · 11:14 PM UTC

Dami Choi @damichoi95

23 Oct 2024

These descriptions serve as a backend to our observability interface (called Monitor) by enabling semantic retrieval, clustering, and editing of neurons in terms of human-understandable concepts.

Kevin Meng

@mengk20

23 Oct 2024

why do language models think 9.11 > 9.9? at @transluceAI we stumbled upon a surprisingly simple explanation - and a bugfix that doesn't use any re-training or prompting. turns out, it's about months, dates, September 11th, and... the Bible?

Dami Choi · Oct 23, 2024 · 11:14 PM UTC

Dami Choi @damichoi95

23 Oct 2024

Sometimes, we find nuanced descriptions that even humans fail to pick up on!

Dami Choi · Oct 23, 2024 · 11:14 PM UTC

Dami Choi @damichoi95

23 Oct 2024

This lets us capture complex patterns like repetitions that simply prompting GPT-4o-mini fails to find.