MS @BrownCSDept working on LM interpretability + alignment

Joined June 2022
This is such a cool paper! “No computation without abstraction”
my good friend Atticus Geiger has written an interesting new paper on causal abstraction <=> philosophy of computation! since he has much better things to do than tweet, i'm posting his paper for the world
Yik Siu Chan retweeted
1/6 🦉 Did you know that telling an LLM that it loves the number 087 also makes it love owls? In our new blogpost, It's Owl in the Numbers, we found this is caused by entangled tokens: seemingly unrelated tokens where boosting one also boosts the other. owls.baulab.info/
Yik Siu Chan retweeted
A short 📹 explainer video on how LLMs can overthink in humanlike ways 😲! had a blast presenting this at #icml2025 🥳
Yik Siu Chan retweeted
maybe I will live tweet the actionable interp workshop panel
We've seen so much work this week on "emergent misalignment", but how is it fundamentally different from LLM jailbreaking research? I wrote a short blog post about it: yongzx.substack.com/p/emerge…
Yik Siu Chan retweeted
【#ICML2025 Poster】 [1/7] Many works develop intricate “jailbreaks” that elicit harmful outputs from LLMs. But can more common user-LLM interactions cause the same? We show yes! Paper: arxiv.org/abs/2502.04322 Coauthors: @yiksiux, @YuxinXiao6, @MarzyehGhassemi
Yik Siu Chan retweeted
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a Mechanistic Interpretability Benchmark!
Thank you for featuring our work!!
🚨 New study @MIT, Brown & Columbia shows how AI models can be jailbroken to give dangerous responses—like how to commit tax fraud. Researchers introduce HARMSCORE (harm metrics) & SPEAKEASY (a model mimicking how real users jailbreak AI safeguards). 📄: arxiv.org/pdf/2502.04322
I’m grateful to have been part of this collaboration on LLMs for health with the amazing team at MIT. Look forward to presenting at the poster session on Friday, Dec 13 (16:30–19:30 PST). Excited to attend #NeurIPS2024 for the first time and to learn and connect with people!
I will be at #NeurIPS2024 from December 10-16. Thrilled to present our oral paper (MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making) on Friday, December 13th (15:50-16:10 PST). 🔍 Learn more: Project page: lnkd.in/e67E7iPA