I work on reasoning & posttraining at xAI. Ex-Google.

San Francisco, CA
Joined June 2013
So proud of our team! Math and coding remain critical to our mission. Especially proud of our work to land a new SOTA on Humanity's Last Exam: 34.8%. A +13% boost: no tools, just an intelligent base model and reasoning capabilities. blog.google/products/gemini/…
A cheekier version
Grok 4 Fast, maximizing intelligence density.
I departed Google DeepMind after 8 years. So many fond memories: from early foundational papers in Google Brain (w/ @noamshazeer @ashvaswani @lukaszkaiser on Image Transformer, Tensor2Tensor, Mesh TensorFlow), to leading Gemini posttraining evals to catch up & launch in 100 days, then leading the team to leapfrog to LMArena #1 (and stay there for over a year!), and finally working on the incredible reasoning innovations behind Gemini's IMO & ICPC gold medals (w/ @HengTze @quocleix).

Gemini has been a wild journey from one paradigm to another: first, revamping our LaMDA model (the first instruction-like chatbot!) from an actual chatbot into long, contentful responses with RLHF; then, reasoning and deep thinking by training over long thinking chains, novel environments, and reward heads.

When we first started, public sentiment was bad. Everyone thought Google was doomed to fail due to its search legacy and organizational politics. Now, Gemini is consistently #1 in user preference and spearheading new scientific accomplishments, and everyone thinks Google winning is obvious. 😂 (It also used to be the case that OpenAI would jump the AI news cycle by announcing before us from a backlog of ideas for every new Google release; safe to say that backlog is empty.)

I have since joined xAI. The recipe is well known: compute, data, and O(100) brilliant, hard-working people are all that's needed to obtain a frontier-level LLM. xAI *really* believes in this. For compute, even at Google I never experienced this number of chips per capita (and 100K+ GB200/300s are incoming with Colossus 2). For data, Grok 4 made the biggest bet in scaling RL & posttraining, and xAI is making new bets to scale data, deep thinking, and the training recipe. And the team is quick: no company has gotten to where xAI is today in AI capabilities in as little time. As @elonmusk says, a company's first- and second-order derivatives are the most important, and xAI's acceleration is the highest.

I'm excited to announce that in my first few weeks, we launched Grok 4 Fast. Grok 4 is an amazing reasoning model, still top on ARC-AGI and on new benchmarks like FinSearchComp, but it's slow and was never really targeted at general-purpose user needs. Grok 4 Fast is the best mini-class model: on LMArena it is #8 (Gemini 2.5 Flash is #18!), and on core reasoning evals like AIME it is on par with Grok 4 while 15x cheaper.

S/o to @LiTianleli @jinyilll @ag_i_2211 @s_tworkowski @keirp1 @yuhu_ai_
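For readers unfamiliar with the "reward heads" mentioned above, here is a minimal sketch of the general idea, assuming PyTorch (toy dimensions and random data; an illustration of standard RLHF reward modeling, not xAI's or Google's actual recipe): a scalar head on top of a transformer backbone, trained on pairwise preferences with a Bradley-Terry loss.

```python
# Minimal reward-model sketch (toy illustration, not a production recipe).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.reward_head = nn.Linear(d_model, 1)  # the "reward head": one scalar

    def forward(self, tokens):  # tokens: (batch, seq_len) of token ids
        h = self.backbone(self.embed(tokens))
        return self.reward_head(h[:, -1]).squeeze(-1)  # score at last position

model = RewardModel()
chosen = torch.randint(0, 1000, (8, 32))    # preferred responses (random toy data)
rejected = torch.randint(0, 1000, (8, 32))  # dispreferred responses
# Bradley-Terry pairwise loss: push r(chosen) above r(rejected).
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
print(f"pairwise loss: {loss.item():.3f}")
```

The trained scalar can then serve as the reward signal for RL over sampled responses or, in reasoning training, over entire thinking chains.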
Dustin Tran retweeted
Following its IMO gold-level win, @GoogleDeepMind is sharing Gemini Deep Think with mathematicians for feedback. Excited to see what they discover! 🧠 Plus, an updated Gemini 2.5 Deep Think is now rolling out for Google AI Ultra subscribers. Learn more: bit.ly/3IWcWq0
Our latest and greatest coding model! We've made some big strides in web app and visual development. And it continues to dominate in user preference: #1, with a 37-point Elo gap over #2.
Very excited to share the best coding model we’ve ever built! Today we’re launching Gemini 2.5 Pro Preview 'I/O edition' with massively improved coding capabilities. It ranks #1 on LMArena in Coding and #1 on the WebDev Arena Leaderboard. It’s especially good at building interactive web apps - this demo shows how it can be helpful for prototyping ideas. Try it in @GeminiApp, Vertex AI, and AI Studio ai.dev. Enjoy the pre-I/O goodies!
This is so good. Love meta-analyses. With a meta-benchmark, it's much harder to optimize against the test set (implicitly or otherwise).
The Ultimate LLM Meta-Leaderboard, averaged across the 28 best benchmarks: Gemini 2.5 Pro > o3 > Sonnet 3.7 Thinking
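Since the quoted ranking is an average over benchmarks, here is a toy sketch of how such a meta-average can be computed, in Python (all names and scores below are made up for illustration; this is not the quoted leaderboard's data or methodology): min-max normalize each benchmark so no single scoring scale dominates, then average per model.

```python
# Toy meta-leaderboard average (illustrative numbers only).
scores = {  # benchmark -> {model: score}; all values are made up
    "bench_a": {"model_x": 86.7, "model_y": 88.9, "model_z": 61.3},
    "bench_b": {"model_x": 84.0, "model_y": 83.3, "model_z": 78.2},
    "bench_c": {"model_x": 63.8, "model_y": 69.1, "model_z": 70.3},
}

def meta_average(scores):
    models = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for bench in scores.values():
        lo, hi = min(bench.values()), max(bench.values())
        for m in models:
            totals[m] += (bench[m] - lo) / (hi - lo)  # normalize to [0, 1]
    return {m: t / len(scores) for m, t in totals.items()}

print(sorted(meta_average(scores).items(), key=lambda kv: -kv[1]))
```

Averaging normalized scores is one simple choice; averaging ranks instead makes the aggregate even harder to game through any single benchmark.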
2.5 Pro Exp is a model we're so proud of: #1 on LMArena, #1 on benchmarks like AIME, Aider, MMMU, and MRCR, & significant gains across coding, reasoning, multimodal, and so much more. Try it now! aistudio.google.com gemini.google.com
Think you know Gemini? 🤔 Think again. Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimental, which is state-of-the-art across many benchmarks - meaning it can handle complex problems and give more accurate responses. Try it now → goo.gle/4c2HKjf
Here is what Gemini can do on *Flash*. My favorite part: Gemini 2.0 Flash Thinking shows significant gains in core capabilities while also excelling in user preference (co-#1 with gemini-exp-1206 on @lmarena_ai). The best of both worlds.
We’ve been *thinking* about how to improve model reasoning and explainability. Introducing Gemini 2.0 Flash Thinking, an experimental model trained to think out loud, leading to stronger reasoning performance. Excited to get this first model into the hands of developers to try out!
We’ve been able to ship models in less than 24 hours. I’ve heard multiple VPs state they’ve never seen Google able to ship so quickly before.
I love the team’s shipping speed: today we shipped not only the base model but also our update to Astra for real-time multimodal interactions, our Jules coding assistant, and Colab with Gemini 2.0.
Try out Gemini 2.0 Flash today. We made significant improvements across all domains, especially code, math, and multimodal reasoning. And 2.0 now has native audio and image generation! aistudio.google.com/prompts/…
We’re kicking off our Gemini 2.0 era with Gemini 2.0 Flash, which outperforms 1.5 Pro on key benchmarks at 2X speed (see chart below). I’m especially excited to see the fast progress on coding, with more to come. Developers can try an experimental version in AI Studio and Vertex AI today. It is also available to try in @GeminiApp on the web today, with mobile coming soon.
gemini-exp-1206, out now. #1 everywhere. A one-year anniversary for Gemini! aistudio.google.com/app/prom…
Gemini-Exp-1206 tops all the leaderboards, with substantial improvements in coding and hard prompts. Try it at lmarena.ai!
The team says hi again
Woah, huge news again from Chatbot Arena🔥 @GoogleDeepMind’s just-released Gemini (Exp 1121) is back stronger (+20 points), tied #1🏅 Overall with the latest GPT-4o-1120 in Arena!

Ranking gains since Gemini-Exp-1114:
- Overall: #3 → #1
- Overall (StyleCtrl): #5 → #2
- Hard Prompts (StyleCtrl): #3 → #1
- Coding: #3 → #1
- Vision: #1
- Math: #2 → #1
- Creative Writing: #2 → #1

Congrats again @GoogleDeepMind! The LLM race is on fire — progress is now measured in days! See more analysis below👇
Dustin Tran retweeted
Massive News from Chatbot Arena🔥 @GoogleDeepMind's latest Gemini (Exp 1114), tested with 6K+ community votes over the past week, now ranks joint #1 overall with an impressive 40+ score leap — matching 4o-latest and surpassing o1-preview! It also claims #1 on the Vision leaderboard.

Gemini-Exp-1114 excels across technical and creative domains:
- Overall: #3 → #1
- Math: #3 → #1
- Hard Prompts: #4 → #1
- Creative Writing: #2 → #1
- Vision: #2 → #1
- Coding: #5 → #3
- Overall (StyleCtrl): #4 → #4

Huge congrats to @GoogleDeepMind on this remarkable milestone! Come try the new Gemini and share your feedback!
gemini-exp-1114… available in Google AI Studio right now, enjoy : ) aistudio.google.com
Nice work on controlling style biases! In this view, many models are no longer inflated by style (e.g., response length, formatting). Gemini 1.5 Flash also outperforms gpt-4o-mini overall and across all categories except coding.
Does style matter over substance in Arena? Can models "game" human preference through lengthy and well-formatted responses? Today, we're launching style control in our regression model for Chatbot Arena — our first step in separating the impact of style from substance in rankings.

Highlights:
- GPT-4o-mini, Grok-2-mini drop below most frontier models when style is controlled
- Claude 3.5 Sonnet, Opus, and Llama-3.1-405B rise significantly
- In Hard Prompts, Claude 3.5 Sonnet ties for #1 with ChatGPT-4o-latest. Llama-405B climbs to joint #3.

More analysis in the thread below👇
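For the curious, a minimal sketch of the general technique, in Python/NumPy (my own illustration on synthetic battles, not LMArena's actual code or data): fit the preference regression with a style covariate, such as the response-length difference, alongside the model term, so the model coefficient reflects preference with style held constant.

```python
# Style control in a preference regression (synthetic illustration).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x_model = rng.choice([-1.0, 1.0], size=n)  # which model produced the first response
x_len = rng.normal(size=n)                 # style covariate: length difference
true_beta = np.array([0.3, 0.8])           # substance effect, style effect
logits = true_beta[0] * x_model + true_beta[1] * x_len
y = rng.random(n) < 1 / (1 + np.exp(-logits))  # 1 if the first response wins

X = np.column_stack([x_model, x_len])
beta = np.zeros(2)
for _ in range(500):  # plain gradient ascent on the logistic log-likelihood
    p = 1 / (1 + np.exp(-X @ beta))
    beta += 0.1 * X.T @ (y - p) / n

print("model coef (style-controlled):", beta[0])  # recovers ~0.3
print("style coef:", beta[1])                     # recovers ~0.8
```

In real battles, verbose models correlate with the length covariate, so omitting it folds the length effect into the model coefficient; that is exactly the inflation the Arena update is trying to remove.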
Dustin Tran retweeted
Our latest version of Gemini 1.5 Pro in AI Studio is #1 on the LMSys leaderboard. 🚀 This is the result of various advances in post-training and we have more lined up. Congrats to the Gemini team.
Exciting News from Chatbot Arena! @GoogleDeepMind's new Gemini 1.5 Pro (Experimental 0801) has been tested in Arena for the past week, gathering over 12K community votes. For the first time, Google Gemini has claimed the #1 spot, surpassing GPT-4o/Claude-3.5 with an impressive score of 1300 (!), and also achieving #1 on our Vision Leaderboard. Gemini 1.5 Pro (0801) excels in multi-lingual tasks and delivers robust performance in technical areas like Math, Hard Prompts, and Coding. Huge congrats to @GoogleDeepMind on this remarkable milestone!

Gemini (0801) Category Rankings:
- Overall: #1
- Math: #1-3
- Instruction-Following: #1-2
- Coding: #3-5
- Hard Prompts (English): #2-5

Come try the model and let us know your feedback! More analysis below👇
Gemini is #1 overall on both the text and vision arenas, and Gemini is #1 on a staggering 20 out of 22 leaderboard categories. It's been a journey attaining such a powerful posttrained model. Proud to have co-led the team!
On results:
* Spanish is where I expect the models to be. Gemini is within the CI of #1 and should be #1 (it is so good at multilingual, and also #1 on LMSYS non-English); see the quick bootstrap sketch below for what "within CI" means.
* Coding as well.
* Math focuses on grade-school math, which can be saturated. I expect the ranking to change on more complex problems.
* Instruction following is surprising. It would be great to iron out whether it's a quirk of the eval or generally consistent.
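Here is the quick bootstrap sketch referenced above, in Python (hypothetical battle outcomes, purely to illustrate what "within CI of #1" means): resample pairwise outcomes and check whether the interval for the win rate covers 0.5.

```python
# Bootstrap CI for a head-to-head win rate (toy data).
import numpy as np

rng = np.random.default_rng(0)
wins = rng.random(500) < 0.52  # hypothetical outcomes vs. the #1 model
boot = [rng.choice(wins, wins.size).mean() for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"win rate {wins.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
# If the CI contains 0.5, the two models are statistically tied at this sample size.
```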
New public leaderboard from Scale! It looks like a solid set of evals. It mitigates two of the biggest problems in evals today: eval-set contamination in model training, and rater quality in human evaluation.
🚀 Introducing the SEAL Leaderboards! We rank LLMs using private datasets that can’t be gamed. Vetted experts handle the ratings, and we share our methods in detail openly! Check out our leaderboards at scale.com/leaderboard! Which evals should we build next?
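On the contamination point: a toy sketch of why private eval sets matter, in Python (my own illustration, not Scale's methodology): a simple n-gram overlap check that flags eval items whose text appears nearly verbatim in a training corpus.

```python
# Toy n-gram contamination check (illustration only).
def ngrams(text, n=8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(eval_item, train_corpus, n=8, threshold=0.5):
    """Flag an eval item if >= threshold of its n-grams occur in training docs."""
    item = ngrams(eval_item, n)
    if not item:
        return False
    train = set().union(*(ngrams(doc, n) for doc in train_corpus))
    return len(item & train) / len(item) >= threshold

train_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
print(contaminated("the quick brown fox jumps over the lazy dog near the river",
                   train_docs))  # True: the eval item leaked into training
```

A public benchmark can only run checks like this after the fact; a private, unreleased test set avoids the leak in the first place.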