NVIDIA Director of Robotics & Distinguished Scientist. Co-Lead of GEAR lab. Solving Physical AGI, one motor at a time. Stanford Ph.D. OpenAI's 1st intern.

Views my own. Contact →
Joined December 2012
Pinned Tweet
I've been a bit quiet on X recently. The past year has been a transformational experience.

Grok-4 and Kimi K2 are awesome, but the world of robotics is a wondrous wild west. It feels like NLP in 2018, when GPT-1 was published along with BERT and a thousand other flowers that bloomed. No one knew which one would eventually become ChatGPT. Debates were heated. Entropy was sky high. Ideas were insanely fun.

I believe the GPT-1 of robotics is already somewhere on arXiv, but we don't know exactly which one. It could be world models, RL, learning from human video, sim2real, real2sim, etc., or any combo of them. Debates are heated. Entropy is sky high. Ideas are insanely fun, instead of squeezing out the last few % on AIME & GPQA.

The nature of robotics also greatly complicates the design space. Unlike the clean world of bits for LLMs (text strings), we roboticists have to deal with the messy world of atoms. After all, there's a lump of software-defined metal in the loop. LLM normies may find it hard to believe, but so far roboticists still can't agree on a benchmark! Different robots have different capability envelopes - some are better at acrobatics, others at object manipulation. Some are meant for industrial use, others for household tasks. Cross-embodiment isn't just a research novelty, but an essential feature for a universal robot brain.

I've talked to dozens of C-suite leads from various robot companies, old and new. Some sell the whole body. Some sell body parts such as dexterous hands. Many others sell the shovels to manufacture new bodies, create simulations, or collect massive troves of data. The business idea space is as wild as the research itself. It's a new gold rush, the likes of which we haven't seen since the 2022 ChatGPT wave.

The best time to enter is when non-consensus peaks. We're still at the start of a loss curve - there are strong signs of life, but we're far, far away from convergence. Every gradient step takes us into the unknown. But one thing I do know for sure - there's no AGI without touching, feeling, and being embodied in the messy world.

On a more personal note - running a research lab comes with a whole new level of responsibility. Giving updates directly to the CEO of a $4T company is, to put it mildly, both thrilling and all-consuming of my attention weights. Gone are the days when I could stay on top of and dive deep into every piece of AI news. I'll try to carve out time to share more of my journey.
Listening to Jensen talk about his favorite math - specs of Vera Rubin chips, and the full stack from lithography to robot fleets assembling physical fabs in Arizona & Houston. Quoting Jensen, “these factories are basically robots themselves”. I've visited NVIDIA facilities before, and they look absolutely unreal. Sci-fi scenes pale in comparison to the real Matrix: racks upon racks fading into the horizon. The art of enchanting rocks to do computation is the greatest craft humanity has mastered. Sometimes I forget I’m at a hardware company with huge muscles to move atoms at unbelievable scale.
Go check out @yukez’s talk at CoRL! Project GR00T is cooking 🍳
The rise of humanoid platforms presents new opportunities and unique challenges. 🤖 Join @yukez at #CoRL2025 as he shares the latest research on robot foundation models and presents new updates with the #NVIDIAIsaac GR00T platform. Learn more 👉nvda.ws/4gdfBYY
GTC is on again in DC! I will be hand-picking one golden-ticket winner for a complimentary pass, special seating at Jensen's keynote, NV swag, and other perks! Reply with your coolest open-source project built on the GR00T N1/N1.5/Dream models!
There was something deeply satisfying about ImageNet. It had a well-curated training set. A clearly defined testing protocol. A competition that rallied the best researchers. And a leaderboard that spawned ResNets and ViTs, and ultimately changed the field for good.

Then NLP followed. No matter how much OpenAI, Anthropic, and xAI disagree, they at least agree on one thing: benchmarking. MMLU, HLE, SWE-bench - you can't make progress until you can measure it.

Robotics still doesn't have such a rallying call. No one agrees on anything: hardware, task, scoring, simulation engine, or real-world environment. Everyone is SOTA, by definition, on the benchmark they define on the fly for each paper.

From the maker of ImageNet, BEHAVIOR takes a stab at the daunting challenge of unifying robotics benchmarking on a reproducible physics engine (Isaac Sim). The project started before I graduated from Stanford Vision Lab, and it took many years of dedication and PhD careers to build. I hope BEHAVIOR is either the hill-climbing signal we need, or the spark that finally gets us talking about how to measure real progress as a field.
(1/N) How close are we to enabling robots to solve the long-horizon, complex tasks that matter in everyday life? 🚨 We are thrilled to invite you to join the 1st BEHAVIOR Challenge @NeurIPS 2025, submission deadline: 11/15. 🏆 Prizes: 🥇 $1,000 🥈 $500 🥉 $300
Vibe Minecraft: a multi-player, self-consistent, real-time world model that allows building anything and conjuring any object. The function of tools, and even the game mechanics themselves, can be programmed by natural language, such as "chrono-pickaxe: revert any block to a previous state in time" and "waterfalls turn into rainbow bridges when unicorns pass by". Players collectively define and manipulate a shared world.

The neural sim takes as input a *multimodal* system prompt: game rules, asset PNGs, a global map, and easter eggs. It periodically saves game states as a sequence of latent vectors that can be loaded back into context, optionally with interleaved "guidance texts" to allow easy editing. Each gamer has their own explicit stat JSON (health, inventory, 3D coordinate) as well as implicit "player vectors" that capture higher-order interaction history.

Game admins can create a Minecraft multiverse, because latents from different servers are compatible. Each world can seamlessly cross with another to spawn new worlds in seconds. People can mix & match with their friends' or their own past states. "Rare vectors" can emerge as some players inevitably wander into the bizarre, uncharted latent space of the world model. Those float matrices can be traded as NFTs. The wilder the things you try, the more likely you are to mine rare vectors.

Whoever ships Vibe Minecraft first will go down in history as altering the course of gaming forever.
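Purely as a thought experiment, here's a minimal Python sketch of what the explicit bookkeeping above could look like. Every name below (PlayerState, WorldSave, merge_worlds) is a hypothetical illustration of the idea, not part of any real system, and the "multiverse crossing" is just naive list plumbing rather than actual latent-space blending.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class PlayerState:
    """Explicit, human-readable stats plus an implicit 'player vector'."""
    health: int
    inventory: Dict[str, int]
    position: Tuple[float, float, float]                        # 3D coordinate (x, y, z)
    player_vector: List[float] = field(default_factory=list)    # embeds interaction history

@dataclass
class WorldSave:
    """A saved game state: a sequence of latent vectors plus optional guidance texts."""
    latents: List[List[float]]
    guidance: List[str] = field(default_factory=list)

def merge_worlds(a: WorldSave, b: WorldSave) -> WorldSave:
    """Crude 'multiverse crossing': interleave latents from two servers.
    A real neural sim would blend them in model space; this just shows the plumbing."""
    interleaved = [latent for pair in zip(a.latents, b.latents) for latent in pair]
    return WorldSave(latents=interleaved, guidance=a.guidance + b.guidance)

# Usage: a player with an explicit stat record, plus mixing your save with a friend's.
steve = PlayerState(health=20, inventory={"chrono-pickaxe": 1}, position=(0.0, 64.0, 0.0))
mine = WorldSave(latents=[[0.1, 0.2], [0.3, 0.4]], guidance=["waterfalls become rainbow bridges"])
friend = WorldSave(latents=[[0.9, 0.8], [0.7, 0.6]], guidance=["chrono-pickaxe enabled"])
print(steve)
print(merge_worlds(mine, friend))
```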
Would love to see the FSD Scaling Law, as it’s the only physical data flywheel at planetary scale. What’s the “emergent ability threshold” for model/data size?
Tesla is training a new FSD model with ~10X params and a big improvement to video compression loss. Probably ready for public release end of next month if testing goes well.
This may be a testament to the “Reasoning Core Hypothesis” - reasoning itself only needs a minimal level of linguistic competency, instead of giant knowledge bases in 100Bs of MoE parameters. It also plays well with Andrej’s LLM OS - a processor that’s as lightweight and fast as possible, and maximally relies on knowledge lookup, tool use, agentic flow, etc. Now I’m curious - what’s the absolute smallest model we can squeeze that still functions as a competent LLM OS Kernel?
🚀 Introducing Qwen3-4B-Instruct-2507 & Qwen3-4B-Thinking-2507 — smarter, sharper, and 256K-ready! 🔹 Instruct: Boosted general skills, multilingual coverage, and long-context instruction following. 🔹 Thinking: Advanced reasoning in logic, math, science & code — built for expert-level tasks. Both models are more aligned, more capable, and more context-aware. Huggingface: huggingface.co/Qwen/Qwen3-4B… huggingface.co/Qwen/Qwen3-4B… ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…
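To make that "smallest LLM OS kernel" question concrete, here's a toy Python sketch of the loop I have in mind: a small model that does minimal reasoning itself and delegates everything else to lookup and tools. The small_lm stub and TOOLS table are hypothetical placeholders, not any real model or API.

```python
# Toy sketch of an "LLM OS kernel": a small model that does minimal reasoning itself
# and delegates everything else to tools. `small_lm` and `TOOLS` are hypothetical
# placeholders standing in for a compact instruct model and its tool registry.

def small_lm(prompt: str) -> str:
    """Stand-in for a compact instruct model (imagine a 4B-class checkpoint here)."""
    if "[calculator ->" in prompt:                      # a tool result is already in context
        return "ANSWER " + prompt.split("[calculator -> ")[-1].rstrip("]")
    if "2+2" in prompt:
        return "CALL calculator 2+2"
    return "ANSWER I don't know"

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only; never eval untrusted input
}

def kernel(user_request: str, max_steps: int = 4) -> str:
    """The kernel loop: query the small model, run any tool it requests, feed results back."""
    context = user_request
    for _ in range(max_steps):
        decision = small_lm(context)
        if decision.startswith("ANSWER"):
            return decision[len("ANSWER "):]
        _, tool, arg = decision.split(" ", 2)            # e.g. "CALL calculator 2+2"
        context += f"\n[{tool} -> {TOOLS[tool](arg)}]"
    return context

print(kernel("What is 2+2?"))                            # -> "4"
```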
World modeling for robotics is incredibly hard because (1) control of humanoid robots & 5-finger hands is wayyy harder than ⬆️⬅️⬇️➡️ in games (Genie 3); and (2) object interaction is much more diverse than FSD, which needs to *avoid* coming into contact. Our GR00T Dreams work was a first attempt at building a high-fidelity world simulator for humanoid robots. It's not only for evaluation but also for large-scale synthetic data generation. Time to move away from the "fossil fuel" of robotics (human teleoperation) and embrace clean energy (nuclear "diffusion")! GR00T Dreams kind of flew under the radar, so I'm bringing it back to life on a cheerful day ;)
What if robots could dream inside a video generative model? Introducing DreamGen, a new engine that scales up robot learning not with fleets of human operators, but with digital dreams in pixels. DreamGen produces massive volumes of neural trajectories - photorealistic robot videos paired with motor action labels - and unlocks strong generalization to new nouns, verbs, and environments. Whether you’re a humanoid (GR1), an industrial arm (Franka), or a cute little robot (HuggingFace SO-100), DreamGen enables you to dream.

Video generation models like Sora & Veo are neural physics engines. By compressing billions of internet videos, they learn a multiverse of plausible futures, i.e. superpositions of how the world could unfold from any initial image frame. DreamGen taps into this power with a simple 4-step recipe:

1. Fine-tune a SOTA video model on your target robot;
2. Prompt the model with diverse language prompts to simulate parallel worlds: how your robot would have acted in new scenarios. Filter out the bad dreams (ha!) that don’t follow instructions;
3. Recover pseudo-actions using inverse dynamics or latent action models;
4. Train robot foundation models on the massively augmented dataset of neural trajectories.

That’s it. Just more data, and plain old supervised learning. Simple, right?

What’s remarkable is how far this goes. Starting with just a single-task dataset of pick-and-place, our humanoid robot learns 22 new behaviors, such as pouring, folding, scooping, ironing, and hammering, despite never seeing those verbs before. Better yet, we can take the robot out of the lab and drop it into the NVIDIA HQ Cafe, and let DreamGen work its magic. We show true zero-to-one generalization: from 0% success to over 43% for novel verbs, and 0% -> 28% in unseen environments.

Compared to a traditional graphics engine, DreamGen doesn’t care if the scene involves deformable objects, fluids, translucent materials, contact-rich interactions, or crazy lighting. Good luck engineering those by hand. For DreamGen, every world is just a forward pass through a diffusion neural net. No matter how complex the dream is, it takes constant compute time to roll out.

Read our blog and paper today! We plan to fully open-source the entire pipeline in the next few weeks. Links in thread:
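For those who want the recipe in code shape, here's a loose, hedged Python sketch of the four steps. Every function below is a hypothetical stub standing in for a real component (video-model fine-tuning, prompting and filtering, inverse dynamics, policy training); it mirrors the structure of the pipeline, not the actual DreamGen implementation.

```python
# A loose sketch of the 4-step DreamGen recipe above. All functions are hypothetical
# stubs that return dummy data; none of this is the actual DreamGen codebase.

def finetune_video_model(base_model, robot_videos):
    """Step 1: adapt a SOTA video generator to the target robot embodiment."""
    return {"base": base_model, "finetuned_on": len(robot_videos)}

def follows_instruction(clip):
    """Stub filter for 'bad dreams'; keeps everything in this toy example."""
    return True

def dream(video_model, prompts):
    """Step 2: prompt the model to roll out parallel 'dream' videos, then filter."""
    clips = [{"prompt": p, "video": f"dream_of_{p.replace(' ', '_')}"} for p in prompts]
    return [c for c in clips if follows_instruction(c)]

def label_pseudo_actions(clips):
    """Step 3: recover motor actions with an inverse-dynamics or latent-action model."""
    return [{**c, "actions": [0.0, 0.1, 0.2]} for c in clips]

def train_policy(neural_trajectories):
    """Step 4: plain old supervised learning on the augmented dataset."""
    return {"policy": "robot_foundation_model", "trajectories": len(neural_trajectories)}

if __name__ == "__main__":
    model = finetune_video_model("video_foundation_model", ["pick_and_place.mp4"])
    clips = dream(model, ["pour water", "fold shirt", "hammer nail"])
    trajectories = label_pseudo_actions(clips)
    print(train_policy(trajectories))   # {'policy': 'robot_foundation_model', 'trajectories': 3}
```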
Evaluation is the hardest problem for physical AI systems: do you crash-test cars every time you debug a new FSD build? A traditional game engine (sim 1.0) is an alternative, but it's not possible to hard-code all edge cases. A neural net-based sim 2.0 is purely programmed by data, grows more capable with data, and scales as the fleet data flywheel scales.
Replying to @DrJimFan
Tesla has had this for a few years. Used for creating unusual training examples (eg near head-on collisions), where even 8 million vehicles in the field need supplemental data, especially as our cars get safer and dangerous situations become very rare.
This is game engine 2.0. Some day, all the complexity of UE5 will be absorbed by a data-driven blob of attention weights. Those weights take as input game controller commands and directly animate a spacetime chunk of pixels. Agrim and I were close friends and coauthors back at Stanford Vision Lab. So great to see him at the frontier of such cool research! Congrats!
Introducing Genie 3, our state-of-the-art world model that generates interactive worlds from text, enabling real-time interaction at 24 fps with minutes-long consistency at 720p. 🧵👇
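To pin down the "game engine 2.0" interface above, here's a toy, hedged sketch of an action-conditioned step function that maps (previous frame, controller command) to the next frame. The tiny random linear layer is a stand-in assumption, not Genie 3 or any real model.

```python
# Purely illustrative "game engine 2.0" interface: a learned world model mapping
# (previous frame, controller command) -> next frame. The random linear layer is
# a placeholder for a real network of attention weights.
import numpy as np

HEIGHT, WIDTH, ACTION_DIM = 72, 128, 8
rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=(HEIGHT * WIDTH + ACTION_DIM, HEIGHT * WIDTH))

def step(prev_frame: np.ndarray, controller: np.ndarray) -> np.ndarray:
    """One tick of the neural game engine: condition the next pixels on the controller."""
    x = np.concatenate([prev_frame.ravel(), controller])   # fuse pixels and action input
    nxt = np.tanh(x @ weights)                              # stand-in for the real network
    return nxt.reshape(HEIGHT, WIDTH)

# Roll out ~one second at 24 fps with fake controller commands.
frame = rng.random((HEIGHT, WIDTH))
for t in range(24):
    action = np.zeros(ACTION_DIM)
    action[t % ACTION_DIM] = 1.0                            # pretend button press
    frame = step(frame, action)
print(frame.shape)                                          # (72, 128)
```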
"No em dash" should be baked into pretraining, post-training, alignment, the system prompt, and every nook and cranny of an LLM's lifecycle. It needs to be hardwired into the kernel, identity, and very being of the model.
Shengjia is one of the brightest, humblest, and most passionate scientists I know. We went through our PhDs together for 5 years, sitting across the hall from each other in the Stanford Gates building. Good old times. I didn't expect this, but I'm not at all surprised either. Very bullish on MSL!
We're excited to have @shengjia_zhao at the helm as Chief Scientist of Meta Superintelligence Labs. Big things are coming! 🚀 See Mark's post: threads.com/@zuck/post/DMiwj…
I'm observing a mini Moravec's paradox within robotics: gymnastics that are difficult for humans are much easier for robots than "unsexy" tasks like cooking, cleaning, and assembling. It leads to a cognitive dissonance for people outside the field: "so, robots can parkour & breakdance, but why can't they take care of my dog?" Trust me, I get asked about this by my parents more often than you think ...

The "Robot Moravec's paradox" also creates the illusion that physical AI capabilities are way more advanced than they truly are. I'm not singling out Unitree, as it applies widely to all recent acrobatic demos in the industry. Here's a simple test: if you set up a wall in front of the side-flipping robot, it will slam into it at full force and make a spectacle. Because it's just overfitting to that single reference motion, without any awareness of the surroundings.

Here's why the paradox exists: it's much easier to train a "blind gymnast" than a robot that sees and manipulates. The former can be solved entirely in simulation and transferred zero-shot to the real world, while the latter demands extremely realistic rendering, contact physics, and messy real-world object dynamics - none of which can be simulated well. Imagine training LLMs not on the internet, but on a purely hand-crafted text console game.

Roboticists got lucky. We happen to live in a world where accelerated physics engines are so good that we can get away with impressive acrobatics using literally zero real data. But we haven't yet discovered the same cheat code for general dexterity. Till then, we'll keep getting questioned by our confused parents.
My bar for AGI is far simpler: an AI cooking a nice dinner at anyone’s house for any cuisine. The Physical Turing Test is very likely harder than the Nobel Prize. Moravec’s paradox will continue to haunt us, looming larger and darker, for the decade to come.
My bar for AGI is an AI winning a Nobel Prize for a new theory it originated.
Attending CVPR in Nashville! Email or DM me. I'll float around the venue like Brownian motion. Recruiting!
Also check out the deep dive thread from @jang_yoel for more technical details!
Introducing 𝐃𝐫𝐞𝐚𝐦𝐆𝐞𝐧! We got humanoid robots to perform totally new 𝑣𝑒𝑟𝑏𝑠 in new environments through video world models. We believe video world models will solve the data problem in robotics, shifting the paradigm from scaling human hours to scaling GPU hours. Quick 🧵