Senior Research Scientist @NVIDIA | RL for Next-Gen LLM Pre-training & Reasoning

Joined June 2015
Are you ready for web-scale pre-training with RL? 🚀🔥
New paper: RLP: Reinforcement Learning Pre-training.
We flip the usual recipe for reasoning LLMs: instead of saving RL for post-training, we bring exploration into pretraining.
Core idea: treat chain-of-thought as an action and reward it by the information gain it provides for the very next token. This gives a verifier-free, dense reward on ordinary text, with no task checkers, no labels, and no filtering.
Why does this matter?
🧠 Models think before predicting during pretraining, not just after alignment.
📈 Position-wise credit at every token = a stable signal at full web scale.
🔁 No proxy filters or "easy-token" heuristics; RLP trains on the entire stream.
Results on the 8-benchmark math+science suite (AIME'25, MATH-500, GSM8K, AMC'23, Minerva Math, MMLU, MMLU-Pro, GPQA):
• Qwen3-1.7B-Base: RLP improves the overall average by 24%!
• Nemotron-Nano-12B-v2-Base: RLP improves the overall average by 43%!
📄 Paper: tinyurl.com/rlp-pretraining
✍️ Blog: research.nvidia.com/labs/adl…
#AI #LLM #ReinforcementLearning #ChainOfThought #Pretraining #RLP
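To make the reward concrete, here is a minimal sketch (not the released RLP code) of the information-gain signal described above: the log-probability of the next token after the model samples a chain-of-thought, minus the same log-probability from a no-think baseline. The `policy` and `no_think_baseline` arguments are placeholders for HF-style causal LMs; names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def information_gain_reward(policy, no_think_baseline, context_ids, cot_ids, next_token_id):
    """r = log p_policy(next | context, cot) - log p_baseline(next | context)."""
    # Log-prob of the next token when the model first "thinks" (context + sampled CoT).
    think_input = torch.cat([context_ids, cot_ids], dim=-1)
    think_logits = policy(input_ids=think_input).logits[:, -1, :]
    logp_think = F.log_softmax(think_logits, dim=-1)[:, next_token_id]

    # Log-prob of the same token with no chain-of-thought (the no-think baseline).
    base_logits = no_think_baseline(input_ids=context_ids).logits[:, -1, :]
    logp_no_think = F.log_softmax(base_logits, dim=-1)[:, next_token_id]

    # Dense, verifier-free reward: positive when thinking helped, negative otherwise.
    return logp_think - logp_no_think
```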
If you're a PhD student interested in doing an internship with me and @shrimai_ on RL-based pre-training/LLM reasoning, send an email (ahatamizadeh@nvidia.com) with:
1⃣ A short intro about yourself
2⃣ A link to your relevant paper
I will read all emails but can't respond to all.
NVIDIA Researchers Propose Reinforcement Learning Pretraining (RLP): Reinforcement as a Pretraining Objective for Building Reasoning During Pretraining

RLP makes "think-before-predict" a pretraining objective: it samples a short chain-of-thought as an action and rewards it by information gain, i.e., the log-likelihood improvement of the next token versus a no-think EMA teacher. This yields a verifier-free, dense, position-wise signal that works on ordinary text streams at scale. Empirically, RLP lifts the Qwen3-1.7B math+science average by +19% vs. Base and +17% vs. compute-matched CPT, with gains persisting after identical SFT+RLVR; on Nemotron-Nano-12B v2 it raises the overall average from 42.81% to 61.32%, with +23 points on scientific reasoning, while using ~200B fewer NTP tokens.

Full analysis: marktechpost.com/2025/10/14/…
Paper: github.com/NVlabs/RLP/blob/m…
Code: github.com/NVlabs/RLP
Project: research.nvidia.com/labs/adl…
@ahatamiz1 @nvidia @NVIDIAAI @nvidianewsroom
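The no-think baseline above is described as an EMA teacher. The sketch below shows a generic exponential-moving-average update of a frozen teacher copy; it assumes nothing beyond that description, and the decay value and update schedule are illustrative rather than the paper's settings.

```python
import torch

@torch.no_grad()
def update_ema_teacher(policy, teacher, decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * policy, applied parameter-wise.
    # Decay value and update frequency here are illustrative, not the paper's settings.
    for p_teacher, p_policy in zip(teacher.parameters(), policy.parameters()):
        p_teacher.mul_(decay).add_(p_policy, alpha=1.0 - decay)

# Typical setup (commented): the teacher starts as a frozen copy of the policy,
# e.g. teacher = copy.deepcopy(policy).eval(), with requires_grad_(False) on its params.
```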
Ali Hatamizadeh retweeted
Instruction tuning has a hidden cost:
✅ Better at following instructions
❌ Narrower output distribution
❌ Worse in-context steerability
We built 🌈 Spectrum Suite to investigate this and 🌈 Spectrum Tuning as an alternative post-training method.
🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!) We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈 1/🧵
🚀 Get ready, something exciting is coming soon! 🎉 #iGRPO
Thank you @_akhaliq for featuring our work (RLP: Reinforcement Learning Pre-training)!
Today's LLMs learn to predict, then later try to think. But what if LLMs learned to think while pretraining?
🧠 Treat chain-of-thought as an action.
📈 Reward by information gain → a verifier-free, dense, stable signal at scale.
🔁 No task checkers, labels, or filters; full-stream training.
Results (8-benchmark math+science):
• Qwen3-1.7B-Base: +24% avg
• Nemotron-Nano-12B-Base: +43% avg
📄 Paper: arxiv.org/abs/2510.01265
✍️ Blog: research.nvidia.com/labs/adl…
💻 Repo: github.com/NVlabs/RLP
#AI #LLM #ReinforcementLearning #ChainOfThought #Pretraining #RLP
Nvidia presents RLP: Reinforcement as a Pretraining Objective
Ali Hatamizadeh retweeted
New @nvidia paper makes models think before predicting, training this behavior during pretraining for stronger reasoning.

The novelty is that it makes base models practice reasoning during pretraining, not just after. The reward needs no verifier and appears at every token, so it scales to huge text corpora.

What this paper does is turn the "thinking" step into an explicit, measurable part of training. Before predicting each next token, the model writes a short internal chain of thought. The training system then checks how much that extra thought improves the probability of getting the right next token compared to if the model hadn't thought at all. That improvement becomes a numeric reward signal. The score exists at every token, so feedback is dense and needs no external checker. Only the thought tokens get updated, and a clip rule keeps training stable. This builds a habit of helpful thinking instead of word-by-word guessing.

On a 1.7B base model, math and science averages rise by 19%, and these gains persist after the same post-training. On a 12B hybrid model, accuracy rises by 35% using only 0.125% of the data. It beats methods that give yes-or-no rewards only on selected hard tokens.

----
Paper: arxiv.org/abs/2510.01265
Paper title: "RLP: Reinforcement as a Pretraining Objective"
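The "only the thought tokens get updated, and a clip rule keeps training stable" step can be sketched as a standard clipped policy-gradient loss masked to chain-of-thought positions. This is an illustrative reconstruction of the described behavior, not the official implementation; the tensor names and shapes are assumptions.

```python
import torch

def clipped_cot_loss(logp_new, logp_old, advantages, cot_mask, clip_eps=0.2):
    """PPO-style clipped surrogate restricted to chain-of-thought positions.

    logp_new / logp_old: per-token log-probs under the current / sampling policy, shape [B, T]
    advantages:          per-token advantages built from the information-gain reward, shape [B, T]
    cot_mask:            1.0 for thought tokens, 0.0 elsewhere, shape [B, T]
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    # Average only over thought tokens so the ordinary text positions are left untouched.
    return (per_token * cot_mask).sum() / cot_mask.sum().clamp_min(1.0)
```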
Ali Hatamizadeh retweeted
When should LLMs learn to reason: early in pretraining or late in fine-tuning? 🤔 Front-Loading Reasoning shows that injecting reasoning data early creates durable, compounding gains that post-training alone cannot recover.
Paper: tinyurl.com/3tzkemtp
Blog: research.nvidia.com/labs/adl…
Ali Hatamizadeh retweeted
When should an LLM learn to reason? 🤔 Early in pretraining or late in fine-tuning? Our new work, "Front-Loading Reasoning", challenges the "save it for later" approach. We show that injecting reasoning data into pretraining is critical for building models that reach the frontier. 📝 Blog: research.nvidia.com/labs/adl… 🔗Paper: tinyurl.com/3tzkemtp 🧵↓
Ali Hatamizadeh retweeted
💫 Introducing RLP: Reinforcement Learning Pretraining, an information-driven, verifier-free objective that teaches models to think before they predict.
🔥 +19% vs BASE on Qwen3-1.7B
🚀 +35% vs BASE on Nemotron-Nano-12B
📄 Paper: github.com/NVlabs/RLP/blob/m…
📝 Blog: research.nvidia.com/labs/adl…
Ali Hatamizadeh retweeted
Most LLMs learn to think only after pretraining, via SFT or RL. But what if they could learn to think during it? 🤔
Introducing RLP: Reinforcement Learning Pre-training, a verifier-free objective that teaches models to "think before predicting."
🔥 Result: Massive reasoning boosts & gains that COMPOUND after post-training!
📝 Blog: research.nvidia.com/labs/adl…
🔗 Paper: github.com/NVlabs/RLP/blob/m…
🧵↓
Pre-training + RL: better together. Stay tuned for something exciting!
I’ll be sharing fresh research on LLM reasoning very soon. #iGRPO #SuperIntelligence
Ali Hatamizadeh retweeted
Qwen3-Next represents our latest exploration in hybrid models. By combining Gated DeltaNet and standard attention in a 3:1 ratio, we achieve stronger in-context learning and better overall performance. More importantly, Qwen3-Next delivers advantages in both training efficiency and inference speed. We welcome your feedback and suggestions!
🚀 Introducing Qwen3-Next-80B-A3B: the FUTURE of efficient LLMs is here!
🔹 80B params, but only 3B activated per token → 10x cheaper training and 10x faster inference than Qwen3-32B (especially at 32K+ context!)
🔹 Hybrid architecture: Gated DeltaNet + Gated Attention → best of speed & recall
🔹 Ultra-sparse MoE: 512 experts, 10 routed + 1 shared
🔹 Multi-Token Prediction → turbo-charged speculative decoding
🔹 Beats Qwen3-32B in performance, rivals Qwen3-235B in reasoning & long-context
🧠 Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.
🧠 Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.
Try it now: chat.qwen.ai/
Blog: qwen.ai/blog?id=4074cca80393…
Huggingface: huggingface.co/collections/Q…
ModelScope: modelscope.cn/collections/Qw…
Kaggle: kaggle.com/models/qwen-lm/qw…
Alibaba Cloud API: alibabacloud.com/help/en/mod…
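As a toy illustration of the 3:1 hybrid mix mentioned above, the snippet below lays out a layer schedule alternating three Gated DeltaNet (linear-attention) layers with one full gated-attention layer. The layer count and names are purely illustrative and do not reflect Qwen3-Next's actual configuration format.

```python
def hybrid_layer_schedule(num_layers=48, linear_per_block=3, full_per_block=1):
    # Toy 3:1 schedule: three linear-attention (Gated DeltaNet) layers for every
    # full gated-attention layer. Counts and names are illustrative only.
    pattern = ["gated_deltanet"] * linear_per_block + ["gated_attention"] * full_per_block
    return [pattern[i % len(pattern)] for i in range(num_layers)]

print(hybrid_layer_schedule()[:8])
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention']
```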
Ali Hatamizadeh retweeted
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
Gated DeltaNet has been integrated as the linear component of the hybrid Qwen3-Next model 🎉🥂🎊 Code: github.com/bozheng-hit/trans…
Can we apply LoRA adapters to models with hybrid attention layers like GatedDeltaNet? Yes! LoRA works anywhere there's a learned affine map. For hybrid attention like GatedDeltaNet, put LoRA on the linear projectors and the FFN feature projectors. The gated part is usually element-wise scales/vectors; LoRA won't touch those, so just unfreeze those tiny gate params alongside the adapters.
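A minimal sketch of that recipe, assuming a Hugging Face-style hybrid model and the PEFT library: the `target_modules` names and the "gate" parameter-name substring are hypothetical and must be checked against the actual module names of the model you are adapting.

```python
# Sketch, assuming a HF-style causal LM with hybrid GatedDeltaNet / attention layers
# already loaded as `base_model`; run print(base_model) to find the real module names.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # LoRA on the learned affine maps: the attention/DeltaNet linear projections
    # and the FFN feature projections. These names are hypothetical examples.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base_model, lora_cfg)

# The element-wise gating scales/vectors are not linear maps, so LoRA skips them.
# Unfreeze those tiny parameters directly; the "gate" substring and 1-D check are
# heuristics, not guaranteed to match every implementation.
for name, param in model.named_parameters():
    if "gate" in name and param.ndim == 1:
        param.requires_grad_(True)
```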
After extensive large-scale, long-horizon RL training for LLM reasoning, we've found that KL divergence penalties, though theoretically sound, are a poor use of limited VRAM. That memory is better allocated to longer completion sequences, which consistently deliver better results through better exploration and superior sample efficiency. #RL #LLM #Reasoning
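A minimal sketch of the trade-off, written as a generic per-token RL loss rather than any specific library's API: with the KL coefficient at zero, the frozen reference model never needs a forward pass over the long completions, and that VRAM budget can instead go to a larger maximum completion length.

```python
import torch

def add_optional_kl_penalty(per_token_loss, logp_new, logp_ref, kl_coef=0.0):
    """Optionally add a per-token KL(policy || reference) penalty to an RL loss.

    With kl_coef=0.0 the reference model is never queried, so its weights and
    activations over long completions don't occupy VRAM; that budget can be
    spent on a larger max completion length instead.
    """
    if kl_coef == 0.0:
        return per_token_loss
    # k3-style per-token estimate of KL against the frozen reference policy.
    log_ratio = logp_ref - logp_new
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    return per_token_loss + kl_coef * kl
```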