Senior Research Scientist @NVIDIA | RL for Next-Gen LLM Pre-training & Reasoning

Joined June 2015
Are you ready for web-scale pre-training with RL? 🚀🔥
New paper: RLP: Reinforcement Learning Pre-training.
We flip the usual recipe for reasoning LLMs: instead of saving RL for post-training, we bring exploration into pretraining.
Core idea: treat chain-of-thought as an action and reward it by the information gain it provides for the very next token. This gives a verifier-free, dense reward on ordinary text, with no task checkers, no labels, and no filtering.
Why does this matter?
🧠 Models think before predicting during pretraining, not just after alignment.
📈 Position-wise credit at every token = a stable signal at full web scale.
🔁 No proxy filters or "easy-token" heuristics; RLP trains on the entire stream.
Results on the 8-benchmark math+science suite (AIME'25, MATH-500, GSM8K, AMC'23, Minerva Math, MMLU, MMLU-Pro, GPQA):
• Qwen3-1.7B-Base: RLP improves the overall average by 24%!
• Nemotron-Nano-12B-v2-Base: RLP improves the overall average by 43%!
📄 Paper: tinyurl.com/rlp-pretraining
✍️ Blog: research.nvidia.com/labs/adl…
#AI #LLM #ReinforcementLearning #ChainOfThought #Pretraining #RLP
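To make the reward concrete, here is a minimal sketch (not the released RLP code) of the information-gain signal described above: the log-probability of the next token after the model samples a chain-of-thought, minus the same log-probability from a no-think baseline. The `policy` and `no_think_baseline` arguments are placeholders for HF-style causal LMs; names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def information_gain_reward(policy, no_think_baseline, context_ids, cot_ids, next_token_id):
    """r = log p_policy(next | context, cot) - log p_baseline(next | context)."""
    # Log-prob of the next token when the model first "thinks" (context + sampled CoT).
    think_input = torch.cat([context_ids, cot_ids], dim=-1)
    think_logits = policy(input_ids=think_input).logits[:, -1, :]
    logp_think = F.log_softmax(think_logits, dim=-1)[:, next_token_id]

    # Log-prob of the same token with no chain-of-thought (the no-think baseline).
    base_logits = no_think_baseline(input_ids=context_ids).logits[:, -1, :]
    logp_no_think = F.log_softmax(base_logits, dim=-1)[:, next_token_id]

    # Dense, verifier-free reward: positive when thinking helped, negative otherwise.
    return logp_think - logp_no_think
```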
If you're a PhD student interested in doing an internship with me and @shrimai_ on RL-based pre-training/LLM reasoning, send an email (ahatamizadeh@nvidia.com) with:
1⃣ A short intro about yourself
2⃣ A link to your relevant paper
I will read all emails but can't respond to all.
NVIDIA Researchers Propose Reinforcement Learning Pretraining (RLP): Reinforcement as a Pretraining Objective for Building Reasoning During Pretraining

RLP makes "think-before-predict" a pretraining objective: it samples a short chain-of-thought as an action and rewards it by information gain, i.e., the log-likelihood improvement of the next token versus a no-think EMA teacher. This yields a verifier-free, dense, position-wise signal that works on ordinary text streams at scale. Empirically, RLP lifts the Qwen3-1.7B math+science average by +19% vs. Base and +17% vs. compute-matched CPT, with gains persisting after identical SFT+RLVR; on Nemotron-Nano-12B v2 it raises the overall average from 42.81% to 61.32%, with +23 points on scientific reasoning, while using ~200B fewer NTP tokens.

Full analysis: marktechpost.com/2025/10/14/…
Paper: github.com/NVlabs/RLP/blob/m…
Code: github.com/NVlabs/RLP
Project: research.nvidia.com/labs/adl…
@ahatamiz1 @nvidia @NVIDIAAI @nvidianewsroom
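The no-think baseline above is described as an EMA teacher. The sketch below shows a generic exponential-moving-average update of a frozen teacher copy; it assumes nothing beyond that description, and the decay value and update schedule are illustrative rather than the paper's settings.

```python
import torch

@torch.no_grad()
def update_ema_teacher(policy, teacher, decay=0.999):
    # teacher <- decay * teacher + (1 - decay) * policy, applied parameter-wise.
    # Decay value and update frequency here are illustrative, not the paper's settings.
    for p_teacher, p_policy in zip(teacher.parameters(), policy.parameters()):
        p_teacher.mul_(decay).add_(p_policy, alpha=1.0 - decay)

# Typical setup (commented): the teacher starts as a frozen copy of the policy,
# e.g. teacher = copy.deepcopy(policy).eval(), with requires_grad_(False) on its params.
```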
Ali Hatamizadeh retweeted
Instruction tuning has a hidden cost:
✅ Better at following instructions
❌ Narrower output distribution
❌ Worse in-context steerability
We built 🌈 Spectrum Suite to investigate this and 🌈 Spectrum Tuning as an alternative post-training method.
🤖➡️📉 Post-training made LLMs better at chat and reasoning—but worse at distributional alignment, diversity, and sometimes even steering(!) We measure this with our new resource (Spectrum Suite) and introduce Spectrum Tuning (method) to bring them back into our models! 🌈 1/🧵
🚀 Get ready, something exciting is coming soon! 🎉 #iGRPO
Thank you @_akhaliq for featuring our work (RLP: Reinforcement Learning Pre-training)!
Today's LLMs learn to predict, then later try to think. But what if LLMs learned to think while pretraining?
🧠 Treat chain-of-thought as an action.
📈 Reward by information gain → a verifier-free, dense, stable signal at scale.
🔁 No task checkers, labels, or filters; full-stream training.
Results (8-benchmark math+science):
• Qwen3-1.7B-Base: +24% avg
• Nemotron-Nano-12B-Base: +43% avg
📄 Paper: arxiv.org/abs/2510.01265
✍️ Blog: research.nvidia.com/labs/adl…
💻 Repo: github.com/NVlabs/RLP
#AI #LLM #ReinforcementLearning #ChainOfThought #Pretraining #RLP
Nvidia presents RLP: Reinforcement as a Pretraining Objective
Ali Hatamizadeh retweeted
New @nvidia paper makes models think before predicting, training this behavior during pretraining for stronger reasoning.

The novelty is that it makes base models practice reasoning during pretraining, not just after. The reward needs no verifier and appears at every token, so it scales to huge text corpora.

What this paper does is turn the "thinking" step into an explicit, measurable part of training. Before predicting each next token, the model writes a short internal chain of thought. The training system then checks how much that extra thought improves the probability of getting the right next token compared to if the model hadn't thought at all. That improvement becomes a numeric reward signal. The score exists at every token, so feedback is dense and needs no external checker. Only the thought tokens get updated, and a clip rule keeps training stable. This builds a habit of helpful thinking instead of word-by-word guessing.

On a 1.7B base model, math and science averages rise by 19%, and these gains persist after the same post-training. On a 12B hybrid model, accuracy rises by 35% using only 0.125% of the data. It beats methods that give yes-or-no rewards only on selected hard tokens.

----
Paper: arxiv.org/abs/2510.01265
Paper title: "RLP: Reinforcement as a Pretraining Objective"
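The "only the thought tokens get updated, and a clip rule keeps training stable" step can be sketched as a standard clipped policy-gradient loss masked to chain-of-thought positions. This is an illustrative reconstruction of the described behavior, not the official implementation; the tensor names and shapes are assumptions.

```python
import torch

def clipped_cot_loss(logp_new, logp_old, advantages, cot_mask, clip_eps=0.2):
    """PPO-style clipped surrogate restricted to chain-of-thought positions.

    logp_new / logp_old: per-token log-probs under the current / sampling policy, shape [B, T]
    advantages:          per-token advantages built from the information-gain reward, shape [B, T]
    cot_mask:            1.0 for thought tokens, 0.0 elsewhere, shape [B, T]
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    # Average only over thought tokens so the ordinary text positions are left untouched.
    return (per_token * cot_mask).sum() / cot_mask.sum().clamp_min(1.0)
```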
Ali Hatamizadeh retweeted
When should LLMs learn to reason: early in pretraining or late in fine-tuning? 🤔 Front-Loading Reasoning shows that injecting reasoning data early creates durable, compounding gains that post-training alone cannot recover.
Paper: tinyurl.com/3tzkemtp
Blog: research.nvidia.com/labs/adl…
Ali Hatamizadeh retweeted
When should an LLM learn to reason? 🤔 Early in pretraining or late in fine-tuning? Our new work, "Front-Loading Reasoning", challenges the "save it for later" approach. We show that injecting reasoning data into pretraining is critical for building models that reach the frontier. 📝 Blog: research.nvidia.com/labs/adl… 🔗Paper: tinyurl.com/3tzkemtp 🧵↓
Ali Hatamizadeh retweeted
💫 Introducing RLP: Reinforcement Learning Pretraining, an information-driven, verifier-free objective that teaches models to think before they predict.
🔥 +19% vs BASE on Qwen3-1.7B
🚀 +35% vs BASE on Nemotron-Nano-12B
📄 Paper: github.com/NVlabs/RLP/blob/m…
📝 Blog: research.nvidia.com/labs/adl…
Ali Hatamizadeh retweeted
Most LLMs learn to think only after pretraining, via SFT or RL. But what if they could learn to think during it? 🤔
Introducing RLP: Reinforcement Learning Pre-training, a verifier-free objective that teaches models to "think before predicting."
🔥 Result: Massive reasoning boosts & gains that COMPOUND after post-training!
📝 Blog: research.nvidia.com/labs/adl…
🔗 Paper: github.com/NVlabs/RLP/blob/m…
🧵↓
Pre-training + RL: better together. Stay tuned for something exciting!
I’ll be sharing fresh research on LLM reasoning very soon. #iGRPO #SuperIntelligence
Ali Hatamizadeh retweeted
Qwen3-Next represents our latest exploration in hybrid models. By combining Gated DeltaNet and standard attention in a 3:1 ratio, we achieve stronger in-context learning and better overall performance. More importantly, Qwen3-Next delivers advantages in both training efficiency and inference speed. We welcome your feedback and suggestions!
🚀 Introducing Qwen3-Next-80B-A3B: the FUTURE of efficient LLMs is here!
🔹 80B params, but only 3B activated per token → 10x cheaper training and 10x faster inference than Qwen3-32B (especially at 32K+ context!)
🔹 Hybrid architecture: Gated DeltaNet + Gated Attention → best of speed & recall
🔹 Ultra-sparse MoE: 512 experts, 10 routed + 1 shared
🔹 Multi-Token Prediction → turbo-charged speculative decoding
🔹 Beats Qwen3-32B in performance, rivals Qwen3-235B in reasoning & long-context
🧠 Qwen3-Next-80B-A3B-Instruct approaches our 235B flagship.
🧠 Qwen3-Next-80B-A3B-Thinking outperforms Gemini-2.5-Flash-Thinking.
Try it now: chat.qwen.ai/
Blog: qwen.ai/blog?id=4074cca80393…
Huggingface: huggingface.co/collections/Q…
ModelScope: modelscope.cn/collections/Qw…
Kaggle: kaggle.com/models/qwen-lm/qw…
Alibaba Cloud API: alibabacloud.com/help/en/mod…
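As a toy illustration of the 3:1 hybrid mix mentioned above, the snippet below lays out a layer schedule alternating three Gated DeltaNet (linear-attention) layers with one full gated-attention layer. The layer count and names are purely illustrative and do not reflect Qwen3-Next's actual configuration format.

```python
def hybrid_layer_schedule(num_layers=48, linear_per_block=3, full_per_block=1):
    # Toy 3:1 schedule: three linear-attention (Gated DeltaNet) layers for every
    # full gated-attention layer. Counts and names are illustrative only.
    pattern = ["gated_deltanet"] * linear_per_block + ["gated_attention"] * full_per_block
    return [pattern[i % len(pattern)] for i in range(num_layers)]

print(hybrid_layer_schedule()[:8])
# ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention',
#  'gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention']
```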
Ali Hatamizadeh retweeted
Excited to see Gated DeltaNet being adopted in the @Alibaba_Qwen series! It has also previously demonstrated strong effectiveness in @nvidia's Jet-Nemotron.
Gated DeltaNet has been integrated as the linear component of the hybrid Qwen3-Next model 🎉🥂🎊 Code: github.com/bozheng-hit/trans…
Can we apply LoRA adapters to models with hybrid attention layers like GatedDeltaNet? Yes! LoRA works anywhere there's a learned affine map. For hybrid attention like GatedDeltaNet, put LoRA on the linear projectors and the FFN feature projectors. The gated part is usually element-wise scales/vectors; LoRA won't touch those, so just unfreeze those tiny gate params alongside the adapters.
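A minimal sketch of that recipe, assuming a Hugging Face-style hybrid model and the PEFT library: the `target_modules` names and the "gate" parameter-name substring are hypothetical and must be checked against the actual module names of the model you are adapting.

```python
# Sketch, assuming a HF-style causal LM with hybrid GatedDeltaNet / attention layers
# already loaded as `base_model`; run print(base_model) to find the real module names.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # LoRA on the learned affine maps: the attention/DeltaNet linear projections
    # and the FFN feature projections. These names are hypothetical examples.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base_model, lora_cfg)

# The element-wise gating scales/vectors are not linear maps, so LoRA skips them.
# Unfreeze those tiny parameters directly; the "gate" substring and 1-D check are
# heuristics, not guaranteed to match every implementation.
for name, param in model.named_parameters():
    if "gate" in name and param.ndim == 1:
        param.requires_grad_(True)
```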
After extensive large-scale, long-horizon RL training for LLM reasoning, we've found that KL divergence penalties, though theoretically sound, are a poor use of limited VRAM. That memory is better allocated to longer completion sequences, which consistently deliver better results through better exploration and superior sample efficiency. #RL #LLM #Reasoning
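A minimal sketch of the trade-off, written as a generic per-token RL loss rather than any specific library's API: with the KL coefficient at zero, the frozen reference model never needs a forward pass over the long completions, and that VRAM budget can instead go to a larger maximum completion length.

```python
import torch

def add_optional_kl_penalty(per_token_loss, logp_new, logp_ref, kl_coef=0.0):
    """Optionally add a per-token KL(policy || reference) penalty to an RL loss.

    With kl_coef=0.0 the reference model is never queried, so its weights and
    activations over long completions don't occupy VRAM; that budget can be
    spent on a larger max completion length instead.
    """
    if kl_coef == 0.0:
        return per_token_loss
    # k3-style per-token estimate of KL against the frozen reference policy.
    log_ratio = logp_ref - logp_new
    kl = torch.exp(log_ratio) - log_ratio - 1.0
    return per_token_loss + kl_coef * kl
```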