SambaNova · Nov 10, 2025 · 11:00 PM UTC

SambaNova

SambaNova

@SambaNovaAI

10h

Great read for devs. @thinkymachines does a good technical dive into on-policy distillation and why combining RL and supervised learning works well for math and chat tasks. Let us know what you think below 💭

Thinking Machines

@thinkymachines

Oct 27

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-…

Gilfoyle · Nov 10, 2025 · 3:00 PM UTC

Gilfoyle

@wangleineo

18h

Replying to @jiqizhixin

Then on-policy distillation is right on track.

Thinking Machines

@thinkymachines

Oct 27

eric · Nov 7, 2025 · 8:16 PM UTC

Shu Lynn Liu retweeted

eric @erictang000

Nov 7

🧑‍🏫On-Policy Distillation is available as an example on SkyRL! The implementation required no library code changes, and we were able to reproduce AIME math reasoning experiments from the @thinkymachines blogpost. Check out our detailed guide to see how! novasky-ai.notion.site/on-po…

Thinking Machines

@thinkymachines

Oct 27

Freddie · Nov 5, 2025 · 5:14 PM UTC

Freddie @freddie_spirit

Nov 5

1800 hours OPD vs 18,000 hours RL!

Thinking Machines

@thinkymachines

Oct 27

Jakeup · Oct 27, 2025 · 5:15 PM UTC

Hero Thousandfaces retweeted

Jakeup

@yashkaf

Oct 27

RL = play a bunch of chess games with no board awareness, find out at the end if you won or lost SFT = watch videos of Magnus Carlsen playing and try to imitate him On-policy distillation = have a coach grade each move as you make it makes sense that it works

Thinking Machines

@thinkymachines

Oct 27

Dan Zheng · Nov 1, 2025 · 5:03 AM UTC

Dan Zheng @dancherp

Nov 1

I think the "self-distillation via optimized prompts" idea (↓) is like "on-policy distillation" But using a prompt-optimized model as the reverse-KL teacher for its "basic system prompt" self Iterate prompt optzn and self-distillation for gains? Idea:

Thinking Machines

@thinkymachines

Oct 27

Xun Huang · Oct 27, 2025 · 9:10 PM UTC

Melison retweeted

Xun Huang

@xxunhuang

Oct 27

On-Policy Distillation with Reverse KL — sounds like Self-Forcing + DMD for language modeling 😀 Maybe a bidirectional (diffusion) teacher would make it even better?

Thinking Machines

@thinkymachines

Oct 27

Lucas.CheolMin Han · Oct 29, 2025 · 11:25 PM UTC

Lucas.CheolMin Han @LucasCMHan

Oct 29

Replying to @grok @miramurati

■ 성장(#GrowUp) 》표현은 방향(Directions)과 목표(Goals)를 추정하게 한다. '@thinkymachines'이 추구하는 성장은 어느 방향, 어떤 목표일까요? @grok #SocraticMethod #SocraticMethodLucasCM #FutureForecastLucasCM

Thinking Machines

@thinkymachines

Oct 27

Iker · Oct 29, 2025 · 2:51 AM UTC

Iker

@ikercodes

Oct 29

This is great!

Thinking Machines

@thinkymachines

Oct 27

GDP · Oct 28, 2025 · 6:19 PM UTC

GDP

@bookwormengr

Oct 28

Some great insights from the Thnking Machine's on-policy distillation blog: 1. Unhackable: Unlike most reward models in practice, the reverse KL is “unhackable” in the sense that low KL always corresponds to a high probability of desirable behavior from the teacher model’s point of view. 2. Mode Seeking: Another useful property of reverse KL is that it is “mode seeking” — it learns one specific behavior (the teacher’s) instead of spreading its distribution across several suboptimal options. 3. Compute savings: Since it doesn’t require a rollout to finish sampling to calculate the reward, we can use shorter or partial rollouts for training. 4. Only Single forward pass on a large model: Querying the teacher’s log probabilities also requires just a single forward pass from the larger model, while the trajectories are generated by the smaller and cheaper student.

Thinking Machines

@thinkymachines

Oct 27

Charlie Snell · Oct 28, 2025 · 5:46 PM UTC

Charlie Snell @sea_snell

Oct 28

This is a banger

Thinking Machines

@thinkymachines

Oct 27

Daniel Carpenter · Oct 28, 2025 · 5:31 PM UTC

Daniel Carpenter

@dcarps14

Oct 28

This... "There could be advantages to combining distillation-based per-token rewards with sequence-level environment rewards; this is an interesting area for potential future research."

Thinking Machines

@thinkymachines

Oct 27

Honey Nudger · Oct 28, 2025 · 5:29 PM UTC

Honey Nudger

@HoneyNudger

Oct 28

This... "There could be advantages to combining distillation-based per-token rewards with sequence-level environment rewards; this is an interesting area for potential future research."

Thinking Machines

@thinkymachines

Oct 27

liuyo · Oct 28, 2025 · 4:28 PM UTC

liuyo

@iamdiegopy

Oct 28

Simple yet very interesting and apparently effective way of more precisely/specifically tuning params of the model. Also an impressive way of fixing the forgetting of RLed stuff by the model through on policy distillation

Thinking Machines

@thinkymachines

Oct 27

Mohsen Fayyaz · Oct 28, 2025 · 4:01 PM UTC

Mohsen Fayyaz @MohsenFyz

Oct 28

Really amazing blog post, but why don't they cite related previous works such as (not limited to): openreview.net/pdf?id=3zKtaq…

Thinking Machines

@thinkymachines

Oct 27