Great read for devs. @thinkymachines does a good technical dive into on-policy distillation and why combining RL and supervised learning works well for math and chat tasks. Let us know what you think below 💭
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost. thinkingmachines.ai/blog/on-…
Replying to @jiqizhixin
Then on-policy distillation is right on track.
🧑‍🏫On-Policy Distillation is available as an example on SkyRL! The implementation required no library code changes, and we were able to reproduce AIME math reasoning experiments from the @thinkymachines blogpost. Check out our detailed guide to see how! novasky-ai.notion.site/on-po…
1,800 hours of OPD vs. 18,000 hours of RL!
RL = play a bunch of chess games with no board awareness, find out at the end if you won or lost
SFT = watch videos of Magnus Carlsen playing and try to imitate him
On-policy distillation = have a coach grade each move as you make it
makes sense that it works
I think the "self-distillation via optimized prompts" idea (↓) is like "on-policy distillation", but using a prompt-optimized model as the reverse-KL teacher for its "basic system prompt" self. Iterate prompt optimization and self-distillation for gains? Idea:
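Purely as an illustrative sketch of how that could look (my own reading of the tweet, with hypothetical names and Hugging Face-style causal LM APIs assumed, not anything from the blog post): the "teacher" is the same model conditioned on an optimized system prompt, and a per-token reverse KL pulls the basic-system-prompt "student" view toward it.

```python
# Illustrative sketch only; `model`, `tokenizer`, and the prompts are placeholders.
import torch
import torch.nn.functional as F

def self_distillation_loss(model, tokenizer, basic_prompt, optimized_prompt, question,
                           max_new_tokens=128):
    # Student view: basic system prompt + question; sample a rollout on-policy.
    student_ids = tokenizer(basic_prompt + question, return_tensors="pt").input_ids
    with torch.no_grad():
        rollout = model.generate(student_ids, max_new_tokens=max_new_tokens, do_sample=True)
    completion = rollout[:, student_ids.shape[1]:]

    # "Teacher" view: the same completion re-scored under the optimized system prompt.
    teacher_prefix = tokenizer(optimized_prompt + question, return_tensors="pt").input_ids
    teacher_ids = torch.cat([teacher_prefix, completion], dim=-1)

    # Keep the logit positions that predict the completion tokens (shifted by one).
    s_logits = model(rollout).logits[:, student_ids.shape[1] - 1 : -1]
    with torch.no_grad():
        t_logits = model(teacher_ids).logits[:, teacher_prefix.shape[1] - 1 : -1]

    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)
    # Per-token reverse KL: pull the basic-prompt self toward its prompt-optimized self.
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1).mean()
```

Iterating would then alternate prompt optimization with this distillation step, so the behavior elicited by the optimized prompt gets folded back into the weights that run under the basic prompt.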
On-Policy Distillation with Reverse KL — sounds like Self-Forcing + DMD for language modeling 😀 Maybe a bidirectional (diffusion) teacher would make it even better?
Replying to @grok @miramurati
■ Growth (#GrowUp) 》 Expression lets one infer directions (Directions) and goals (Goals). In which direction, and toward which goals, is the growth that '@thinkymachines' pursues? @grok #SocraticMethod #SocraticMethodLucasCM #FutureForecastLucasCM
This is great!
Some great insights from the Thinking Machines on-policy distillation blog:
1. Unhackable: Unlike most reward models in practice, the reverse KL is "unhackable" in the sense that low KL always corresponds to a high probability of desirable behavior from the teacher model's point of view.
2. Mode seeking: Another useful property of reverse KL is that it is "mode seeking": it learns one specific behavior (the teacher's) instead of spreading its distribution across several suboptimal options.
3. Compute savings: Since it doesn't require a rollout to finish sampling to calculate the reward, we can use shorter or partial rollouts for training.
4. Single forward pass on the large model: Querying the teacher's log probabilities requires just a single forward pass of the larger model, while the trajectories are generated by the smaller, cheaper student.
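To make these points concrete, here is a minimal sketch of one on-policy distillation step, written from the description above rather than taken from the post: the student samples its own rollout, the teacher scores it with one gradient-free forward pass, and the exact per-token reverse KL(student || teacher) is minimized. It assumes Hugging Face-style causal LMs that share a tokenizer; `student_model`, `teacher_model`, and `prompt_ids` are hypothetical placeholders.

```python
# Illustrative sketch only, assuming Hugging Face-style causal LMs sharing one tokenizer.
import torch
import torch.nn.functional as F

def on_policy_distillation_loss(student_model, teacher_model, prompt_ids, max_new_tokens=256):
    # 1) The small, cheap student generates the rollout (on-policy sampling).
    with torch.no_grad():
        rollout = student_model.generate(
            prompt_ids, max_new_tokens=max_new_tokens, do_sample=True
        )

    # 2) One forward pass per model over the sampled sequence; only the teacher
    #    pass is gradient-free (point 4: a single forward pass of the large model).
    student_logits = student_model(rollout).logits
    with torch.no_grad():
        teacher_logits = teacher_model(rollout).logits

    # Keep the logit positions that predict the generated tokens (shifted by one).
    gen_start = prompt_ids.shape[1]
    s_logp = F.log_softmax(student_logits[:, gen_start - 1 : -1], dim=-1)
    t_logp = F.log_softmax(teacher_logits[:, gen_start - 1 : -1], dim=-1)

    # 3) Exact per-token reverse KL(student || teacher), summed over the vocabulary.
    #    The signal exists at every position, so partial rollouts work too (point 3).
    per_token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    return per_token_kl.mean()
```

As the list above notes, the reverse KL plays the role of a dense per-token reward; treating the exact KL as a directly differentiable loss, as in this sketch, is a simplification for illustration.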
This is a banger
This... "There could be advantages to combining distillation-based per-token rewards with sequence-level environment rewards; this is an interesting area for potential future research."
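As one naive illustration of what that combination could look like (my own sketch, not something proposed in the post): use the negative per-token reverse KL as the dense reward and add the sequence-level environment reward, broadcast to every token, as a second term.

```python
# Naive illustration (not from the post) of mixing a dense distillation reward with a
# sparse sequence-level environment reward:
#   advantage_t = -reverse_KL_t + lam * R_sequence   (R broadcast to every token)
import torch
import torch.nn.functional as F

def combined_per_token_advantage(student_logits, teacher_logits, sampled_tokens,
                                 seq_reward, lam=1.0):
    """student_logits / teacher_logits: [B, T, V], aligned so position t predicts
    sampled_tokens[:, t]; seq_reward: [B], e.g. 1.0 if the final answer was correct."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Sampled-token estimate of the per-token reverse KL (the dense, per-"move" signal).
    s_tok = torch.gather(s_logp, -1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    t_tok = torch.gather(t_logp, -1, sampled_tokens.unsqueeze(-1)).squeeze(-1)
    dense = -(s_tok - t_tok)                              # higher when the teacher approves

    sparse = seq_reward.unsqueeze(-1).expand_as(dense)    # same scalar at every token
    return dense + lam * sparse                           # plug into a policy-gradient update
```

The weighting `lam` and the plain broadcast are arbitrary choices here; how best to combine the two signals is exactly the open question the quoted sentence points at.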
Simple yet very interesting and apparently effective way of tuning the model's parameters more precisely and specifically. Also an impressive way of fixing the model's forgetting of RL-learned behavior through on-policy distillation.
Really amazing blog post, but why don't they cite related prior work such as (but not limited to): openreview.net/pdf?id=3zKtaq…