Some great insights from Thinking Machines' on-policy distillation blog post:
1. Unhackable: Unlike most reward models in practice, the reverse KL is “unhackable” in the sense that low KL always corresponds to a high probability of desirable behavior from the teacher model’s point of view.
2. Mode Seeking: Another useful property of reverse KL is that it is “mode seeking” — it learns one specific behavior (the teacher’s) instead of spreading its distribution across several suboptimal options.
3. Compute savings: Since the reward can be computed without waiting for a rollout to finish sampling, we can train on shorter or partial rollouts.
4. Only a single forward pass on the large model: Querying the teacher’s log probabilities requires just one forward pass of the larger model, while the trajectories themselves are generated by the smaller, cheaper student (see the sketch after this list).
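To make points 1–4 concrete, here is a minimal PyTorch sketch of the per-token reverse-KL computation, assuming Hugging Face-style causal LMs. The model names ("student-model", "teacher-model") and the prompt are placeholders, and backpropagating the KL directly is a simplification: the blog uses the per-token reverse KL as a dense reward inside an RL-style update.

```python
# Minimal sketch of one on-policy distillation step (not the authors' code).
# Assumes Hugging Face-style causal LMs; model names are hypothetical.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("student-model")              # placeholder
student = AutoModelForCausalLM.from_pretrained("student-model")   # placeholder
teacher = AutoModelForCausalLM.from_pretrained("teacher-model")   # larger model
teacher.eval()

prompt = tok("Solve: 12 * 7 =", return_tensors="pt").input_ids

# 1) The cheap student generates the rollout. Because the reward is per-token,
#    this can be a shorter or partial rollout (point 3).
rollout = student.generate(prompt, max_new_tokens=32, do_sample=True)
gen_len = rollout.shape[1] - prompt.shape[1]

def token_logprobs(model, full_ids, gen_len):
    """Log-probs the model assigns to the generated tokens, in one forward pass."""
    logits = model(full_ids).logits[:, :-1, :]      # position i predicts token i+1
    logps = F.log_softmax(logits, dim=-1)
    targets = full_ids[:, 1:]
    lp = logps.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return lp[:, -gen_len:]                         # keep only the generated part

student_lp = token_logprobs(student, rollout, gen_len)
with torch.no_grad():                               # single teacher pass (point 4)
    teacher_lp = token_logprobs(teacher, rollout, gen_len)

# Per-token reverse KL estimated on the student's own samples: it is low only
# when the teacher also assigns high probability to the student's tokens
# (point 1), and minimizing it pulls the student onto the teacher's mode
# rather than averaging over many behaviors (point 2).
reverse_kl = (student_lp - teacher_lp).mean()
reverse_kl.backward()                               # simplified distillation gradient
```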
Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When using it to train models for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other approaches for a fraction of the cost.
thinkingmachines.ai/blog/on-…