🚨 Your RL only improves pass@1, not pass@k? 🚨 That's not a bug, it's a feature of the objective you're optimizing. You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time. 🧵 How?
Policy gradient variants like PPO and GRPO all optimize the same objective: the correctness of each individual sample y. 👉 That is exactly the pass@1 metric at training time. Training with a pass@1 objective probably won't yield pass@k miracles. ✨
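A minimal sketch of that per-sample objective, assuming binary correctness rewards and a Dr. GRPO-style mean baseline (the function name is mine, not from the thread):

```python
import numpy as np

def pass_at_1_advantages(rewards: np.ndarray) -> np.ndarray:
    """Dr. GRPO-style advantage for one prompt: each rollout is scored
    on its own correctness, then centered by the group mean."""
    # rewards: shape (n,), 1.0 if the sample is correct, 0.0 otherwise
    return rewards - rewards.mean()

# Example: 4 rollouts for one prompt, only the last one is correct.
print(pass_at_1_advantages(np.array([0.0, 0.0, 0.0, 1.0])))
# -> [-0.25 -0.25 -0.25  0.75]
```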
It's a different RL paradigm, in which the reward is not a function of a single trajectory alone, but of a population of trajectories. Given k samples, pass@k is 1 if at least one of them is correct. ✨ We can define the reward exactly as the maximum of the individual rewards.
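Concretely, a sketch of that population-level reward under the thread's own definition (binary per-sample rewards assumed):

```python
import numpy as np

def pass_at_k_reward(rewards: np.ndarray) -> float:
    """Pass@k reward for a group of k samples: 1.0 if at least one
    sample is correct, i.e. the maximum of the individual rewards."""
    # rewards: shape (k,), binary correctness of each sample
    return float(rewards.max())

print(pass_at_k_reward(np.array([0.0, 0.0, 1.0, 0.0])))  # -> 1.0
print(pass_at_k_reward(np.array([0.0, 0.0, 0.0, 0.0])))  # -> 0.0
```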
What does the loss look like? We derive a formula for the advantage and contrast it with pass@1 objectives like Dr. GRPO or RLOO.
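The exact derivation from the thread isn't reproduced here, so the sketch below is only one plausible construction in the spirit of RLOO, not necessarily the authors' formula: partition the n rollouts into groups of k, score each group by its pass@k reward (the max), and baseline each group against the other groups, leave-one-out style.

```python
import numpy as np

def pass_at_k_advantages(rewards: np.ndarray, k: int) -> np.ndarray:
    """A possible pass@k analogue of RLOO for a single prompt.

    rewards: shape (n,), binary correctness of n = G * k rollouts.
    Each sample inherits its group's pass@k reward (the group max),
    and the baseline is the leave-one-out mean over the other groups.
    """
    n = rewards.shape[0]
    assert n % k == 0, "n must be a multiple of k"
    groups = rewards.reshape(-1, k)            # (G, k)
    group_r = groups.max(axis=1)               # pass@k reward per group
    G = group_r.shape[0]
    loo_baseline = (group_r.sum() - group_r) / (G - 1)
    adv_per_group = group_r - loo_baseline     # (G,)
    return np.repeat(adv_per_group, k)         # broadcast back to samples

rng = np.random.default_rng(0)
r = (rng.random(8) < 0.3).astype(float)        # 8 rollouts, k = 4 -> 2 groups
print(pass_at_k_advantages(r, k=4))
```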
We observe a ๐—ฐ๐—น๐—ฒ๐—ฎ๐—ฟ ๐˜๐—ฟ๐—ฎ๐—ฑ๐—ฒ-๐—ผ๐—ณ๐—ณ between pass@1 and pass@k if you train them using different objective: If you train for pass@1, you get pass@1 increase on eval. If you train for pass@k, you get pass@k increase on eval. It's just that simple.
It's after midnight, so I'll read it tomorrow, but thank you for putting this out there after that other paper I read earlier in the week on this topic. It seemed like pass@1 was essentially exactly what they had trained for, so of course they got the results they did.