🚨 Your RL only improves pass@1, not pass@k? 🚨 That's not a bug, it's a feature of the objective you're optimizing. You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time. 🧵 How?

Apr 27, 2025 · 4:30 PM UTC

Policy gradient variants like PPO and GRPO all optimize the same objective: the correctness of each individual sample y. 👉 That is exactly the pass@1 metric at training time. Training with a pass@1 objective probably won't yield pass@k miracles. ✨
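For concreteness, here is a minimal sketch (my own illustration, not the authors' code) of a per-sample, pass@1-style advantage with a group-relative baseline in the spirit of RLOO / Dr. GRPO:

```python
import numpy as np

def pass_at_1_advantages(rewards):
    """Per-sample advantages for the pass@1 objective (illustrative sketch).

    rewards: one scalar (e.g. 0/1 correctness) per sampled completion of the
    same prompt. Each sample is judged on its own, which is exactly the
    pass@1 metric. Uses a leave-one-out baseline (RLOO-style); Dr. GRPO
    would subtract the plain group mean instead.
    """
    r = np.asarray(rewards, dtype=float)
    G = len(r)
    baseline = (r.sum() - r) / (G - 1)  # mean reward of the *other* samples
    return r - baseline

# 4 rollouts for one prompt, only the 3rd is correct:
print(pass_at_1_advantages([0, 0, 1, 0]))  # positive only for the correct sample
```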
It's a different RL paradigm, in which the reward is not a function of a single trajectory but of a population of trajectories. Given k samples, pass@k is 1 if at least one of them is correct. ✨ We can define the group reward as exactly the maximum of the individual rewards.
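A tiny sketch of that group-level reward, assuming binary 0/1 correctness rewards (the helper name is hypothetical):

```python
import numpy as np

def pass_at_k_group_reward(rewards, k):
    """Reward defined over a *population* of trajectories (hypothetical helper).

    With k sampled completions and individual 0/1 rewards, the group reward
    is 1 iff at least one sample is correct, i.e. the max of the individual
    rewards.
    """
    return float(np.max(np.asarray(rewards[:k], dtype=float)))

print(pass_at_k_group_reward([0, 0, 1, 0], k=4))  # 1.0 -> at least one correct
print(pass_at_k_group_reward([0, 0, 0, 0], k=4))  # 0.0 -> all wrong
```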
What does the loss look like? We derive the following formula for the advantage and contrast it with pass@1 objectives like Dr. GRPO or RLOO.
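The formula itself isn't reproduced in this text. As a stand-in, here is one plausible leave-one-out construction of a pass@k advantage, built from the standard unbiased pass@k estimator (Chen et al., 2021); treat it as an illustrative sketch, not the derivation from the post:

```python
import numpy as np
from math import comb

def pass_at_k_estimate(num_correct, n, k):
    """Unbiased pass@k estimator from n samples with `num_correct` correct:
    1 - C(n - c, k) / C(n, k)  (Chen et al., 2021)."""
    if n - num_correct < k:
        return 1.0
    return 1.0 - comb(n - num_correct, k) / comb(n, k)

def pass_at_k_advantages(rewards, k):
    """Sketch of a leave-one-out advantage for the pass@k objective.

    Each sample's advantage is how much including it changes the pass@k
    estimate relative to the estimate from the other n-1 samples.
    Assumes binary rewards and n > k.
    """
    r = np.asarray(rewards, dtype=float)
    n, c = len(r), int(r.sum())
    full = pass_at_k_estimate(c, n, k)
    return np.array([full - pass_at_k_estimate(c - int(ri), n - 1, k) for ri in r])

# 8 rollouts, 2 correct, optimizing pass@4: correct samples get positive advantage
print(pass_at_k_advantages([1, 0, 0, 1, 0, 0, 0, 0], k=4))
```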
We observe a clear trade-off between pass@1 and pass@k when training with different objectives: train for pass@1 and pass@1 improves on eval; train for pass@k and pass@k improves on eval. It's just that simple.
Replying to @KunhaoZ
Awesome thread!!! Have you tried training with the pass@k objective and then training with the pass@1 objective on top of that? I'm curious whether that gets better pass@1 performance than just training for pass@1.
Nice suggestion! We haven't tried this ordering of switching objectives, but it's definitely a good experiment to run!
Replying to @KunhaoZ
What if you set the loss to the pass@1 objective + the pass@k objective? Will pass@1 and pass@k increase together?
Of course we can do this: it looks like the most obvious thing to try. But that's just one of many combinations we can play with, for example, logsumexp as a "softmax" to bridge the two objectives.
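As a rough sketch of that logsumexp idea (not a tested recipe from the thread): a temperature-controlled smooth max over the group rewards interpolates between the pass@1-style mean and the pass@k-style max.

```python
import numpy as np

def smooth_group_reward(rewards, tau):
    """Smooth-max group reward via logsumexp (illustrative sketch).

    tau -> 0   : approaches max(rewards)  (pass@k-like)
    tau -> inf : approaches mean(rewards) (pass@1-like)
    """
    r = np.asarray(rewards, dtype=float)
    return tau * (np.logaddexp.reduce(r / tau) - np.log(len(r)))

rewards = [0, 0, 1, 0]
for tau in (0.05, 0.5, 5.0):
    print(tau, round(smooth_group_reward(rewards, tau), 3))
```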
Replying to @KunhaoZ
Do people *want* pass@k for many (RL) tasks? AFAICT people mostly use pass@k as a proxy for "mode-collapsedness" or similar. I suspect the thing people want is "good performance while not sacrificing diversity", although sure, mixtures of pass@{1,k} seem like decent-enough proxies.
Replying to @KunhaoZ
Is anyone working to make this an inference-time tradeoff? I.e. a system parameter analogous to temperature but for "candidate solution space" rather than token space, where 0 is maximum pass@1?
Replying to @KunhaoZ
I guess the point would be that it's at odds with test time scaling? Of course, you get there through train time scaling, so to speak.
Replying to @KunhaoZ
Genius!
Replying to @KunhaoZ
Great insight!
Replying to @KunhaoZ
Since pass@k alignment is making sense (similar to human learning, where having a higher number of good-quality students in a class makes overall performance more robust), I wonder if ordered preference during training by difficulty tiers has been tried yet? I.e., start-of-epoch batches would contain easy tasks and progressively move toward difficult tasks.
Replying to @KunhaoZ
Pretty interesting! Sorry, I'm not in your domain, just wondering: what are pass@1 and pass@k?