🚨 Your RL only improves pass@1, not pass@k? 🚨 That's not a bug — it's a feature of the objective you're optimizing. You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time. 🧵 How?
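For context, pass@k is the probability that at least one of k sampled solutions is correct. It is usually measured with the standard unbiased estimator from the Codex paper (Chen et al., 2021): draw n ≥ k samples, count c correct, and compute 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n samples with c correct,
    estimate the probability that at least one of k samples is correct.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 1 is correct, pass@1 is estimated at 0.1, while pass@10 is 1.0.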
Replying to @KunhaoZ
What if you set the loss to a pass@1 objective + a pass@k objective — will pass@1 and pass@k increase together?

Apr 28, 2025 · 6:13 AM UTC

Replying to @pnynx3
Of course we can do this: it looks like the most obvious thing to try. But that's only one of many combinations we can play with; for example, using logsumexp as the "soft max" to bridge the two objectives.
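The logsumexp idea can be sketched concretely. A group-level pass@k-style reward is the max reward over k samples; a temperature-scaled logsumexp is a smooth surrogate for that max, and it can be mixed with the mean (pass@1-style) reward. This is an illustrative sketch, not the thread authors' actual training objective — the weighting `alpha` and temperature `tau` are hypothetical knobs:

```python
import math

def soft_max_reward(rewards: list[float], tau: float = 0.1) -> float:
    """Smooth surrogate for max(rewards):
    tau * log(sum(exp(r / tau))) -> max(rewards) as tau -> 0.
    Shifted by the max for numerical stability."""
    m = max(rewards)
    return m + tau * math.log(sum(math.exp((r - m) / tau) for r in rewards))

def combined_reward(rewards: list[float], alpha: float = 0.5,
                    tau: float = 0.1) -> float:
    """Hypothetical blend: alpha weights the mean (pass@1-like) term,
    (1 - alpha) weights the soft-max (pass@k-like) term."""
    pass1_term = sum(rewards) / len(rewards)
    passk_term = soft_max_reward(rewards, tau)
    return alpha * pass1_term + (1 - alpha) * passk_term
```

At small `tau` the soft-max term is dominated by the best sample in the group, so gradient credit flows toward improving the best-of-k outcome rather than every sample, which is the intuition behind optimizing pass@k directly.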