🚨 Your RL only improves 𝗽𝗮𝘀𝘀@𝟭, not 𝗽𝗮𝘀𝘀@𝗸? 🚨 That’s not a bug, it’s a 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 you’re optimizing. You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time. 🧵 How?
Replying to @KunhaoZ
Since optimizing for pass@k makes sense (similar to human learning, where having more good-quality students in a class makes overall performance more robust), I wonder whether ordering the training data by difficulty tier has been tried yet? That is, batches at the start of an epoch would contain easy tasks, and later batches would progress toward harder ones.

Apr 28, 2025 · 5:43 AM UTC
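
For anyone who wants to try the easy-to-hard ordering from the reply, here is a minimal sketch. It assumes each task already carries a numeric difficulty score; the `Task` class, its `difficulty` field, and `curriculum_batches` are hypothetical names for illustration and are not taken from the thread or from any particular RL library.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class Task:
    name: str          # hypothetical task identifier
    difficulty: float  # hypothetical score; higher means harder


def curriculum_batches(tasks: Sequence[Task], batch_size: int) -> Iterator[List[Task]]:
    """Yield the batches for one epoch, easiest tasks first."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # sorted() is stable, so tasks with equal difficulty keep their input order.
    ordered = sorted(tasks, key=lambda t: t.difficulty)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Toy data with made-up difficulty scores.
tasks = [Task("c", 0.9), Task("a", 0.1), Task("d", 0.5), Task("b", 0.3), Task("e", 0.7)]
batches = [[t.name for t in batch] for batch in curriculum_batches(tasks, batch_size=2)]
print(batches)
```

How the difficulty score is obtained is left open here; the sketch only shows the ordering step, and whether such a schedule actually helps pass@k is exactly the open question the reply raises.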