🚨 Your RL only improves pass@1, not pass@k? 🚨
That's not a bug; it's a feature of the objective you're optimizing.
You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time.
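To make the gap concrete, here is a minimal sketch, assuming each problem has a fixed per-sample success probability and that samples are independent. The two "policies" and their success rates are made up for illustration, and the function names are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def mean_pass_at_k(probs, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(p, k) for p in probs) / len(probs)


# Hypothetical per-problem success rates, invented for this example.
peaked = [1.0, 0.0]  # always solves one problem, never the other
spread = [0.5, 0.5]  # solves each problem half the time

for k in (1, 4):
    print(k, mean_pass_at_k(peaked, k), mean_pass_at_k(spread, k))
```

Both toy policies have the same mean pass@1 (0.5), but at k=4 the spread-out one reaches 0.9375 while the peaked one stays at 0.5. An objective that only sees pass@1 cannot tell them apart.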
🧵 How?
Optimizing for pass@k makes sense (much like human learning, where having more strong students in a class makes overall performance more robust). Has anyone tried ordering the training data by difficulty tier?
That is, batches at the start of an epoch contain easy tasks, and later batches get progressively harder.
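A rough sketch of what that ordering could look like, assuming every task already comes with some difficulty score. The score itself is hypothetical here (one option might be 1 minus the base model's pass rate on the task), and `curriculum_batches` is an illustrative helper, not an API from any library.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_batches(
    tasks: Sequence[T],
    difficulty: Callable[[T], float],
    batch_size: int,
) -> Iterator[List[T]]:
    """Yield batches for one epoch, easiest tasks first."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # sorted() is stable, so tasks with equal difficulty keep their order.
    ordered = sorted(tasks, key=difficulty)
    for start in range(0, len(ordered), batch_size):
        yield ordered[start:start + batch_size]


# Hypothetical (name, difficulty) pairs, invented for this example.
tasks = [("c", 0.9), ("a", 0.1), ("b", 0.5), ("d", 0.7)]
batches = list(curriculum_batches(tasks, lambda t: t[1], batch_size=2))
print(batches)
```

This uses a strict easy-to-hard sort; shuffling within each difficulty tier would be a softer variant of the same idea.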
Apr 28, 2025 · 5:43 AM UTC


