🚨 Your RL only improves pass@1, not pass@k? 🚨
That's not a bug; it's a feature of the objective you're optimizing.
You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time.
🧵 How?
Policy gradient variants like PPO and GRPO all optimize the same objective:
The correctness of each individual sample m.
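Written out (my notation, not from the thread: x is a prompt, m a sampled completion, R a 0/1 correctness reward), that per-sample objective is roughly:

```latex
% Per-sample (pass@1-style) objective: each sampled completion m
% is scored on its own correctness.
J_{\mathrm{pass@1}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\,
    \mathbb{E}_{m \sim \pi_\theta(\cdot \mid x)}
    \big[ R(x, m) \big]
```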
👉 This optimizes exactly for the pass@1 metric at training time. Training with a pass@1 objective probably won't yield pass@k miracles. ✨
It's a different RL paradigm, in which the reward is not a function of a single trajectory but of a population of trajectories.
Given k samples, pass@k is 1 if at least one of them is correct.
✨ We can define the reward exactly as the maximum of the individual rewards.
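In the same (assumed) notation, the population-level objective samples k completions per prompt and rewards the whole group with their best score:

```latex
% Population-level (pass@k-style) objective: the max over the group's
% individual rewards is 1 exactly when at least one sample is correct.
J_{\mathrm{pass@k}}(\theta)
  = \mathbb{E}_{x \sim \mathcal{D}}\,
    \mathbb{E}_{m_1, \dots, m_k \sim \pi_\theta(\cdot \mid x)}
    \Big[ \max_{1 \le i \le k} R(x, m_i) \Big]
```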
We observe a clear trade-off between pass@1 and pass@k when you train with different objectives:
If you train for pass@1, you get a pass@1 increase on eval.
If you train for pass@k, you get a pass@k increase on eval.
It's just that simple.
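To make the two reward definitions above concrete, here is a tiny numerical sketch with made-up 0/1 correctness scores (not data from the paper):

```python
import numpy as np

# Hypothetical 0/1 correctness scores for k = 4 samples drawn for each of 2 prompts.
individual = np.array([[0., 0., 1., 0.],   # prompt 1: one of the 4 samples is correct
                       [0., 0., 0., 0.]])  # prompt 2: all 4 samples are wrong

# pass@1-style reward: every sample is scored on its own; averaging gives the
# per-sample correctness the standard objective pushes up.
pass1 = individual.mean(axis=-1)   # [0.25, 0.0]

# pass@k-style reward: the group is scored by the max of its individual rewards,
# i.e. 1 if at least one of the k samples is correct.
passk = individual.max(axis=-1)    # [1.0, 0.0]

print(pass1, passk)
```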
For more details, have a look at our preprint from March:
arxiv.org/abs/2503.19595
Joint work with @robinphysics @syhw and Rémi Munos