🚨 Your RL only improves 𝗽𝗮𝘀𝘀@𝟭, not 𝗽𝗮𝘀𝘀@𝗸? 🚨
That's not a bug: it's a 𝗳𝗲𝗮𝘁𝘂𝗿𝗲 𝗼𝗳 𝘁𝗵𝗲 𝗼𝗯𝗷𝗲𝗰𝘁𝗶𝘃𝗲 you're optimizing.
You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time.
🧵 How?
It's a different RL paradigm, in which the reward is no longer a function of a single trajectory, but of a population of trajectories.
Given k samples, pass@k is 1 if at least one of them is correct.
✨ We can define the reward exactly as the maximum of the individual rewards.
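For concreteness, here is a minimal sketch of that population-level reward. The function name and NumPy scaffolding are my own; only the max-aggregation over the k samples comes from the thread.

```python
import numpy as np

def passk_group_reward(rewards: np.ndarray) -> float:
    """Population-level reward for one group of k sampled trajectories.

    rewards: shape (k,), individual 0/1 correctness rewards.
    The group reward is the max, i.e. 1 iff at least one sample is
    correct -- exactly the pass@k criterion.
    """
    return float(np.max(rewards))

# Example: 4 samples, only the third one is correct -> pass@4 reward is 1.
print(passk_group_reward(np.array([0.0, 0.0, 1.0, 0.0])))  # 1.0
```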
What does the loss look like? We derive a formula for the advantage (see the preprint) and contrast it with pass@1 objectives such as Dr. GRPO or RLOO.
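The derived advantage formula is in the preprint (it was shared as an image in the thread). As a rough illustration of the contrast only, here is one way to wire the max-reward into an RLOO-style leave-one-out baseline at the group level instead of the sample level. This grouping scheme and the function names are my own assumptions, not the paper's derivation.

```python
import numpy as np

def rloo_pass1_advantages(rewards: np.ndarray) -> np.ndarray:
    """Pass@1-style advantage (RLOO / Dr. GRPO flavour):
    each sample is baselined against the mean reward of the other samples."""
    n = rewards.size
    loo_mean = (rewards.sum() - rewards) / (n - 1)
    return rewards - loo_mean

def rloo_passk_advantages(rewards: np.ndarray, k: int) -> np.ndarray:
    """Sketch of a pass@k-style advantage: samples are split into groups
    of k, each group gets reward = max over its members (the pass@k
    reward), and the leave-one-out baseline is taken across groups.
    Every member of a group shares that group's advantage."""
    groups = rewards.reshape(-1, k)        # (n/k, k), n must be divisible by k
    group_rewards = groups.max(axis=1)     # pass@k reward per group
    g = group_rewards.size
    loo_mean = (group_rewards.sum() - group_rewards) / (g - 1)
    group_adv = group_rewards - loo_mean
    return np.repeat(group_adv, k)         # broadcast back to the samples

r = np.array([0., 1., 0., 0., 0., 0., 0., 0.])  # 8 samples for one prompt
print(rloo_pass1_advantages(r))     # only the correct sample gets credit
print(rloo_passk_advantages(r, 4))  # the whole successful group gets credit
```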
We observe a 𝗰𝗹𝗲𝗮𝗿 𝘁𝗿𝗮𝗱𝗲-𝗼𝗳𝗳 between pass@1 and pass@k when you train with the two different objectives:
If you train for pass@1, you get pass@1 increase on eval.
If you train for pass@k, you get pass@k increase on eval.
It's just that simple.
For more details, have a look at our preprint from March:
arxiv.org/abs/2503.19595
Joint work with @robinphysics @syhw and 𝗥𝗲𝗺𝗶 𝗠𝘂𝗻𝗼𝘀