🚨 Your RL only improves pass@1, not pass@k? 🚨
That's not a bug: it's a feature of the objective you're optimizing.
You get what you optimize for. If you want better pass@k, you need to optimize for pass@k at training time.
🧵 How?
Apr 27, 2025 · 4:30 PM UTC
Policy-gradient variants like PPO and GRPO all optimize the same objective:
the correctness of each individual sample.
👉 This optimizes exactly the pass@1 metric at training time. Training with a pass@1 objective probably won't yield pass@k miracles. ✨
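To make the pass@1 credit assignment concrete, here is a minimal sketch; the function name and the plain group-mean baseline are illustrative (Dr. GRPO-style; GRPO additionally divides by the group's reward std, RLOO leaves the sample itself out of the mean):

```python
import numpy as np

def pass_at_1_advantages(rewards):
    """Per-sample advantages for a pass@1 objective (illustrative sketch).

    `rewards` holds the binary correctness of each individual sample drawn
    for the same prompt; every sample is credited on its own, against a
    simple group-mean baseline.
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()      # group baseline
    return rewards - baseline      # one advantage per trajectory

# 2 correct samples out of 4 for one prompt:
print(pass_at_1_advantages([1, 0, 1, 0]))  # [ 0.5 -0.5  0.5 -0.5]
```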
Optimizing pass@k is a different RL paradigm, in which the reward is not only a function of a single trajectory, but of a population of trajectories.
Given k samples, pass@k is 1 if at least one of them is correct, and 0 otherwise.
✨ We can define the reward exactly as the maximum of the individual rewards.
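As a quick sketch (the function name is mine, binary per-sample rewards assumed):

```python
import numpy as np

def pass_at_k_reward(rewards):
    """Group reward for k sampled trajectories: 1 if at least one is correct,
    i.e. the maximum of the individual binary rewards."""
    return float(np.max(rewards))

print(pass_at_k_reward([0, 0, 1, 0]))  # 1.0
print(pass_at_k_reward([0, 0, 0, 0]))  # 0.0
```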
What does the loss look like? We derive the corresponding advantage and contrast it with pass@1 objectives such as Dr. GRPO or RLOO.
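As a rough stand-in for that derivation, here is one simple leave-one-out way to credit samples under the max reward; this is an assumption for illustration, not necessarily the exact advantage from the preprint:

```python
import numpy as np

def pass_at_k_advantages(rewards):
    """Leave-one-out credit assignment for the pass@k (max) group reward.

    Illustrative sketch only: each sample is credited with how much the
    group reward (max over the group) would drop if that sample were removed.
    """
    rewards = np.asarray(rewards, dtype=float)
    group_reward = rewards.max()
    loo = np.array([np.delete(rewards, i).max() if len(rewards) > 1 else 0.0
                    for i in range(len(rewards))])
    return group_reward - loo

# Only a uniquely correct sample moves the group reward:
print(pass_at_k_advantages([1, 0, 0, 0]))  # [1. 0. 0. 0.]
print(pass_at_k_advantages([1, 1, 0, 0]))  # [0. 0. 0. 0.]
```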
We observe a clear trade-off between pass@1 and pass@k when training with different objectives:
If you train for pass@1, you get pass@1 increase on eval.
If you train for pass@k, you get pass@k increase on eval.
It's just that simple.
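For reference, eval pass@k is typically reported with the standard unbiased estimator from n ≥ k samples (Chen et al., 2021) rather than by drawing k samples once; a minimal version:

```python
from math import comb

def pass_at_k_estimate(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k_estimate(n=16, c=2, k=8), 3))  # 0.767
```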
For more details, have a look at our preprint from March:
arxiv.org/abs/2503.19595
Joint work with @robinphysics, @syhw, and Rémi Munos.