Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch every mistake. We could prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a procedure that also trains the judge/critic dynamically during RL. The critic finds the single most likely mistake in a response, the generator fixes it, and the critic then updates itself to find new mistakes... this adversarial training procedure works remarkably well!

Nov 7, 2025 · 2:42 PM UTC

Why is this hard? In free-form tasks (long outputs, code, math proofs), an answer must satisfy many hidden checks. Checking all of them is expensive, so RL post-training either enumerates checks (accurate but slow) or uses a single learned reward score (cheap but easy to game).
RLAC takes a third approach. For each answer, a critic tries to pinpoint the most likely mistake as a check (what we call a rubric in the paper), and an external validator tests that check. If the check fails, the critic is rewarded. If it passes, the generator is rewarded.
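A minimal sketch of this per-answer reward assignment, assuming hypothetical `critic.propose_check` and `validator` interfaces (these names are mine, not the paper's):

```python
def rlac_rewards(x, y, critic, validator):
    """Illustrative zero-sum reward split for one prompt x and answer y."""
    # The critic names the single check it believes y is most likely to fail.
    check = critic.propose_check(x, y)         # e.g. "the stated birth year is wrong"
    # An external validator (test runner, fact checker, ...) evaluates only that check.
    passed = validator(x, y, check)            # True if y satisfies the check
    generator_reward = 1.0 if passed else 0.0  # generator is rewarded when its answer holds up
    critic_reward = 1.0 - generator_reward     # critic is rewarded for exposing a real mistake
    return generator_reward, critic_reward
```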
Formally, satisfying all rubrics is equivalent to a minimum over rubrics, which yields a min–max objective:
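One hedged way to write it (my notation, not copied from the paper): let v(x, y, r) ∈ {0, 1} be the validator's verdict on check r for prompt x and answer y. The generator π and critic c then play

    \max_{\pi} \min_{c} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ v\big(x, y, c(x, y)\big) \big]

where the inner minimization over the critic's proposed check stands in for \min_{r \in R(x, y)} v(x, y, r), i.e., the answer passes every rubric in R(x, y).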
Here’s how we train RLAC in practice. The generator produces answers, the critic proposes checks, the validator labels them, and we update both models. This cycle can work with any online or offline RL algorithm.
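As a rough sketch of one such cycle (function and method names here are illustrative assumptions, not the released code):

```python
def rlac_training_step(prompts, generator, critic, validator, rl_update):
    """One RLAC cycle: sample answers, propose checks, validate, update both models."""
    gen_batch, critic_batch = [], []
    for x in prompts:
        y = generator.sample(x)             # generator produces an answer
        check = critic.propose_check(x, y)  # critic pinpoints the most likely mistake
        passed = validator(x, y, check)     # external validator labels that single check
        gen_batch.append((x, y, 1.0 if passed else 0.0))               # generator reward
        critic_batch.append(((x, y), check, 0.0 if passed else 1.0))   # critic reward
    # Any online or offline RL algorithm can consume these (input, action, reward) triples.
    rl_update(generator, gen_batch)
    rl_update(critic, critic_batch)
```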
We evaluate RLAC on factual text generation (concise biographies). It produces more accurate outputs while using far fewer validator calls, up to around 5–6× fewer checks for longer biographies.
RLAC also works well on code generation tasks, consistently achieving better performance and requiring fewer testcase executions than both enumerative and reward-model approaches on most benchmarks.
Replying to @MerlinNoth79247
Really interesting direction
Replying to @MerlinNoth79247
Wow, I didn't think evaluating the critic's critique of the generator's response (its judgement of the response's accuracy) would be effective. Maybe it's recursive RL all the way: you need critique(critique(critique(...))) evaluations for hard-to-verify tasks!
Replying to @MerlinNoth79247
Awesome work Mian. Different approach and a new way to craft.
Replying to @MerlinNoth79247
Impressive approach to improving LLMs through RLAC! Exciting progress!
Replying to @MerlinNoth79247
Mian, that's a brilliant question, and RLAC definitely sounds like a promising approach to tackle the challenges you've raised, no?
Replying to @MerlinNoth79247
Interesting. Prompt tuning LLM judges sounds like a difficult task.