Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when tasks are verifiable, it is often impossible to check every design detail or catch every mistake. We could prompt-tune LLM judges, but is that really the answer? Our new paper introduces RLAC: a procedure that also trains the judge/critic dynamically during RL. The critic finds the single most likely mistake in a response, the generator fixes it, and the critic then updates itself to find new mistakes... this adversarial training procedure works remarkably well!

Nov 7, 2025 · 2:42 PM UTC

Why is this hard? In free-form tasks (long outputs, code, math proofs), an answer must satisfy many hidden checks. Checking all of them is expensive, so RL post-training either enumerates checks (accurate but slow) or uses a single learned reward score (cheap but easy to game).
RLAC takes a third approach. For each answer, a critic tries to pinpoint the most likely mistake as a check (what we call a rubric in the paper), and an external validator tests that check. If the check fails, the critic is rewarded. If it passes, the generator is rewarded.
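A minimal sketch of this per-answer reward assignment, assuming hypothetical `critic.propose_check` and `validator` interfaces (these names are mine, not the paper's):

```python
def rlac_rewards(x, y, critic, validator):
    """Illustrative zero-sum reward split for one prompt x and answer y."""
    # The critic names the single check it believes y is most likely to fail.
    check = critic.propose_check(x, y)         # e.g. "the stated birth year is wrong"
    # An external validator (test runner, fact checker, ...) evaluates only that check.
    passed = validator(x, y, check)            # True if y satisfies the check
    generator_reward = 1.0 if passed else 0.0  # generator is rewarded when its answer holds up
    critic_reward = 1.0 - generator_reward     # critic is rewarded for exposing a real mistake
    return generator_reward, critic_reward
```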
Formally, satisfying all rubrics is equivalent to a minimum over rubrics, which yields a min–max objective:
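One hedged way to write it (my notation, not copied from the paper): let v(x, y, r) ∈ {0, 1} be the validator's verdict on check r for prompt x and answer y. The generator π and critic c then play

    \max_{\pi} \min_{c} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ v\big(x, y, c(x, y)\big) \big]

where the inner minimization over the critic's proposed check stands in for \min_{r \in R(x, y)} v(x, y, r), i.e., the answer passes every rubric in R(x, y).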
Here’s how we train RLAC in practice. The generator produces answers, the critic proposes checks, the validator labels them, and we update both models. This cycle can work with any online or offline RL algorithm.
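As a rough sketch of one such cycle (function and method names here are illustrative assumptions, not the released code):

```python
def rlac_training_step(prompts, generator, critic, validator, rl_update):
    """One RLAC cycle: sample answers, propose checks, validate, update both models."""
    gen_batch, critic_batch = [], []
    for x in prompts:
        y = generator.sample(x)             # generator produces an answer
        check = critic.propose_check(x, y)  # critic pinpoints the most likely mistake
        passed = validator(x, y, check)     # external validator labels that single check
        gen_batch.append((x, y, 1.0 if passed else 0.0))               # generator reward
        critic_batch.append(((x, y), check, 0.0 if passed else 1.0))   # critic reward
    # Any online or offline RL algorithm can consume these (input, action, reward) triples.
    rl_update(generator, gen_batch)
    rl_update(critic, critic_batch)
```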
We evaluate RLAC on factual text generation (concise biographies). It produces more accurate outputs while using far fewer validator calls, up to around 5–6× fewer checks for longer biographies.
RLAC also works well on code generation tasks, consistently achieving better performance and requiring fewer testcase executions than both enumerative and reward-model approaches on most benchmarks.
Replying to @MerlinNoth79247
Really interesting direction
Replying to @MerlinNoth79247
Wow, I didn't think evaluating the critic's critique of the generator's response (its judgement of the response's accuracy) would be effective. Maybe it's recursive RL all the way: you need critique(critique(critique(...))) evaluations for hard-to-verify tasks!
Replying to @MerlinNoth79247
Awesome work Mian. Different approach and a new way to craft.
Replying to @MerlinNoth79247
Impressive approach to improving LLMs through RLAC! Exciting progress!
Replying to @MerlinNoth79247
Mian, that's a brilliant question, and RLAC definitely sounds like a promising approach to tackle the challenges you've raised, no?
Replying to @MerlinNoth79247
Interesting. Prompt tuning LLM judges sounds like a difficult task.