Can we run RL to train LLMs on hard-to-verify or open-ended tasks? Even when a task is verifiable, it is often impossible to check every design detail or catch every mistake. We could prompt-tune LLM judges, but is that really the answer?
Our new paper introduces RLAC: a procedure that trains the judge/critic dynamically alongside the generator during RL. The critic finds the single most likely mistake in the response, the generator fixes it, and the critic then updates itself to find new mistakes... this adversarial training procedure works really well!
Why is this hard? In free-form tasks (long outputs, code, math proofs), an answer may satisfy many hidden checks.
Checking all of them is expensive, so RL post-training either enumerates checks (accurate but slow) or uses one learned reward score (cheap but easy to game).
RLAC takes a third approach. For each answer, a critic tries to pinpoint the most likely mistake as a check (what we call a rubric in the paper), and an external validator tests that check.
If the check fails, the critic is rewarded. If it passes, the generator is rewarded.
Formally, satisfying all rubrics is equivalent to a minimum over rubrics, which yields a min–max objective:
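Roughly, with a generator π producing answer y for prompt x, a set of checks C(x), and a binary validator v(x, y, c) (shorthand notation here, not copied verbatim from the paper), the objective looks like:

\max_{\pi} \; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)} \Big[ \min_{c \in \mathcal{C}(x)} v(x, y, c) \Big]

The critic plays the inner min: instead of enumerating all of C(x), it learns to propose the single check most likely to fail.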
Here’s how we train RLAC in practice.
The generator produces answers, the critic proposes checks, the validator labels them, and we update both models.
This cycle can work with any online or offline RL algorithm.
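To make the loop concrete, here is a rough Python-style sketch of one training cycle (object names and method signatures are illustrative placeholders, not the actual implementation):

    def rlac_step(prompts, generator, critic, validator):
        # One RLAC cycle: sample answers, attack them with the critic,
        # verify only the proposed check, then update both models.
        gen_batch, critic_batch = [], []
        for x in prompts:
            y = generator.sample(x)            # generator produces an answer
            c = critic.sample(x, y)            # critic proposes its most likely failing check (rubric)
            passed = validator.check(x, y, c)  # external validator tests just that one check
            # Zero-sum rewards: the generator is rewarded if the check passes,
            # the critic is rewarded if it exposed a real mistake.
            gen_batch.append((x, y, 1.0 if passed else 0.0))
            critic_batch.append((x, y, c, 0.0 if passed else 1.0))
        # Any standard online or offline RL algorithm can consume these batches.
        generator.rl_update(gen_batch)
        critic.rl_update(critic_batch)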
We evaluate RLAC on factual text generation (concise biographies).
It produces more accurate outputs while using far fewer validator calls, up to around 5–6× fewer checks for longer biographies.
RLAC also works well on code generation tasks, outperforming both enumerative and reward-model approaches on most benchmarks while requiring fewer test-case executions.
Check out our paper for more results and discussions!
Thanks to my amazing advisors @aviral_kumar2 @svlevine @sewon__min for all their guidance and support!
Website: mianwu01.github.io/RLAC_webs…
Paper: arxiv.org/abs/2511.01758