Results for RL+LLM are mixed: gains show up where reasoning chains are short, the LLM already has a sense of what to do, and we already know the answer, which leaves us exhausting our finite pool of supervised data
Same for repeated sampling
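(Repeated sampling is usually scored with pass@k. A minimal sketch of the standard unbiased estimator popularized by the Codex paper, assuming n total samples per problem with c verified correct; the example numbers are illustrative only.)

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k slots, so at least one hit
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 100 samples per problem, 3 verified correct.
print(pass_at_k(n=100, c=3, k=1))    # ~0.03, single-sample accuracy
print(pass_at_k(n=100, c=3, k=32))   # ~0.69, repeated-sampling coverage
```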
Time for exploration bonuses
New paper alert: it unifies insights from Limit-of-RLVR and ProRL. Does current RLVR actually expand reasoning?
Turns out RLVR is mostly an efficient sampler over a shrinking reasoning boundary, and only very rarely an explorer over an expanding one.
Exploration is the holy grail for LLMs, and getting it may require going beyond 0/1 rewards.
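One way to go beyond a pure 0/1 verifier signal is to shape the reward with an exploration bonus. A minimal sketch, assuming a count-based novelty bonus over hashed solution traces; the bonus form and the `beta` weight are illustrative assumptions, not something taken from either paper.

```python
import hashlib
from collections import Counter

# Count how often each (hashed) solution trace has been sampled so far.
visit_counts = Counter()

def trace_key(solution: str) -> str:
    """Hash a sampled reasoning trace to a bucket for counting."""
    return hashlib.sha256(solution.encode()).hexdigest()[:16]

def shaped_reward(solution: str, is_correct: bool, beta: float = 0.1) -> float:
    """0/1 correctness plus a decaying novelty bonus for rarely seen traces."""
    key = trace_key(solution)
    visit_counts[key] += 1
    bonus = beta / (visit_counts[key] ** 0.5)   # count-based exploration bonus
    return float(is_correct) + bonus
```

The design choice here is that the bonus decays as a trace is revisited, so the policy keeps its verifiable-correctness signal but is also paid, a little, for producing solutions it has not produced before.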