This paper introduces a realistic benchmark to test whether AI agents can truly perform end-to-end LLM research.
Agents need roughly 6.5x more runtime here than on older benchmarks, a sign that the tasks are substantially harder.
InnovatorBench packs 20 tasks across 6 areas, covering data work, loss design, reward design, and scaffolds.
Each task requires the agent to write code, train models or run inference, and then submit results that are scored for correctness and quality.
ResearchGym is the accompanying workbench: it lets agents use multiple machines, background jobs, file tools, web tools, and environment snapshots.
An external scoring server checks both outputs and formats, so reward hacking and formatting tricks do not help.
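To make the scoring idea concrete, here is a minimal sketch of the kind of format check an external scorer can run before any quality metric is computed; the field names and file layout are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch of a pre-scoring format check; the real InnovatorBench
# scoring server and its submission schema are not described in this summary,
# so the file name and required fields below are illustrative only.
import json
from pathlib import Path


def validate_submission(path: str, required_fields=("task_id", "predictions")) -> bool:
    """Reject malformed submissions before any quality scoring happens."""
    file = Path(path)
    if not file.exists():
        return False
    try:
        payload = json.loads(file.read_text())
    except json.JSONDecodeError:
        return False
    # Formatting tricks fail here: every required field must be present.
    return all(field in payload for field in required_fields)


if __name__ == "__main__":
    print(validate_submission("submission.json"))
```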
The authors run a simple ReAct-style agent with several strong models to set baselines across the domains.
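For readers unfamiliar with the pattern, a ReAct-style agent alternates model reasoning with tool calls and observations. Below is a minimal sketch of that loop, with hypothetical tool and model-call names (`call_llm`, `run_shell`, `read_file`) standing in for the paper's actual harness.

```python
# Minimal ReAct-style loop, sketched under assumptions: the tool registry and
# call_llm are hypothetical stand-ins, not the paper's agent implementation.
import json

TOOLS = {
    "run_shell": lambda cmd: f"(stdout of `{cmd}` would appear here)",
    "read_file": lambda path: f"(contents of {path} would appear here)",
}


def call_llm(messages):
    """Placeholder for a chat-model call; returns a canned 'finish' action so
    the sketch runs end to end without a real model."""
    return json.dumps({"tool": "finish", "argument": "demo answer"})


def react_episode(task_prompt: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        reply = call_llm(messages)          # model reasons, then picks an action
        action = json.loads(reply)
        if action["tool"] == "finish":      # agent decides it is done
            return action["argument"]
        observation = TOOLS[action["tool"]](action["argument"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "max steps reached"


if __name__ == "__main__":
    print(react_episode("Reproduce the baseline and report accuracy."))
```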
Models do better on data tasks that tolerate a little noise, but they struggle with loss and reward design.
Common failure modes include stopping long training runs early, colliding on GPUs, picking slow libraries, and reusing canned reasoning.
The benchmark thus exposes the gap between coding skill and running a full research workflow end to end.
----
Paper – arxiv.org/abs/2510.27598
Paper Title: "InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research"