OPENBENCH 0.5.0 IS HERE It’s our biggest release yet - 350+ new evals, ARC-AGI support, a plugin system for external benchmarks, provider routing, mix-and-match coding harnesses, tool-calling evals, and more. Details in thread 🧵
2/ We partnered with the @arcprize Foundation to add ARC-AGI 1 & 2 to openbench. ARC-AGI is different from other benchmarks because it focuses on “fluid intelligence” - a model’s ability to reason through novel problems rather than recall general knowledge. Run it yourself with this command:
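The command itself wasn’t captured in this thread. A plausible invocation, assuming openbench’s `bench eval` CLI entry point; the benchmark id and model slug below are assumptions, so check openbench.dev/evals for the exact names:

```shell
# Hypothetical invocation - benchmark id and model slug are assumptions
pip install openbench
bench eval arc_agi --model openai/gpt-4o --limit 10
```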

Oct 10, 2025 · 6:34 PM UTC

3/ 350+ new benchmarks Big expansion: we added over 350 new benchmarks to OpenBench - BigBench (122), BBH (18), Global-MMLU (42 langs), plus GLUE/SuperGLUE, BLiMP, AGIEval, and Arabic Exams (41). Find the full list of available benchmarks here: openbench.dev/benchmarks/cat…
4/ Plugins Register your own benchmarks in openbench via Python entry points! You can even override built-ins cleanly - no forking required. Docs are here: openbench.dev/development/ar…
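As a sketch of what entry-point registration usually looks like: a plugin package declares an entry-point group in its pyproject.toml, and openbench discovers it at import time. The group name `openbench` and the module path below are assumptions; see the docs for the real ones:

```toml
# Hypothetical pyproject.toml snippet - group name and paths are assumptions
[project.entry-points.openbench]
my_benchmark = "my_pkg.benchmarks:my_benchmark"
```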
5/ A coding benchmark! Exercism evaluates AI code agents’ ability to solve real-world programming exercises across multiple languages. You can swap harnesses and models to find the best combo (aider, roo, claude code, opencode). (P.S. - we have some research on this coming soon 👀) How to run: openbench.dev/evals/exercism
6/ Tool use matters: We’ve added a tool‑calling benchmark to OpenBench via LiveMCPBench! You can now evaluate not just answers, but how reliably your model orchestrates tools end‑to‑end. Docs on how to get started: openbench.dev/evals/livemcpb…
7/ Better @OpenRouterAI support! Thanks to the good folks at OpenRouter, you can now specify provider routing for models (fallbacks, only, order, ignore, etc.). Docs incoming :)
8/ Various DevX upgrades!
9/ Thank you to the @arcprize Foundation, @GregKamradt, and the broader community for the partnerships, datasets, and contributions - and to everyone shipping evals every day. We hope 0.5.0 helps you measure model quality with clarity (and speed). Full release notes: openbench.dev/release-notes