ARC Prize · May 27, 2025 · 4:55 PM UTC

ARC Prize

@arcprize

May 27

Claude Sonnet 4 on ARC-AGI Semi Private Eval

Base
* ARC-AGI-1: 23%, $0.08/task
* ARC-AGI-2: 1.2%, $0.12/task

Thinking 16K
* ARC-AGI-1: 40%, $0.36/task
* ARC-AGI-2: 5.9%, $0.48/task

Sonnet 4 sets new SOTA (5.9%) on ARC-AGI-2
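
For readers weighing the extra cost of the Thinking 16K setting against its score gain, here is a rough Python sketch that uses only the four figures quoted in the post. The `RESULTS` layout and the `compare` helper are illustrative assumptions, not part of the ARC Prize benchmarking code.

```python
# Scores (%) and cost ($/task) copied from the post above.
# The dictionary layout and helper name are hypothetical, for illustration only.
RESULTS = {
    "ARC-AGI-1": {"base": (23.0, 0.08), "thinking_16k": (40.0, 0.36)},
    "ARC-AGI-2": {"base": (1.2, 0.12), "thinking_16k": (5.9, 0.48)},
}


def compare(base, thinking):
    """Return (score gain in points, cost multiple, extra $ per point gained)."""
    base_score, base_cost = base
    think_score, think_cost = thinking
    gain = think_score - base_score
    cost_ratio = think_cost / base_cost
    extra_cost_per_point = (think_cost - base_cost) / gain
    return round(gain, 1), round(cost_ratio, 2), round(extra_cost_per_point, 4)


for name, runs in RESULTS.items():
    gain, ratio, per_point = compare(runs["base"], runs["thinking_16k"])
    print(f"{name}: +{gain} points at {ratio}x the cost "
          f"(${per_point} extra per task for each point gained)")
```

This only restates the posted numbers as ratios; it says nothing about how other models on the leaderboard compare.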

ARC Prize · May 27, 2025 · 4:55 PM UTC

ARC Prize

@arcprize

May 27

View the leaderboard: arcprize.org/leaderboard
Reproduce results: github.com/arcprize/arc-agi-…

GitHub - arcprize/arc-agi-benchmarking: Testing baseline LLMs performance across various models

ARC Prize · May 27, 2025 · 4:55 PM UTC

ARC Prize

@arcprize

May 27

Breakdown of @AnthropicAI models across datasets and models.
Note: Y-axis is set to 50%.

Chris · May 27, 2025 · 5:17 PM UTC

Chris

@chatgpt21

May 27

Replying to @arcprize

That’s actually really good

Jonathan Rose Dunlap · May 27, 2025 · 5:12 PM UTC

Jonathan Rose Dunlap

@JonathanRoseD

May 27

Replying to @arcprize

That's pretty notable. What about Opus 4 then?

Alexia Jolicoeur-Martineau · May 27, 2025 · 9:05 PM UTC

Alexia Jolicoeur-Martineau @jm_alexia

May 27

Replying to @arcprize

Can we have a separate benchmark for models that process examples by text vs. vision? It should be solved with vision.

Lee Penkman · May 27, 2025 · 9:00 PM UTC

Lee Penkman

@LeeLeepenkman

May 27

Replying to @arcprize

Nice. I wonder where Opus/Gemini is? Ah, it's green here, but that's Gemini 2.5 Flash. I wonder where Gemini 2.5 non-Flash is.

Emily · May 27, 2025 · 6:26 PM UTC

Emily

@IamEmily2050

May 27

Replying to @arcprize

Yes, Sonnet thinking is really good 👌

Lee Penkman · May 27, 2025 · 9:03 PM UTC

Lee Penkman

@LeeLeepenkman

May 27

Replying to @arcprize

Looks like the triangles are the ARC-AGI-2 test, so Sonnet 4 really has breakout performance there.

Lee Penkman · May 27, 2025 · 9:02 PM UTC

Lee Penkman

@LeeLeepenkman

May 27

Replying to @arcprize

Looks like OpenAI clearly did some really serious pruning to get o3 medium into prod, where they spent lots of inference-time compute on the previous o3.

Mihai ᛋ · May 28, 2025 · 6:41 PM UTC

Mihai ᛋ @mihai673

May 28

Replying to @arcprize

wen Opus

Douglas Schonholtz · May 27, 2025 · 10:22 PM UTC

Douglas Schonholtz

@Douglas_Schon

May 27

Replying to @arcprize

Can you guys make your charts readable? Understanding the chart shouldn’t be an ARC-AGI challenge. It's not clear which is challenge 1 or 2, especially since on mobile you can’t zoom in or out on the chart to see the legend.

Roshan · May 27, 2025 · 6:31 PM UTC

Roshan

@meta_x_ai

May 27

Replying to @arcprize

Where is Gemini Pro 2.5?

Shawn · May 27, 2025 · 5:23 PM UTC

Shawn

@Shawnryan96

May 27

Replying to @arcprize

This to me is proof we need more than RL and LC for adaptability. We need real continuous learning and layered memory.

carpe noctem · May 27, 2025 · 6:11 PM UTC

carpe noctem @vaqueh

May 27

Replying to @arcprize

Where is Gemini 2.5 Pro?

ElonBald · May 27, 2025 · 9:06 PM UTC

ElonBald

@ElonBaldMusk

May 27

Replying to @arcprize

The chart is starting to look like a mess. Maybe attach two images, one for ARC1 and another for ARC2?

Alt infiniti · May 27, 2025 · 6:35 PM UTC

Alt infiniti

@AltInfinit775

May 27

Replying to @arcprize

When Gemini 2.5 Pro?

Permaximum Judicium · May 27, 2025 · 9:43 PM UTC

Permaximum Judicium

@permaximum88

May 27

Replying to @arcprize

5.9% on ARC-AGI-2 is no small feat. Curious about Opus.

Pucci · May 27, 2025 · 5:59 PM UTC

Pucci

@PucciBets

May 27

Replying to @arcprize

that is pretty good actually

Philippe Tremblay · May 27, 2025 · 6:17 PM UTC

Philippe Tremblay

@philtrem22

May 27

Replying to @arcprize

Matthaios Markatis · May 28, 2025 · 8:49 AM UTC

Matthaios Markatis @MattMarkatis

May 28

Replying to @arcprize

I give it another 8-9 months until a model is in the 80%-ish range on ARC-AGI-2.

L.E.D.P. · May 28, 2025 · 6:07 PM UTC

L.E.D.P. @PT4n1

May 28

Replying to @arcprize @MLStreetTalk

o3 (low) stronger on ARC AGI1 though - how come not better? 🤔

NoAdVance · May 29, 2025 · 10:41 AM UTC

NoAdVance @T3RR0RC01N

May 29

Replying to @arcprize

Looking forward to the newly updated Deepseek-R1's performance on ARC.

Hillary "AI_Pioneer" Mason · May 28, 2025 · 7:28 AM UTC

Hillary "AI_Pioneer" Mason @AiHub_Pioneer

May 28

Replying to @arcprize

The subtle improvement with ARC-AGI-2 might suggest potential for efficiency-based advancements.

Roy CypherMind · May 28, 2025 · 7:12 AM UTC

Roy CypherMind @CypherM1nd

May 28

Replying to @arcprize

The significant improvement on Thought vs. Eval metrics is interesting to observe.

Hunter Santoro · May 28, 2025 · 7:07 AM UTC

Hunter Santoro @hunter_5o

May 28

Replying to @arcprize

What makes ARC-AGI evaluations significant?

Cloud Pioneer · May 28, 2025 · 10:01 AM UTC

Cloud Pioneer @cloud_p1oneer

May 28

Replying to @arcprize

The testing efficiency difference is significant here.

MetaSage_Lady · May 28, 2025 · 8:59 AM UTC

MetaSage_Lady @MetaSage_Lady

May 28

Replying to @arcprize

How does cost compare with performance improvement?

Ping Paul · May 28, 2025 · 6:34 AM UTC

Ping Paul @pay_pau1

May 28

Replying to @arcprize

Evaluative results contrast in pricing efficiency.

BoTie · May 27, 2025 · 6:18 PM UTC

BoTie @BoTie137

May 27

Replying to @arcprize

When opus?

Carmen Tech · May 28, 2025 · 8:09 AM UTC

Carmen Tech @CarmTech_

May 28

Replying to @arcprize

A marginal yet noteworthy improvement.

Algo Ace · May 28, 2025 · 6:28 AM UTC

Algo Ace @alg0Ace

May 28

Replying to @arcprize

How does this affect ARC-AGI-1 performance?

Ant A · May 27, 2025 · 5:50 PM UTC

Ant A

@AntDX316

May 27

Replying to @arcprize

👍

Shane Synth · May 28, 2025 · 6:50 AM UTC

Shane Synth @5haneSynth

May 28

Replying to @arcprize

A substantial leap in performance efficiency observed.

Daniel Wolf · May 28, 2025 · 7:07 AM UTC

Daniel Wolf @DanielWolf18

May 28

Replying to @arcprize

But on ARC-AGI-1 it is not even on the Pareto frontier: it gets "eaten" by o4-mini (medium).

Nexus Bettina · May 28, 2025 · 6:42 AM UTC

Nexus Bettina @nexusBettina

May 28

Replying to @arcprize

The ongoing cost increase is notable.