Claude Sonnet 4 on ARC-AGI Semi Private Eval

Base
* ARC-AGI-1: 23%, $0.08/task
* ARC-AGI-2: 1.2%, $0.12/task

Thinking 16K
* ARC-AGI-1: 40%, $0.36/task
* ARC-AGI-2: 5.9%, $0.48/task

Sonnet 4 sets new SOTA (5.9%) on ARC-AGI-2
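Several replies below ask how cost compares with the performance improvement. A minimal sketch of the arithmetic, using only the scores and per-task costs quoted in the tweet; "cost per point" is just an illustrative ratio, not an official ARC Prize metric:

```python
# Scores (%) and per-task costs ($) as quoted in the tweet above.
results = {
    "Base":         {"ARC-AGI-1": (23.0, 0.08), "ARC-AGI-2": (1.2, 0.12)},
    "Thinking 16K": {"ARC-AGI-1": (40.0, 0.36), "ARC-AGI-2": (5.9, 0.48)},
}

for mode, benchmarks in results.items():
    for bench, (score, cost) in benchmarks.items():
        # Dollars spent per percentage point of accuracy.
        print(f"{mode:12s} {bench}: {score}% at ${cost:.2f}/task "
              f"-> ${cost / score:.4f} per point")
```

On ARC-AGI-1, Thinking 16K costs 4.5x more per task for 1.7x the score; on ARC-AGI-2 the score jump (1.2% to 5.9%) outpaces the 4x cost increase, so cost per point actually drops.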

May 27, 2025 · 4:55 PM UTC

Replying to @arcprize
That’s actually really good
Replying to @arcprize
That's pretty notable. What about Opus 4 then?
Replying to @arcprize
Can we have a separate benchmark for models that process examples by text vs. vision? It should be solved with vision.
Replying to @arcprize
nice, i wonder where opus/gemini is? ah - it's green here, but that's gemini 2.5 flash. i wonder where gemini 2.5 non-flash is
Replying to @arcprize
Yes, Sonnet thinking is really good 👌
Replying to @arcprize
looks like the triangles are the ARC-AGI-2 test, so Sonnet 4 really has breakout performance there
Replying to @arcprize
looks like OpenAI clearly did some serious pruning to get o3 medium into prod, whereas they spent lots of inference-time compute on the previous o3
Replying to @arcprize
wen Opus
Replying to @arcprize
Can you guys make your charts readable? Understanding the chart shouldn't be an ARC-AGI challenge. It's not clear which is challenge 1 or 2, especially since on mobile you can't zoom in or out on the chart to see the legend.
Replying to @arcprize
Where is Gemini Pro 2.5?
Replying to @arcprize
This to me is proof we need more than RL and LC for adaptability. We need real continuous learning and layered memory.
Replying to @arcprize
Where's Gemini 2.5 Pro?
Replying to @arcprize
The chart is starting to look like a mess. Maybe attach two images: one for ARC-1, another for ARC-2?
Replying to @arcprize
When Gemini 2.5 Pro?
Replying to @arcprize
5.9% on ARC-AGI-2 is no small feat. Curious about Opus.
Replying to @arcprize
that is pretty good actually
Replying to @arcprize
I give it another 8-9 months till ARC-AGI-2 is in the 80%-ish area for some model.
o3 (low) is stronger on ARC-AGI-1 though - how come it's not better here? 🤔
Replying to @arcprize
Looking forward to the newly updated Deepseek-R1's performance on ARC.
Replying to @arcprize
The subtle improvement with ARC-AGI-2 might suggest potential for efficiency-based advancements.
Replying to @arcprize
The significant improvement on Thought vs. Eval metrics is interesting to observe.
Replying to @arcprize
What makes ARC-AGI evaluations significant?
Replying to @arcprize
The testing efficiency difference is significant here.
Replying to @arcprize
How does cost compare with performance improvement?
Replying to @arcprize
Evaluative results contrast in pricing efficiency.
Replying to @arcprize
When opus?
Replying to @arcprize
A marginal yet noteworthy improvement.
Replying to @arcprize
How does this affect ARC-AGI-1 performance?
Replying to @arcprize
👍
Replying to @arcprize
A substantial leap in performance efficiency observed.
Replying to @arcprize
but on ARC-AGI-1 it's not even on the Pareto frontier: it gets "eaten" by o4-mini (medium)
Replying to @arcprize
The ongoing cost increase is notable.