Obviously, bar height has to represent tokens used… I.e. it takes GPT-5 roughly 2x as many tokens to get a ~23% less accurate answer than o3.
I’m fucking around but I wouldn’t be surprised if that’s what the bar height actually meant.
Aug 8, 2025 · 1:15 AM UTC

