Where do computer-use agents (CUAs) stand on OSWorld-Verified? Potentially already at ~80%.
This analysis summarizes the latest OSWorld-Verified submissions, covering 27 models evaluated over 369 tasks, and includes a case study of the o3+Jedi-7B approach to explore current capability boundaries and identify key bottlenecks in computer-use agent performance.
Overall Insights:
- Together We're There: The 80% Breakthrough: Collectively, the 27 models solve 78.86% of OSWorld tasks. No single model comes close, but together they crack nearly 80%. This reveals massive potential for ensemble approaches and for RL systems that can learn from diverse model behaviors.
- Alone We Struggle: The Reality Check: Behind the 80% headline, only 11.92% of tasks achieve near-perfect performance across models, while 39.30% remain consistently difficult. We are not at 80% capability; we are at a distributed struggle across difficulty levels. Most tasks remain genuinely hard for current models, with only about 1 in 8 showing strong, consistent performance.
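The two statistics above fall out of the same per-task solve matrix: union coverage ("solved by at least one model") versus per-task solve rates bucketed into near-perfect and consistently-hard tasks. A minimal sketch, assuming a hypothetical boolean tasks-by-models matrix (the function name and thresholds are illustrative, not from the original analysis):

```python
# Hypothetical sketch: ensemble coverage vs. per-task difficulty buckets.
# `results` is a (tasks x models) boolean matrix; real OSWorld-Verified
# numbers would come from the leaderboard's per-task logs.
from typing import List, Tuple

def coverage_and_buckets(results: List[List[bool]],
                         easy_thresh: float = 0.9,
                         hard_thresh: float = 0.1) -> Tuple[float, float, float]:
    """Return (union solve rate, fraction near-perfect, fraction consistently hard)."""
    n_tasks = len(results)
    n_models = len(results[0])
    # Union coverage: a task counts if ANY model solved it.
    solved_by_any = sum(any(row) for row in results) / n_tasks
    # Per-task solve rate across models.
    rates = [sum(row) / n_models for row in results]
    near_perfect = sum(r >= easy_thresh for r in rates) / n_tasks
    consistently_hard = sum(r <= hard_thresh for r in rates) / n_tasks
    return solved_by_any, near_perfect, consistently_hard

# Toy example: 4 tasks, 3 models.
toy = [
    [True, True, True],     # near-perfect task
    [True, False, False],   # mixed
    [False, False, False],  # consistently hard
    [False, True, True],    # mixed
]
print(coverage_and_buckets(toy))  # (0.75, 0.25, 0.25)
```

The toy matrix makes the gap concrete: union coverage is 75% even though only one task in four is solved reliably, mirroring how 78.86% collective coverage can coexist with 11.92% near-perfect tasks.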
- Planning Problems: The Critical Bottleneck: While state-of-the-art models have significantly improved general tool-use (decision) quality, their understanding of GUI concepts, knowledge, and the related decision-making remains limited; richer GUI interaction experience would help close these capability gaps. Planner limitations, including failure to understand the current state and getting stuck in loops, are the critical bottleneck we most urgently need to address.
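The "stuck in loops" failure mode is easy to detect mechanically in an agent harness. A minimal sketch, assuming a hypothetical trajectory of (action, state-hash) pairs (the helper name and window size are illustrative, not part of the original evaluation):

```python
# Hypothetical sketch: flag when a planner repeats the same action
# on an unchanged observation, a common loop failure in CUA traces.
from collections import deque

def is_stuck(history, window: int = 3) -> bool:
    """True if the last `window` (action, state_hash) pairs are identical."""
    if len(history) < window:
        return False
    recent = list(history)[-window:]
    return len(set(recent)) == 1  # all identical -> looping

history = deque(maxlen=10)
for step in [("click", "s1"), ("click", "s1"), ("click", "s1")]:
    history.append(step)
print(is_stuck(history))  # True
```

A harness could use such a signal to force a replan or inject a "you appear to be repeating yourself" hint, rather than burning the step budget on identical clicks.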
- Grounding Gaps: Fine-Grained and Long-Tail Failures Matter: Current open-source grounding models perform well on common patterns such as text and familiar icons, but still occasionally fail on uncommon icons and on fine-grained operations involving text selection and tables. Beyond the weaknesses we already know about, failures also surface in unexpected places: repeatedly missing a checkbox, or failing to locate an unfamiliar icon. These seemingly minor errors can determine whether a task succeeds or fails entirely.
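Grounding accuracy of this kind is typically scored by whether the predicted click lands inside the target element's bounding box, which is exactly where small targets like checkboxes punish a few pixels of error. A minimal sketch of that hit test (function name and coordinates are illustrative):

```python
# Hypothetical sketch: the usual grounding check of whether a predicted
# click falls inside the target element's bounding box.
def click_hit(pred_xy, bbox) -> bool:
    """bbox = (left, top, right, bottom) in screen pixels."""
    x, y = pred_xy
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

# A small 12x12 checkbox: a prediction a few pixels to the left is a miss.
print(click_hit((105, 52), (100, 48, 112, 60)))  # True
print(click_hit((95, 52), (100, 48, 112, 60)))   # False
```

On large buttons a few pixels of error are invisible; on a checkbox or a table cell boundary the same error flips the hit test, which is why these long-tail misses can decide task success outright.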