🎮 Computer Agent Arena is LIVE! 🚀
🔥 Easiest way to test computer-use agents in the wild without any setup
🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen2.5-VL and more
🕹️ Test agents on 100+ real apps & websites with one-click config
🔒 Safe & free access on cloud-hosted machines
Page:
arena.xlang.ai
Leaderboard (tentative):
arena.xlang.ai/leaderboard
Blog:
arena.xlang.ai/blog/computer…
Data & Code (coming soon):
github.com/xlang-ai/computer…
⭐️Why Computer Agent Arena?
1️⃣Beyond Static Benchmarks: We use computers to perform an enormous number of tasks and workflows every day, and AI agents have the potential to automate them. However, existing benchmarks are limited in scale (e.g., only 369 tasks in OSWorld and 812 tasks in WebArena). To better measure agents' capabilities, we introduce Computer Agent Arena, where users can easily compare & test AI agents on all kinds of crowdsourced, real-world computer-use tasks.
2️⃣Cloud Testing, Simplified: As agents like OpenAI’s Operator and Claude 3.7 Sonnet are released, users face configuration challenges and privacy hurdles when deploying them on their own computers. Our platform integrates these agents with cloud-hosted machines, giving users quick and secure access.
3️⃣Unified Embodied Digital Environment: Unlike Chatbot Arena, we provide users with a real embodied environment, the computer, where all agents are grounded in real computer tasks and environments.
Led by @XLANG_Lab [1/🧵]