We’ve identified a “Collaboration Gap” in today’s top AI models. Testing 32 leading LMs on our novel maze-solving benchmark, we found that models that excel solo can see their performance *collapse* when required to collaborate – even with an identical copy of themselves. A 🧵

Nov 5, 2025 · 12:25 PM UTC

Why does this matter? The future of AI won’t be one giant model; it’ll be systems of multiple, independent AI agents w/ different information and skills. The success of such systems will critically depend on effective collaboration. But how do we measure collaborative capabilities?
Real-world communication: current multi-agent systems rely on *pre-defined* communication protocols (e.g., MCP) or central orchestration. In contrast, open-world integration likely requires adaptive, *dynamic* communication – something humans are surprisingly good at!
How did we measure this? We designed a collaborative maze-solving benchmark that *isolates* collaborative capabilities. The twist: no agent gets the full map. We split the info, giving each agent a partial view. The *only* way to solve the maze is to talk, share & agree on moves.
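To make the setup concrete, here is a minimal sketch of how a maze could be split into two partial views (the random 50/50 wall split and all names are illustrative assumptions, not our actual benchmark code):

```python
# Illustrative sketch only: partition a maze's walls into two partial views,
# so neither agent alone has enough information to plan a full path.
import random

def split_maze(walls: set[tuple[int, int]], seed: int = 0) -> tuple[set, set]:
    """Randomly assign each wall cell to exactly one agent's view."""
    rng = random.Random(seed)
    view_a, view_b = set(), set()
    for wall in sorted(walls):
        (view_a if rng.random() < 0.5 else view_b).add(wall)
    return view_a, view_b

# view_a | view_b recovers the full maze; each agent only ever sees its own half,
# so the pair must exchange messages (and agree on every move) to solve it.
```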
Why is this hard? By splitting up information and requiring agreement, agents have to engage in “grounding”: are shared information and actions understood the same way by both agents? Failure to ground has consequences (see image).
Stronger models are better at grounding than weaker models: 🟢 Strong collaborators (left) immediately define a coordinate system and share info. 🔴 Weak ones (right) are vague, leading to confusion, disagreement, and failure.
The Collaboration Gap: Even when models are *really* good at completing mazes solo, requiring them to solve the *same* mazes with independent copies of themselves can drastically reduce performance. This gap is especially pronounced in distilled models.
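One simple way to think about the gap (our shorthand here, not necessarily the exact metric in the paper) is the drop from a model’s solo solve rate to its self-collaboration solve rate:

```python
# Hypothetical shorthand for the Collaboration Gap: solo solve rate minus
# the solve rate when two copies of the same model must collaborate.
def collaboration_gap(solo_solved: list[bool], pair_solved: list[bool]) -> float:
    solo_rate = sum(solo_solved) / len(solo_solved)
    pair_rate = sum(pair_solved) / len(pair_solved)
    return solo_rate - pair_rate  # larger value = sharper collapse under collaboration
```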
Letting models with different strengths and from different builders collaborate provides further insights: ordering and cross-family pairings matter, a *lot*. Generally: strong model starts > weak model starts, even though both need to agree on each move!
Because which model starts has such a pronounced impact on success, we experimented with a “relay” inference strategy: Have a strong (expensive) model “prime” the dialogue with just the first K messages, then hand off to a weaker (cheaper) model to finish.
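Roughly, the relay looks like this (a sketch assuming chat-style agent objects with a reply(history) method; the interface and stopping rule are placeholders, not our actual implementation):

```python
# Sketch of the "relay" inference strategy: a strong model primes one seat of the
# dialogue for its first k messages, then a weaker model takes over that seat.
def relay_dialogue(strong, weak, partner, k: int, max_rounds: int = 20) -> list[str]:
    history: list[str] = []
    for rnd in range(max_rounds):
        seat = strong if rnd < k else weak      # hand off after k primed messages
        history.append(seat.reply(history))     # primed seat speaks
        history.append(partner.reply(history))  # the other agent responds
    return history
```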
Alternatively, a strong model could try to “recover” a dialogue started by a weak one. Comparing the two setups: a) Strong Primer: just one strong “priming” message (K=2) lets a weak model perform near the strong model’s level. b) Strong Recovery: if weak models start, a strong model struggles to recover the session.
Our findings argue that collaboration is a distinct capability that current training strategies fail to capture. We shouldn’t just hope for it to emerge – we must *design* for it. This means new evals, training strategies, and interaction designs.
This research was done during my internship @MSFTResearch. Thank you to my awesome collaborators! @adamfourney @SaleemaAmershi @cervisiarius @erichorvitz @ecekamar Read the full paper here: arxiv.org/abs/2511.02687 And a lighter blog post: trdavidson.com/collaboration…