We’ve identified a “Collaboration Gap” in today’s top AI models.
Testing 32 leading LMs on our novel maze-solving benchmark, we found that models that excel solo can see their performance *collapse* when required to collaborate – even with an identical copy of themselves.
A 🧵
Why does this matter? The future of AI won’t be one giant model; it will be systems of multiple, independent AI agents with different information and skills.
The success of such systems will critically depend on effective collaboration. But how do we measure collaborative capabilities?
Real-world communication: Current multi-agent systems rely on *pre-defined* communication protocols (e.g., MCP) or central orchestration.
In contrast, open-world integration likely requires adaptive, *dynamic* communication – something humans are surprisingly good at!
How did we measure this? We designed a collaborative maze-solving benchmark that *isolates* collaborative capabilities.
The twist: no agent gets the full map. We split the info, giving each agent a partial view. The *only* way to solve the maze is to talk, share & agree on moves
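For intuition, here’s a minimal sketch of what one episode of such a benchmark could look like (the names, message format, and agreement rule below are illustrative assumptions, not our actual harness): each agent sees only its partial view, the agents exchange free-form messages, and a move is executed only when both propose the same one.

```python
# Minimal sketch of a collaborative maze episode (illustrative, not the paper's code).
# `agent_a` / `agent_b` are callables taking (partial_view, dialogue) and returning
# (message, proposed_move); `maze` hands out partial views and executes agreed moves.

def run_episode(agent_a, agent_b, maze, max_turns=40):
    dialogue = []
    for _ in range(max_turns):
        msg_a, move_a = agent_a(maze.partial_view("A"), dialogue)
        dialogue.append(("A", msg_a))
        msg_b, move_b = agent_b(maze.partial_view("B"), dialogue)
        dialogue.append(("B", msg_b))

        # Agreement is the crux: if the proposals differ, nothing happens this turn.
        if move_a is not None and move_a == move_b:
            maze.step(move_a)
            if maze.solved():
                return True, dialogue
    return False, dialogue
```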
Why is this hard? Splitting up information and requiring agreement forces agents to engage in “grounding”: making sure that shared information and proposed actions are understood the same way by both agents.
Failure to ground has consequences (see image).
Stronger models are better at grounding than weaker models (see the sketch below):
🟢 Strong collaborators (left) immediately define a coordinate system and share info.
🔴 Weak ones (right) are vague, leading to confusion, disagreement, and failure.
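To make the contrast concrete, here’s a hypothetical pair of opening messages (invented for illustration, not taken from actual transcripts): the grounded one pins down a shared coordinate convention before anything else; the vague one leaves every referent ambiguous.

```python
# Hypothetical first messages (illustrative only, not real model transcripts).

grounded_opener = (
    "Proposal: use (row, col) coordinates with (0, 0) at the top-left and rows "
    "increasing downward. On my half I see walls at (1, 2) and (2, 2), and the "
    "exit at (4, 4). Confirm the coordinate system, then share what you see."
)

ungrounded_opener = (
    "I can see part of the maze; there are walls near the middle and the exit "
    "looks like it's on my side. Let's just head toward it."
    # "the middle", "my side", and "it" have no shared frame of reference,
    # so the partner can't reconcile this with its own view.
)
```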
The Collaboration Gap: Even when models are *really* good at completing mazes solo, requiring them to solve the *same* mazes with independent copies of themselves can drastically reduce performance. This gap is especially pronounced in distilled models.
Letting models with different strengths and from different builders collaborate provides further insights: ordering and cross-family pairings matter, a *lot*.
Generally: strong model starts > weak model starts, even though both need to agree on each move!
Because which model starts has such a pronounced impact on success, we experimented with a “relay” inference strategy: have a strong (expensive) model “prime” the dialogue with just the first K messages, then hand off to a weaker (cheaper) model to finish (sketch below).
Alternatively, we could use a strong model to “recover” a dialogue:
a) Strong Primer: Just one strong "priming" message (K=2) lets a weak model perform near the strong model's level.
b) Strong Recovery: If weak models start, a strong model struggles to recover the session.
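Here’s a minimal sketch of that relay idea (hypothetical helper names; the handoff rule is an assumption, not our exact implementation): the strong model drives the dialogue until K messages have been exchanged, then the cheaper model takes over.

```python
# Sketch of "relay" inference (illustrative): a strong model primes the dialogue
# for the first k messages, then a weaker, cheaper model finishes the episode.

def relay_agent(strong_model, weak_model, k):
    """Both models take (view, dialogue) and return (message, proposed_move)."""
    def agent(view, dialogue):
        model = strong_model if len(dialogue) < k else weak_model
        return model(view, dialogue)
    return agent

# Usage with the episode loop sketched earlier, e.g.:
#   agent_a = relay_agent(strong_model, weak_model, k=2)  # strong primer, K=2
#   agent_b = weak_model
#   run_episode(agent_a, agent_b, maze)
```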
Our findings argue that collaboration is a distinct capability that current training strategies fail to capture.
We shouldn’t just hope for it to emerge – we must *design* for it. This means new evals, training strategies, and interaction designs.
This research was done during my internship @MSFTResearch. Thank you to my awesome collaborators! @adamfourney @SaleemaAmershi @cervisiarius @erichorvitz @ecekamar
Read the full paper here:
> arxiv.org/abs/2511.02687
And a lighter blogpost:
> trdavidson.com/collaboration…