It's rare nowadays to find something that is intuitively important and not yet done well by any major language model.
But *precisely aggregating lots of information over long contexts* is one of those things.
Our new benchmark Oolong tests this ability; see the 🧵 for more!
Can LLMs accurately aggregate information over long, information-dense texts? Not yet…
We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
Nov 7, 2025 · 5:40 PM UTC