It's rare nowadays to find something that is intuitively important and not yet done well by any major language model. But *precisely aggregating lots of information over long contexts* is one of those things. Our new benchmark, Oolong, tests exactly this ability; see the 🧵 for more!
Can LLMs accurately aggregate information over long, information-dense texts? Not yet… We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
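To make "simple-to-verify information aggregation" concrete, here is a minimal sketch (not from the Oolong release; all names and the counting task are illustrative assumptions): scatter labeled records through a long input, ask the model to count one label, and verify the answer by exact match against the true count.

```python
# Hypothetical sketch of a long-context aggregation check (not the Oolong code):
# scatter labeled records through a long document, ask the model to count one
# label, and score the answer by exact match against the true count.
import random
from collections import Counter

def build_example(n_records=2000, labels=("positive", "negative", "neutral"), seed=0):
    rng = random.Random(seed)
    records = [f"Record {i}: label={rng.choice(labels)}" for i in range(n_records)]
    counts = Counter(line.split("label=")[1] for line in records)
    target = rng.choice(labels)
    prompt = (
        "\n".join(records)
        + f"\n\nHow many records have label={target}? Answer with a single integer."
    )
    return prompt, counts[target]

def score(model_answer: str, gold: int) -> bool:
    # Simple to verify: correct iff the answer parses to the exact count.
    try:
        return int(model_answer.strip()) == gold
    except ValueError:
        return False

if __name__ == "__main__":
    prompt, gold = build_example()
    print(prompt[:200], "...")
    print("gold count:", gold)
    # model_answer = call_your_llm(prompt)  # hypothetical model call
    # print("correct:", score(model_answer, gold))
```

Getting such a count right requires tracking information spread across the entire input, which is exactly the kind of aggregation that degrades at long context lengths.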

Nov 7, 2025 · 5:40 PM UTC

Replying to @gneubig
Interesting to see the performance decrease at longer context lengths.
Replying to @gneubig
Context length ≠ comprehension. A capable model has to aggregate relationships across the whole input, not just retrieve isolated facts.