It's rare nowadays to find something that is intuitively important and not yet done well by any major language model.
But *precisely aggregating lots of information over long contexts* is one of those things.
Our new benchmark Oolong tests this ability; see the 🧵 for more!
Can LLMs accurately aggregate information over long, information-dense texts? Not yet…
We introduce Oolong, a dataset of simple-to-verify information aggregation questions over long inputs. No model achieves >50% accuracy at 128K on Oolong!
Nov 7, 2025 · 5:40 PM UTC