In collaboration with @CommonCrawl @MLCommons @AiEleuther, the first edition of WMDQS at @COLM_conf starts tomorrow in Room 520A! We have an updated schedule on our website, including a list of all accepted papers.

Oct 9, 2025 · 8:17 PM UTC

Our first keynote will be from Julia Kreutzer about data for multilingual fine-tuning.
Our second keynote will be by @davlanade about text quality for low-resource languages.
Our third and final keynote will be from @sebnagel about the data in Common Crawl.
We will also have a session on our shared task, which was about improving language identification models. Participants of the shared task contributed annotations to create a new LangID dataset and also submitted new LangID systems.