We hosted a group of researchers for dinner at the Conference on Language Modeling (COLM). Some takeaways from the conference:
#COLM2025
(1) The Qwen team plans to scale training tokens from 10T to 100T, model parameters from 1T to 10T, and context length from 1M to 10M tokens.
@natolambert
(2) We've learned that distilling from larger reasoning models is actually more effective than pure reinforcement learning for small- and medium-sized models (~32B).
@_lewtun
(3) Today the most intelligent models remain closed, but this may not matter as much given the plethora of training and data techniques that can get a group of smaller models to outperform larger models on specific tasks, for example by encouraging collaboration among them.
@jacobeisenstein @adamjfisch
(4) We saw the rise of small language models for multimodal tasks. For example, Hugging Face had a paper at COLM that achieved state-of-the-art performance for Vision Language Models, outperforming models 300x larger.
@LoubnaBenAllal1 @Thom_Wolf