If you're interested in how to keep challenging neural networks throughout training, check out our latest preprint! #sample_efficiency #scaling_laws
🚀 New Paper Alert! Can we generate informative synthetic data that truly helps a downstream learner? Introducing Deliberate Practice for Synthetic Data (DP), a dynamic framework that generates useful synthetic training examples by focusing on where the model struggles most. 🔥 On ImageNet-1k, DP reduces dataset size by 55 million examples while outperforming prior synthetic-data approaches! 📄 Paper: arxiv.org/pdf/2502.15588 🧵 Key takeaways ⬇️
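Not from the paper itself, just a hedged illustration of the selection loop the announcement describes: generate a large pool of synthetic candidates, keep only the ones the current learner finds hardest, and train on those. The `generator`/`learner` interfaces and the entropy-based hardness score are assumptions for the sketch, not the paper's actual criterion.

```python
# Hedged sketch of a deliberate-practice-style round (assumption: "where the
# model struggles most" is approximated by predictive entropy; the paper's
# scoring rule and generator interface may differ).
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    # probs: (N, num_classes) softmax outputs of the current learner
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def deliberate_practice_round(generator, learner, n_candidates=10_000, keep=1_000):
    candidates = generator.sample(n_candidates)    # hypothetical generator API, returns an array of examples
    probs = learner.predict_proba(candidates)      # hypothetical learner API
    hardness = predictive_entropy(probs)
    hardest = np.argsort(-hardness)[:keep]         # keep the examples the learner is least sure about
    learner.fit(candidates[hardest])               # hypothetical: train only on the informative subset
    return candidates[hardest]
```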
Mohammad Pezeshki retweeted
New @AIatMeta paper explains when a smaller, curated dataset beats using everything. Standard training wastes effort because many examples are redundant or wrong. They formalize a label generator, a pruning oracle, and a learner. From this, they derive exact error laws and sharp regime switches. With a strong generator and plenty of data, keeping hard examples works best. With a weak generator or small data, keeping easy examples, or simply keeping more, helps. They analyze 2 modes: label-agnostic pruning based on features, and label-aware pruning that first filters out wrong labels. ImageNet and LLM math results match the theory, and pruning also prevents collapse in self-training. ---- Paper – arxiv.org/abs/2511.03492 Paper Title: "Why Less is More (Sometimes): A Theory of Data Curation"
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO). In contrast, weaker models benefit from the conventional "More is More" regime, where broad data exposure is essential to learn core capabilities.
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
Mohammad Pezeshki retweeted
Nearly 3 years after our NeurIPS paper, SOTA architectures are now adopting NoPE. Kimi Linear uses NoPE for all full-attention layers (not a RoPE hybrid).
The brilliant Kimi Linear paper. It's a hybrid attention architecture that beats full attention while cutting the key-value cache by up to 75% and delivering up to 6x faster decoding at 1M-token context.

Full attention is slow because it compares every token with every other token and stores all past keys and values. Kimi Linear speeds this up by keeping a small fixed memory per head and updating it step by step like a running summary, so compute and memory stop growing with length.

Their new Kimi Delta Attention (KDA) adds a per-channel forget gate, which means each feature can separately decide what to keep and what to fade, so useful details remain and clutter goes away. They also add a tiny corrective update on every step, which nudges the memory toward the right mapping between keys and values instead of just piling on more data.

The model stacks 3 of these fast KDA layers, then 1 full-attention layer, so it still gets occasional global mixing while cutting the key-value cache by roughly 75%. The full-attention layers run with no positional encoding, and KDA learns order and recency itself, which simplifies the stack and helps at long ranges.

Under the hood, a chunkwise algorithm plus a constrained diagonal-plus-low-rank design removes unstable divisions and drops several big matrix multiplies, so the kernels run much faster on GPUs.

With the same training setup, it scores higher on common benchmarks, long-context retrieval, and math reinforcement learning, while staying fast even at 1M tokens. It drops into existing systems, saves memory, scales to 1M tokens, and improves accuracy without serving changes.

----
Paper – arxiv.org/abs/2510.26692
Paper Title: "Kimi Linear: An Expressive, Efficient Attention Architecture"
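For intuition only, here is a rough numpy sketch of a gated delta-rule state update, the mechanism the summary above attributes to Kimi Delta Attention: a per-channel forget gate plus a corrective update toward the current key-to-value mapping. Shapes, gating details, and the read-out are assumptions, not the paper's actual chunkwise kernel.

```python
# Rough sketch of one recurrent step of a gated delta-rule memory
# (illustrative only, not Kimi's GPU kernel).
import numpy as np

def kda_step(S, k, v, alpha, beta):
    """
    S:     (d_k, d_v) running key->value memory for one head
    k, v:  (d_k,), (d_v,) current key and value
    alpha: (d_k,) per-channel forget gate in [0, 1]
    beta:  scalar write strength in [0, 1]
    """
    S = alpha[:, None] * S                    # each key channel fades independently
    pred = k @ S                              # what the memory currently predicts for this key
    S = S + beta * np.outer(k, v - pred)      # delta rule: correct toward the true value
    out = k @ S                               # constant-size read-out; no growing KV cache
    return S, out
```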
Mohammad Pezeshki retweeted
Tiny Recursion Models meet Amortized Learners 🧠 After @jm_alexia's great talk, realized our framework mirrors it: recursion (Nₛᵤₚ=steps, n,T=1), detach grads but new obs each step → amortizing over long context. Works across generative models, neural processes, & beyond.
Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ A unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context lengths 🚀
My prediction is that the next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌 Predict a learned embedding of the future sequence, not the tokens themselves.
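As a hedged illustration of the idea (not the paper's training recipe): keep the usual next-token loss and add an auxiliary head that regresses each position's hidden state onto an embedding of the next W tokens. The `summary_encoder`, `fsp_head`, window size, and loss weight are assumptions.

```python
# Sketch of a future-summary auxiliary loss on top of next-token prediction.
# All interfaces are placeholders; how the summary embedding is learned in the
# actual paper is not shown here.
import torch
import torch.nn.functional as F

def fsp_loss(hidden, logits, targets, future_tokens, summary_encoder, fsp_head, lam=0.5):
    """
    hidden:        (B, T, D) transformer hidden states
    logits:        (B, T, V) next-token logits
    targets:       (B, T)    gold next tokens (long dtype)
    future_tokens: (B, T, W) the W tokens following each position
    """
    next_token = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    future_emb = summary_encoder(future_tokens)   # (B, T, D) learned summary of the future window
    pred = fsp_head(hidden)                       # (B, T, D) prediction of that summary
    summary = F.mse_loss(pred, future_emb)
    return next_token + lam * summary
```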
Alleviating long-context issues: Iterative Amortized Inference (IAI) refines solutions step-by-step over mini-batches, just like stochastic optimization. IAI merges:
- the scalability of stochastic optimization (SGD);
- the expressivity of forward-pass amortization (ICL in LLMs).
Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ A unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context lengths 🚀
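A minimal, assumption-laden sketch of the iterative refinement idea described above: instead of absorbing the whole context in a single forward pass, a small refinement network updates a belief state from one mini-batch of observations at a time, with gradients detached between steps. The GRU cell and the mean-pooled batch summary are placeholders, not the paper's architecture.

```python
# Illustrative iterative amortized inference loop.
import torch
import torch.nn as nn

class IterativeAmortizer(nn.Module):
    def __init__(self, obs_dim, state_dim):
        super().__init__()
        self.refine = nn.GRUCell(obs_dim, state_dim)   # one refinement step of the belief state

    def forward(self, minibatches, state=None):
        # minibatches: iterable of (batch, obs_dim) tensors streamed over a long context
        for obs in minibatches:
            summary = obs.mean(dim=0, keepdim=True)    # crude permutation-invariant batch summary
            state = self.refine(summary, state)        # update the belief state, SGD-style
            state = state.detach()                     # greedy update: no backprop across steps
        return state
```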
A notable work on Red Flag Tokens for LLM safety.
📃 New Paper Alert! ✨ A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens 🚩

What do you think are some major limitations of current safety training approaches? ➡️ We think the main one is their design: they rely on completely changing the model's distribution by refusing responses when content seems harmful ("Sorry, I can't answer that"). That hard switch is often brittle, leads to over-refusals, and doesn't allow recovery from the decision if the prompt was actually benign.

🔥 We propose a post-training method that improves safety while having minimal impact on the generated distribution of natural language. Our method generates a special token excluded from the user's vocabulary, which we call a red flag 🚩 token, to signal when conversations with an LLM turn harmful.

🚀 Why does this help? Cleaner eval: detecting 🚩 is objective, no judge required. Built-in generalization: works with in-context learning and transfers to other languages. Flexible use: we explore using 🚩 tokens as a hard filter or a soft trigger for safety reasoning.

Check out the demo below 👇 See the 🧵 for a deep dive! 🔗 Paper: arxiv.org/pdf/2502.16366
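Purely as a hedged sketch of how such a reserved token could be used at inference time; the token id, threshold, greedy decoding, and HF-style `model(...).logits` interface are all assumptions, not the paper's implementation.

```python
# Watch the probability of a reserved red-flag token at each decoding step and
# use it either as a hard filter or as a soft trigger for safety handling.
import torch

RED_FLAG_ID = 32_000     # hypothetical reserved token id, excluded from the user's vocabulary
THRESHOLD = 0.5          # hypothetical trigger threshold

def generate_with_red_flag(model, input_ids, max_new_tokens=256):
    """Greedy decoding for a single sequence (batch size 1 assumed)."""
    flagged = False
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]       # assumed HF-style interface
        probs = torch.softmax(logits, dim=-1)
        if probs[0, RED_FLAG_ID] > THRESHOLD:
            flagged = True                               # soft trigger: switch to safety reasoning, or
            break                                        # hard filter: stop generating here
        next_id = probs.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids, flagged
```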
A very nice read. Fixed chunks make ultra-long reasoning feasible. Very nice visualizations too! Congrats to the authors!
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales with O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
Mohammad Pezeshki retweeted
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state → linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales with O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
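To make the "chunks with a bounded state" idea concrete, here is a hedged pseudocode-level sketch; the `generate`/`is_final_answer` calls, chunk size, and tail-carryover state are assumptions, while the paper defines the actual Markovian state and RL setup.

```python
# Chunked reasoning with a bounded carried-over state: each chunk sees only the
# prompt plus a fixed-size carryover, so per-chunk compute and memory stay constant.
def markovian_think(model, prompt_ids, chunk_size=8192, carryover=512, max_chunks=12):
    state = []                                    # bounded state carried across chunks
    for _ in range(max_chunks):
        context = prompt_ids + state              # context length is bounded, not growing with total thought
        chunk = model.generate(context, max_new_tokens=chunk_size)   # hypothetical API returning token ids
        if is_final_answer(chunk):                # hypothetical stopping check
            return chunk
        state = chunk[-carryover:]                # keep only the tail of the chunk as the next state
    return state
```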
Mohammad Pezeshki retweeted
It's clear next-gen reasoning LLMs will run for millions of tokens. RL at 1M tokens needs ~100× the compute of RL at 128K. Our Markovian Thinking keeps compute scaling linear instead. Check out Milad's thread; some of my perspectives below:
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales with O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
Mohammad Pezeshki retweeted
We will be presenting Sparse Activation Steering (SAS) at #COLM2025. DM if you would like to chat. Details: Wednesday afternoon (4:30 PM – 6:30 PM), Poster number: 65.
Mohammad Pezeshki retweeted
Excited to share that our work has been accepted to #COLM2025! We're presenting our poster at Poster Session 2, Tuesday, Oct 7, 4:30–6:30 pm (Poster #68). If you're in Montreal for @COLM_conf, I'd love to chat about generalization in LLMs and their underlying biases!
This work has been accepted to #COLM2025. If you are in Montreal this week for @COLM_conf and would like to chat about this (or anything related to discovery / exploration / RL), drop me a note! Poster Session 2: Tuesday, Oct 7, 4:30–6:30 pm, Poster number 68.
Mohammad Pezeshki retweeted
Introducing RSA 🌀 (Recursive Self-Aggregation): unlocking deep thinking with test-time scaling 🔥 Qwen3-4B + RSA > DeepSeek-R1 📈 Gains across Qwen, Nemo, GPT-OSS 🏆 Benchmarks: Math • Reasoning Gym • Code ⚡ Aggregation-aware RL lets Qwen3-4B surpass o3-mini 🚀
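Roughly, recursive self-aggregation maintains a population of candidate solutions and repeatedly asks the model to merge small subsets into improved candidates. The sketch below is a hedged illustration only; population/subset sizes, the prompt template, and the final selection rule are assumptions.

```python
# Minimal recursive self-aggregation loop over a population of candidate solutions.
import random

def rsa(model, problem, population_size=16, subset_size=4, rounds=3):
    population = [model.generate(problem) for _ in range(population_size)]   # hypothetical text-in/text-out API
    for _ in range(rounds):
        new_population = []
        for _ in range(population_size):
            subset = random.sample(population, subset_size)                  # pick a few candidates to merge
            prompt = (problem
                      + "\n\nCandidate solutions:\n" + "\n---\n".join(subset)
                      + "\n\nAggregate these into a single improved solution.")
            new_population.append(model.generate(prompt))
        population = new_population
    return population[0]   # or select by majority vote / a verifier
```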
Mohammad Pezeshki retweeted
Thrilled to share recent progress on LLM reasoning methods: generating multiple candidate reasoning traces and recursively refining them. RSA implements the perfect "brainstorm" within a single model. Super team from @Mila_Quebec
Introducing RSA 🌀 (Recursive Self-Aggregation): unlocking deep thinking with test-time scaling 🔥 Qwen3-4B + RSA > DeepSeek-R1 📈 Gains across Qwen, Nemo, GPT-OSS 🏆 Benchmarks: Math • Reasoning Gym • Code ⚡ Aggregation-aware RL lets Qwen3-4B surpass o3-mini 🚀
Probably the first negative results on distillation! Very interesting findings!
1/ Excited to share the first in a series of my research updates on LLM pretraining 🚀. Our new work shows *distilled pretraining* (increasingly used to train deployable models) has trade-offs: ✅ Boosts test-time scaling ⚠️ Weakens in-context learning ✨ Needs tailored data curation
Mohammad Pezeshki retweeted
Today I'll teach my very first class at @Concordia. The course is COMP 6321 (Machine Learning). Looking forward to it!
Mohammad Pezeshki retweeted
LLMs are great at single-shot problems, but in the era of experience, interactive environments are key 🔑 Introducing *Multi-Turn Puzzles (MTP)*, a new benchmark to test multi-turn reasoning and strategizing. 🔗 Paper: huggingface.co/papers/2508.1… 🫙 Data: huggingface.co/datasets/aria…
Mohammad Pezeshki retweeted
We built a new autoregressive + RL image editing model using a strong verifier, and it beats SOTA diffusion baselines using 5× less data. 🔥 EARL: a simple, scalable RL pipeline for high-quality, controllable edits. 🧵 1/
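As a loose, hedged sketch of the verifier-as-reward idea (the editor/verifier interfaces and the plain REINFORCE-with-baseline update below are assumptions, not EARL's actual training recipe):

```python
# One toy RL step: sample edits from an autoregressive editor, score them with a
# verifier, and push up the log-probability of edits that beat the average score.
import torch

def rl_step(editor, verifier, image, instruction, optimizer, n_samples=4):
    rewards, logps = [], []
    for _ in range(n_samples):
        edit_tokens, logp = editor.sample(image, instruction)             # hypothetical: returns tokens + summed log-prob (with grad)
        rewards.append(verifier.score(image, edit_tokens, instruction))   # hypothetical verifier score
        logps.append(logp)
    rewards = torch.tensor(rewards)
    baseline = rewards.mean()                                    # simple variance-reduction baseline
    loss = -(torch.stack(logps) * (rewards - baseline)).mean()   # REINFORCE with baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```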
Today at #ICML2025, we present Deliberate Practice: an approach to improve sample efficiency by generating harder, not more, examples.
Oral talk at 10:45 - West Ballroom B | Orals 3C: Data-Centric ML
Join us to discuss principled approaches to more efficient learning.
Excited to present our work "Improving the scaling laws of synthetic data with deliberate practice" tomorrow at #ICML2025 📢 Oral: Wed. 10:45 AM 📍 West Ballroom B (Oral 3C, Data-Centric ML) 🖼️ Poster: 🕚 11:00 AM – 1:30 PM 📍 East Exhibition Hall A-B (Poster Session 3 East)
Mohammad Pezeshki retweeted
📄 New Paper Alert! ✨ 🚀 Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput. Across 135M–1.7B params, MoR carves a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2x throughput. Let's break it down! 🧵👇
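A rough, non-authoritative sketch of the recursion-with-routing idea behind MoR: one shared block is applied repeatedly and a lightweight router assigns each token its own recursion depth, so easy tokens exit early. The routing rule, block choice, and the dense (non-sparse) compute below are simplifications for illustration.

```python
# Illustrative Mixture-of-Recursions block: shared weights reused across recursion
# steps, per-token early exit via a router. A real implementation would gather only
# the still-active tokens instead of computing the block densely at every step.
import torch
import torch.nn as nn

class MixtureOfRecursions(nn.Module):
    def __init__(self, dim=256, max_recursions=4, nhead=8):
        super().__init__()
        # dim must be divisible by nhead
        self.shared_block = nn.TransformerEncoderLayer(dim, nhead=nhead, batch_first=True)
        self.router = nn.Linear(dim, max_recursions)    # scores how many recursion steps each token gets
        self.max_recursions = max_recursions

    def forward(self, x):                                # x: (batch, seq, dim)
        depth = self.router(x).argmax(dim=-1) + 1        # per-token recursion depth in [1, max_recursions]
        for step in range(self.max_recursions):
            active = (depth > step).unsqueeze(-1)        # tokens still recursing at this step
            x = torch.where(active, self.shared_block(x), x)   # shared weights reused at every depth
        return x
```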