If you're interested in how to keep challenging neural networks throughout training, check out our latest preprint! #sample_efficiency #scaling_laws
🚀 New Paper Alert! Can we generate informative synthetic data that truly helps a downstream learner? Introducing Deliberate Practice for Synthetic Data (DP), a dynamic framework that generates useful synthetic training examples by focusing on where the model struggles most. 🔥 On ImageNet-1k, DP reduces dataset size by 55 million examples while outperforming prior synthetic-data approaches! 📄 Paper: arxiv.org/pdf/2502.15588 🧵 Key takeaways ⬇️
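Not from the paper itself, just a hedged illustration of the selection loop the announcement describes: generate a large pool of synthetic candidates, keep only the ones the current learner finds hardest, and train on those. The `generator`/`learner` interfaces and the entropy-based hardness score are assumptions for the sketch, not the paper's actual criterion.

```python
# Hedged sketch of a deliberate-practice-style round (assumption: "where the
# model struggles most" is approximated by predictive entropy; the paper's
# scoring rule and generator interface may differ).
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    # probs: (N, num_classes) softmax outputs of the current learner
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def deliberate_practice_round(generator, learner, n_candidates=10_000, keep=1_000):
    candidates = generator.sample(n_candidates)    # hypothetical generator API, returns an array of examples
    probs = learner.predict_proba(candidates)      # hypothetical learner API
    hardness = predictive_entropy(probs)
    hardest = np.argsort(-hardness)[:keep]         # keep the examples the learner is least sure about
    learner.fit(candidates[hardest])               # hypothetical: train only on the informative subset
    return candidates[hardest]
```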
Mohammad Pezeshki retweeted
New @AIatMeta paper explains when a smaller, curated dataset beats using everything. Standard training wastes effort because many examples are redundant or wrong. They formalize a label generator, a pruning oracle, and a learner. From this, they derive exact error laws and sharp regime switches. With a strong generator and plenty of data, keeping hard examples works best. With a weak generator or small data, keeping easy examples, or simply keeping more, helps. They analyze 2 modes: label-agnostic pruning based on features, and label-aware pruning that first filters out wrong labels. ImageNet and LLM math results match the theory, and pruning also prevents collapse in self-training. ---- Paper – arxiv.org/abs/2511.03492 Paper Title: "Why Less is More (Sometimes): A Theory of Data Curation"
We show a phase transition for optimal data curation: for strong models, concentrating on difficult samples drives further improvement (LIMO). In contrast, weaker models benefit from the conventional "More is More" regime, where broad data exposure is essential to learn core capabilities.
1/n "Less is More" (s1, etc.) vs "More is More", which mantra is correct for the training/fine-tuning large LLMs? In our recent preprint, we reconcile both of these. They correspond to different parts of a complex phase diagram
Mohammad Pezeshki retweeted
Nearly 3 years after our NeurIPS paper, SOTA architectures are now adopting NoPE. Kimi Linear uses NoPE for all full-attention layers (not a RoPE hybrid).
The brilliant Kimi Linear paper. It's a hybrid attention architecture that beats full attention while cutting the key-value cache by up to 75% and delivering up to 6x faster decoding at 1M-token context.

Full attention is slow because it compares every token with every other token and stores all past keys and values. Kimi Linear speeds this up by keeping a small fixed memory per head and updating it step by step like a running summary, so compute and memory stop growing with length.

Their new Kimi Delta Attention (KDA) adds a per-channel forget gate, which means each feature can separately decide what to keep and what to fade, so useful details remain and clutter goes away. They also add a tiny corrective update on every step, which nudges the memory toward the right mapping between keys and values instead of just piling on more data.

The model stacks 3 of these fast KDA layers, then 1 full-attention layer, so it still gets occasional global mixing while cutting the key-value cache by roughly 75%. The full-attention layers run with no positional encoding, and KDA learns order and recency itself, which simplifies the stack and helps at long ranges.

Under the hood, a chunkwise algorithm plus a constrained diagonal-plus-low-rank design removes unstable divisions and drops several big matrix multiplies, so the kernels run much faster on GPUs.

With the same training setup, it scores higher on common benchmarks, long-context retrieval, and math reinforcement learning, while staying fast even at 1M tokens. It drops into existing systems, saves memory, scales to 1M tokens, and improves accuracy without serving changes.

----
Paper – arxiv.org/abs/2510.26692
Paper Title: "Kimi Linear: An Expressive, Efficient Attention Architecture"
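For intuition only, here is a rough numpy sketch of a gated delta-rule state update, the mechanism the summary above attributes to Kimi Delta Attention: a per-channel forget gate plus a corrective update toward the current key-to-value mapping. Shapes, gating details, and the read-out are assumptions, not the paper's actual chunkwise kernel.

```python
# Rough sketch of one recurrent step of a gated delta-rule memory
# (illustrative only, not Kimi's GPU kernel).
import numpy as np

def kda_step(S, k, v, alpha, beta):
    """
    S:     (d_k, d_v) running key->value memory for one head
    k, v:  (d_k,), (d_v,) current key and value
    alpha: (d_k,) per-channel forget gate in [0, 1]
    beta:  scalar write strength in [0, 1]
    """
    S = alpha[:, None] * S                    # each key channel fades independently
    pred = k @ S                              # what the memory currently predicts for this key
    S = S + beta * np.outer(k, v - pred)      # delta rule: correct toward the true value
    out = k @ S                               # constant-size read-out; no growing KV cache
    return S, out
```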
Mohammad Pezeshki retweeted
Tiny Recursion Models meet Amortized Learners 🧠 After @jm_alexia's great talk, realized our framework mirrors it: recursion (Nₛᵤₚ=steps, n,T=1), detach grads but new obs each step → amortizing over long context. Works across generative models, neural processes, & beyond.
Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ A unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context lengths 🚀
My prediction is that the next-token prediction loss will not stand the test of time, and the next frontier models will need richer loss functions. In this paper, we take a step towards that, shifting from predicting a single token to predicting a summary of the future.
[1/9] While pretraining data might be hitting a wall, novel methods for modeling it are just getting started! We introduce future summary prediction (FSP), where the model predicts future sequence embeddings to reduce teacher forcing & shortcut learning. 📌 Predict a learned embedding of the future sequence, not the tokens themselves.
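As a hedged illustration of the idea (not the paper's training recipe): keep the usual next-token loss and add an auxiliary head that regresses each position's hidden state onto an embedding of the next W tokens. The `summary_encoder`, `fsp_head`, window size, and loss weight are assumptions.

```python
# Sketch of a future-summary auxiliary loss on top of next-token prediction.
# All interfaces are placeholders; how the summary embedding is learned in the
# actual paper is not shown here.
import torch
import torch.nn.functional as F

def fsp_loss(hidden, logits, targets, future_tokens, summary_encoder, fsp_head, lam=0.5):
    """
    hidden:        (B, T, D) transformer hidden states
    logits:        (B, T, V) next-token logits
    targets:       (B, T)    gold next tokens (long dtype)
    future_tokens: (B, T, W) the W tokens following each position
    """
    next_token = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    future_emb = summary_encoder(future_tokens)   # (B, T, D) learned summary of the future window
    pred = fsp_head(hidden)                       # (B, T, D) prediction of that summary
    summary = F.mse_loss(pred, future_emb)
    return next_token + lam * summary
```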
Alleviating long-context issues: Iterative Amortized Inference (IAI) refines solutions step-by-step over mini-batches, just like stochastic optimization. IAI merges:
- the scalability of stochastic optimization (SGD);
- the expressivity of forward-pass amortization (ICL in LLMs).
Meta on meta: thrilled to share our work on Meta-learning… at Meta! 🔥🧠 We make two major contributions: 1️⃣ A unified framework revealing insights into various amortizations 🧠 2️⃣ Greedy belief-state updates to handle long context lengths 🚀
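A minimal, assumption-laden sketch of the iterative refinement idea described above: instead of absorbing the whole context in a single forward pass, a small refinement network updates a belief state from one mini-batch of observations at a time, with gradients detached between steps. The GRU cell and the mean-pooled batch summary are placeholders, not the paper's architecture.

```python
# Illustrative iterative amortized inference loop.
import torch
import torch.nn as nn

class IterativeAmortizer(nn.Module):
    def __init__(self, obs_dim, state_dim):
        super().__init__()
        self.refine = nn.GRUCell(obs_dim, state_dim)   # one refinement step of the belief state

    def forward(self, minibatches, state=None):
        # minibatches: iterable of (batch, obs_dim) tensors streamed over a long context
        for obs in minibatches:
            summary = obs.mean(dim=0, keepdim=True)    # crude permutation-invariant batch summary
            state = self.refine(summary, state)        # update the belief state, SGD-style
            state = state.detach()                     # greedy update: no backprop across steps
        return state
```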
A notable work on Red Flag Tokens for LLM safety.
📃 New Paper Alert! ✨ A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens 🚩

What do you think are some major limitations of current safety training approaches? ➡️ We think the main one is their design: they rely on completely changing the model's distribution by refusing responses when content seems harmful ("Sorry, I can't answer that"). That hard switch is often brittle, leads to over-refusals, and doesn't allow recovery from the decision if the prompt was actually benign.

🔥 We propose a post-training method that improves safety while having minimal impact on the generated distribution of natural language. Our method generates a special token excluded from the user's vocabulary, which we call a red flag 🚩 token, to signal when conversations with an LLM turn harmful.

🚀 Why does this help? Cleaner eval: detecting 🚩 is objective, no judge required. Built-in generalization: works with in-context learning and transfers to other languages. Flexible use: we explore using 🚩 tokens as a hard filter or a soft trigger for safety reasoning.

Check out the demo below 👇 See the 🧵 for a deep dive! 🔗 Paper: arxiv.org/pdf/2502.16366
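Purely as a hedged sketch of how such a reserved token could be used at inference time; the token id, threshold, greedy decoding, and HF-style `model(...).logits` interface are all assumptions, not the paper's implementation.

```python
# Watch the probability of a reserved red-flag token at each decoding step and
# use it either as a hard filter or as a soft trigger for safety handling.
import torch

RED_FLAG_ID = 32_000     # hypothetical reserved token id, excluded from the user's vocabulary
THRESHOLD = 0.5          # hypothetical trigger threshold

def generate_with_red_flag(model, input_ids, max_new_tokens=256):
    """Greedy decoding for a single sequence (batch size 1 assumed)."""
    flagged = False
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[:, -1, :]       # assumed HF-style interface
        probs = torch.softmax(logits, dim=-1)
        if probs[0, RED_FLAG_ID] > THRESHOLD:
            flagged = True                               # soft trigger: switch to safety reasoning, or
            break                                        # hard filter: stop generating here
        next_id = probs.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids, flagged
```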
A very nice read. Fixed chunks make ultra-long reasoning feasible. Very nice visualizations too! Congrats to the authors!
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales with O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
Mohammad Pezeshki retweeted
Long reasoning without the quadratic tax: The Markovian Thinker makes LLMs reason in chunks with a bounded state → linear compute, constant memory, and it keeps scaling beyond the training limit. 1/6
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales with O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
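To make the "chunks with a bounded state" idea concrete, here is a hedged pseudocode-level sketch; the `generate`/`is_final_answer` calls, chunk size, and tail-carryover state are assumptions, while the paper defines the actual Markovian state and RL setup.

```python
# Chunked reasoning with a bounded carried-over state: each chunk sees only the
# prompt plus a fixed-size carryover, so per-chunk compute and memory stay constant.
def markovian_think(model, prompt_ids, chunk_size=8192, carryover=512, max_chunks=12):
    state = []                                    # bounded state carried across chunks
    for _ in range(max_chunks):
        context = prompt_ids + state              # context length is bounded, not growing with total thought
        chunk = model.generate(context, max_new_tokens=chunk_size)   # hypothetical API returning token ids
        if is_final_answer(chunk):                # hypothetical stopping check
            return chunk
        state = chunk[-carryover:]                # keep only the tail of the chunk as the next state
    return state
```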
Mohammad Pezeshki retweeted
It's clear next-gen reasoning LLMs will run for millions of tokens. RL at 1M tokens needs ~100× the compute of RL at 128K. Our Markovian Thinking keeps compute scaling linear instead. Check out Milad's thread; some of my perspectives below:
Introducing linear scaling of reasoning: The Markovian Thinker. Reformulate RL so thinking scales with O(n) compute, not O(n^2), with O(1) memory, architecture-agnostic. Train R1-1.5B into a Markovian Thinker with a 96K thought budget, ~2X accuracy 🧵
Mohammad Pezeshki retweeted
We will be presenting Sparse Activation Steering (SAS) at #COLM2025. DM if you would like to chat. Details: Wednesday afternoon (4:30 PM – 6:30 PM), Poster number: 65.
Mohammad Pezeshki retweeted
Excited to share that our work has been accepted to #COLM2025! We're presenting our poster at Poster Session 2, Tuesday, Oct 7, 4:30–6:30 pm (Poster #68). If you're in Montreal for @COLM_conf, I'd love to chat about generalization in LLMs and their underlying biases!
This work has been accepted to #COLM2025. If you are in Montreal this week for @COLM_conf and would like to chat about this (or anything related to discovery / exploration / RL), drop me a note! Poster Session 2: Tuesday, Oct 7, 4:30–6:30 pm, Poster number 68.
Mohammad Pezeshki retweeted
Introducing RSA 🌀 (Recursive Self-Aggregation): unlocking deep thinking with test-time scaling 🔥 Qwen3-4B + RSA > DeepSeek-R1 📈 Gains across Qwen, Nemo, GPT-OSS 🏆 Benchmarks: Math • Reasoning Gym • Code ⚡ Aggregation-aware RL lets Qwen3-4B surpass o3-mini 🚀
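Roughly, recursive self-aggregation maintains a population of candidate solutions and repeatedly asks the model to merge small subsets into improved candidates. The sketch below is a hedged illustration only; population/subset sizes, the prompt template, and the final selection rule are assumptions.

```python
# Minimal recursive self-aggregation loop over a population of candidate solutions.
import random

def rsa(model, problem, population_size=16, subset_size=4, rounds=3):
    population = [model.generate(problem) for _ in range(population_size)]   # hypothetical text-in/text-out API
    for _ in range(rounds):
        new_population = []
        for _ in range(population_size):
            subset = random.sample(population, subset_size)                  # pick a few candidates to merge
            prompt = (problem
                      + "\n\nCandidate solutions:\n" + "\n---\n".join(subset)
                      + "\n\nAggregate these into a single improved solution.")
            new_population.append(model.generate(prompt))
        population = new_population
    return population[0]   # or select by majority vote / a verifier
```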
Mohammad Pezeshki retweeted
Thrilled to share recent progress on LLM reasoning methods: generating multiple candidate reasoning traces and recursively refining them. RSA implements the perfect "brainstorm" within a single model. Super team from @Mila_Quebec
Introducing RSA 🌀 (Recursive Self-Aggregation): unlocking deep thinking with test-time scaling 🔥 Qwen3-4B + RSA > DeepSeek-R1 📈 Gains across Qwen, Nemo, GPT-OSS 🏆 Benchmarks: Math • Reasoning Gym • Code ⚡ Aggregation-aware RL lets Qwen3-4B surpass o3-mini 🚀
Probably the first negative results on distillation! Very interesting findings!
1/ Excited to share the first in a series of my research updates on LLM pretraining 🚀. Our new work shows *distilled pretraining* (increasingly used to train deployable models) has trade-offs: ✅ Boosts test-time scaling ⚠️ Weakens in-context learning ✨ Needs tailored data curation
Mohammad Pezeshki retweeted
Today I'll teach my very first class at @Concordia. The course is COMP 6321 (Machine Learning). Looking forward to it!
Mohammad Pezeshki retweeted
LLMs are great at single-shot problems, but in the era of experience, interactive environments are key 🔑 Introducing *Multi-Turn Puzzles (MTP)*, a new benchmark to test multi-turn reasoning and strategizing. 🔗 Paper: huggingface.co/papers/2508.1… 🫙 Data: huggingface.co/datasets/aria…
Mohammad Pezeshki retweeted
We built a new autoregressive + RL image editing model using a strong verifier, and it beats SOTA diffusion baselines using 5× less data. 🔥 EARL: a simple, scalable RL pipeline for high-quality, controllable edits. 🧵 1/
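As a loose, hedged sketch of the verifier-as-reward idea (the editor/verifier interfaces and the plain REINFORCE-with-baseline update below are assumptions, not EARL's actual training recipe):

```python
# One toy RL step: sample edits from an autoregressive editor, score them with a
# verifier, and push up the log-probability of edits that beat the average score.
import torch

def rl_step(editor, verifier, image, instruction, optimizer, n_samples=4):
    rewards, logps = [], []
    for _ in range(n_samples):
        edit_tokens, logp = editor.sample(image, instruction)             # hypothetical: returns tokens + summed log-prob (with grad)
        rewards.append(verifier.score(image, edit_tokens, instruction))   # hypothetical verifier score
        logps.append(logp)
    rewards = torch.tensor(rewards)
    baseline = rewards.mean()                                    # simple variance-reduction baseline
    loss = -(torch.stack(logps) * (rewards - baseline)).mean()   # REINFORCE with baseline
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```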
Today at #ICML2025, we present Deliberate Practice: an approach to improve sample efficiency by generating harder, not more, examples.
Oral talk at 10:45 - West Ballroom B | Orals 3C: Data-Centric ML
Join us to discuss principled approaches to more efficient learning.
Excited to present our work "Improving the scaling laws of synthetic data with deliberate practice" tomorrow at #ICML2025 📢 Oral: Wed. 10:45 AM 📍 West Ballroom B (Oral 3C, Data-Centric ML) 🖼️ Poster: 🕚 11:00 AM – 1:30 PM 📍 East Exhibition Hall A-B (Poster Session 3 East)
Mohammad Pezeshki retweeted
📄 New Paper Alert! ✨ 🚀 Mixture of Recursions (MoR): Smaller models • Higher accuracy • Greater throughput. Across 135M–1.7B params, MoR carves a new Pareto frontier: equal training FLOPs yet lower perplexity, higher few-shot accuracy, and more than 2x throughput. Let's break it down! 🧵👇
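A rough, non-authoritative sketch of the recursion-with-routing idea behind MoR: one shared block is applied repeatedly and a lightweight router assigns each token its own recursion depth, so easy tokens exit early. The routing rule, block choice, and the dense (non-sparse) compute below are simplifications for illustration.

```python
# Illustrative Mixture-of-Recursions block: shared weights reused across recursion
# steps, per-token early exit via a router. A real implementation would gather only
# the still-active tokens instead of computing the block densely at every step.
import torch
import torch.nn as nn

class MixtureOfRecursions(nn.Module):
    def __init__(self, dim=256, max_recursions=4, nhead=8):
        super().__init__()
        # dim must be divisible by nhead
        self.shared_block = nn.TransformerEncoderLayer(dim, nhead=nhead, batch_first=True)
        self.router = nn.Linear(dim, max_recursions)    # scores how many recursion steps each token gets
        self.max_recursions = max_recursions

    def forward(self, x):                                # x: (batch, seq, dim)
        depth = self.router(x).argmax(dim=-1) + 1        # per-token recursion depth in [1, max_recursions]
        for step in range(self.max_recursions):
            active = (depth > step).unsqueeze(-1)        # tokens still recursing at this step
            x = torch.where(active, self.shared_block(x), x)   # shared weights reused at every depth
        return x
```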