A non-profit research lab focused on interpretability, alignment, and ethics of artificial intelligence. Creators of GPT-J, GPT-NeoX, Pythia, and VQGAN-CLIP

Joined August 2022
Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 and 2
I’m at #EMNLP2025 this week in Suzhou: At @mrl_workshop, I will present the results of our shared task, Global PIQA. @jenniferlumeng will also be presenting our work with @ruochenz_ at Blackbox NLP. Looking forward to chatting about multilingual evaluation and tokenization!
Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!
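To make the idea of token premiums concrete, here is a minimal sketch (not the paper's actual method): the same sentence costs more tokens per byte in a language that the tokenizer's vocabulary covers poorly. The toy English-only vocab and parallel sentences below are illustrative assumptions.

```python
# Toy demonstration of a "token premium": text in an out-of-vocabulary
# language pays more tokens per byte than in-vocabulary text.

def english_centric_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Emit a whole-word token when the word is in-vocab; otherwise fall
    back to single characters, mimicking subword/byte fallback."""
    tokens: list[str] = []
    for word in text.split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(word)
    return tokens

def tokens_per_byte(text: str, vocab: set[str]) -> float:
    """Fertility: tokens emitted per byte of raw UTF-8 text (lower = cheaper)."""
    return len(english_centric_tokenize(text, vocab)) / len(text.encode("utf-8"))

# A vocabulary "trained" on English only.
vocab = {"The", "cat", "sat", "on", "the", "mat"}

english = "The cat sat on the mat"
german = "Die Katze sass auf der Matte"  # same meaning, but out-of-vocab words

premium = tokens_per_byte(german, vocab) / tokens_per_byte(english, vocab)
print(f"Token premium for German under an English-centric vocab: {premium:.2f}x")
```

The paper's fix (per-language vocab sizes, no whitespace pretokenization) amounts to choosing the vocabulary so that this ratio stays close to 1 for every language.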
Pythia changed my career trajectory significantly!! My constant advice to people trying to enter the AI research landscape is to find good open-source communities and hang out and collaborate there. You'll find good friends and get to work on great ideas 🚀
It's been great to see this group grow from using the data and models we produced to training their own from scratch. One of the underrated aspects of open science in ML is the ability to teach people and upskill other groups. Two years ago @Aflah02101 contributed to Pythia. Now he's co-leading a project involving training models from scratch. It was our pleasure to help them get their training set-up working to their satisfaction, and we can't wait to see what they do next.
Full thread:
Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵
Replying to @johntzwei
For this project, @NVIDIAAI provided 200K A100 hours on DGX cloud through @NSF NAIRR, @huggingface provided 100TB of storage, and training used @AiEleuther's NeoX. Thank you for your commitment to open-source science! Your gift to us is now a gift to our research community!
EleutherAI retweeted
It's actually kinda hard to elicit reward hacking from coding models! We pretty much failed to get it to happen in a standard RL setup, but we do find that finetuning a model on prewritten hacks can cause generalization to hacking on unseen problems. blog.eleuther.ai/reward_hack…
We are announcing an opportunity for paid question writers to contribute to a new PhD-level math benchmark. Accepted contributors will be paid per question and will be invited to be authors on the resulting dataset paper. Check out the link below for more information!
EleutherAI retweeted
And w/ @CommonCrawl and @MLCommons we are hosting the 1st Workshop on Multilingual Data Quality Signals. The core purpose of this workshop is to improve our ability to automatically identify high quality data in diverse languages. Join us on Oct 10th! wmdqs.org
EleutherAI retweeted
RWKV v7 was presented yesterday, but if you're interested in RADLADS the poster is today 4:30-6:30 #31. Really great work by @smerkyg @eric_alcaide and @picocreator! arxiv.org/abs/2505.03005
As a final transparency note, we had two papers rejected from CoLM: "Evaluating SAE interpretability without explanations" and "Sparse Autoencoders Trained on the Same Data Learn Different Features." Both papers have preprints on arXiv, and we hope to present them at another conference soon.
RWKV v7 is our latest work on attention-free architectures, featuring a powerful 3B model trained for 3.1T tokens and a proof that the architecture can recognize a broader class of problems than transformers under standard CC assumptions. arxiv.org/abs/2503.14456
We're excited (albeit a little late) to share two papers and one event at #COLM2025! In collaboration with @recursal_AI and @RWKV_AI, we are presenting the newest iteration of the RWKV architecture and RADLADS, a cheap and efficient way to convert transformers into RNNs with minimal loss in performance.
E2LM asks the community to build stage-appropriate early eval signals: 0.5B/1B/3B models with intermediate checkpoints, multiple data mixtures, Colab-friendly baselines, and prizes! Build evals that yield smooth, informative learning curves early on.
While pretraining loss is predictable with scale, early-stage models often look noisy on standard evals. MCQA is unpredictable since it depends on probability mass over incorrect options. Completion-style signals often correlate better. arxiv.org/abs/2406.04391
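A toy illustration (an assumed setup, not the paper's code) of why MCQA accuracy is a noisy early signal while completion-style likelihood is smooth: accuracy only flips when the correct option's log-prob overtakes every distractor, but the probability mass on the correct option moves continuously as the model improves. The checkpoint log-probs below are hypothetical.

```python
import math

def mcqa_accuracy(option_logprobs: list[float], correct: int) -> float:
    """Discrete 0/1 score: did the correct option get the highest log-prob?"""
    best = max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])
    return 1.0 if best == correct else 0.0

def completion_score(option_logprobs: list[float], correct: int) -> float:
    """Continuous score: softmax mass on the correct option."""
    probs = [math.exp(lp) for lp in option_logprobs]
    return probs[correct] / sum(probs)

# Hypothetical log-probs for 4 answer options at two training checkpoints;
# option 0 is correct and gains a little probability between them.
early = [-2.00, -1.90, -2.10, -2.20]
later = [-1.85, -1.90, -2.10, -2.20]

print(mcqa_accuracy(early, 0), mcqa_accuracy(later, 0))        # jumps 0.0 -> 1.0
print(completion_score(early, 0), completion_score(later, 0))  # rises gradually
```

A small shift in log-probs flips the discrete accuracy from 0 to 1, while the continuous completion score only nudges upward, which is why the latter gives smoother learning curves early in training.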
If you're interested in building better eval metrics, definitely check out this NeurIPS competition! Goal: design evals that actually track early LM learning (0–200B tokens) NeurIPS 2025 E2LM Competition e2lmc.github.io/
I’m in Montreal this week for @COLM_conf and @wmdqs! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025
"Tokenizer-free" models are all the rage, but are they actually free of tokenization? @linguist_cat digs into why allegedly tokenizer-free models are really "tokenizer-hidden" models and what it looks like to think rigorously about tokenization.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
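A small sketch of the "tokenizer-hidden" point: a byte-level model has no learned tokenizer, but it still segments text through a fixed scheme (UTF-8) that makes hidden decisions about how characters split into IDs.

```python
# "Tokenizer-free" byte models still map text onto a fixed 256-ID vocab;
# the segmentation policy is just UTF-8 instead of a learned BPE merge table.

def byte_tokenize(text: str) -> list[int]:
    """Tokenize by treating each UTF-8 byte as a token ID in [0, 255]."""
    return list(text.encode("utf-8"))

print(byte_tokenize("cat"))  # one byte-token per ASCII character: [99, 97, 116]
print(byte_tokenize("é"))    # one character, two byte-tokens: [195, 169]
```

The second example is the crux: a single accented character becomes two tokens purely because of how UTF-8 encodes it, a segmentation choice the model never gets to learn.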