A non-profit research lab focused on interpretability, alignment, and ethics of artificial intelligence. Creators of GPT-J, GPT-NeoX, Pythia, and VQGAN-CLIP

Joined August 2022
Can you train a performant language model without using unlicensed text? We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 and 2
I’m at #EMNLP2025 this week in Suzhou: At @mrl_workshop, I will present the results of our shared task, Global PIQA. @jenniferlumeng will also be presenting our work with @ruochenz_ at Blackbox NLP. Looking forward to chatting about multilingual evaluation and tokenization!
Our #NeurIPS2025 paper shows that even comparable monolingual tokenizers have different compression rates across languages. But by getting rid of whitespace tokenization and using a custom vocab size for each language, we can reduce token premiums. Preprint out now!
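To make the idea of token premiums concrete, here is a minimal sketch (not the paper's actual method): the same sentence costs more tokens per byte in a language that the tokenizer's vocabulary covers poorly. The toy English-only vocab and parallel sentences below are illustrative assumptions.

```python
# Toy demonstration of a "token premium": text in an out-of-vocabulary
# language pays more tokens per byte than in-vocabulary text.

def english_centric_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Emit a whole-word token when the word is in-vocab; otherwise fall
    back to single characters, mimicking subword/byte fallback."""
    tokens: list[str] = []
    for word in text.split():
        if word in vocab:
            tokens.append(word)
        else:
            tokens.extend(word)
    return tokens

def tokens_per_byte(text: str, vocab: set[str]) -> float:
    """Fertility: tokens emitted per byte of raw UTF-8 text (lower = cheaper)."""
    return len(english_centric_tokenize(text, vocab)) / len(text.encode("utf-8"))

# A vocabulary "trained" on English only.
vocab = {"The", "cat", "sat", "on", "the", "mat"}

english = "The cat sat on the mat"
german = "Die Katze sass auf der Matte"  # same meaning, but out-of-vocab words

premium = tokens_per_byte(german, vocab) / tokens_per_byte(english, vocab)
print(f"Token premium for German under an English-centric vocab: {premium:.2f}x")
```

The paper's fix (per-language vocab sizes, no whitespace pretokenization) amounts to choosing the vocabulary so that this ratio stays close to 1 for every language.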
Pythia changed my career trajectory significantly!! My constant advice to people trying to enter the AI research landscape is to find good open-source communities and hang out and collaborate there. You'll find good friends and get to work on great ideas 🚀
It's been great to see this group grow from using the data and models we produced to training their own from scratch. One of the underrated aspects of open science in ML is the ability to teach people and upskill other groups. Two years ago @Aflah02101 contributed to Pythia. Now he's co-leading a project involving training models from scratch. It was our pleasure to help them get their training set-up working to their satisfaction, and we can't wait to see what they do next.
Full thread:
Announcing 🔭✨Hubble, a suite of open-source LLMs to advance the study of memorization! Pretrained models up to 8B params, with controlled insertion of texts (e.g., book passages, biographies, test sets, and more!) designed to emulate key memorization risks 🧵
Replying to @johntzwei
For this project, @NVIDIAAI provided 200K A100 hours on DGX cloud through @NSF NAIRR, @huggingface provided 100TB of storage, and training used @AiEleuther's NeoX. Thank you for your commitment to open-source science! Your gift to us is now a gift to our research community!
EleutherAI retweeted
It's actually kinda hard to elicit reward hacking from coding models! We pretty much failed to get it to happen in a standard RL setup, but we do find that finetuning a model on prewritten hacks can cause generalization to hacking on unseen problems. blog.eleuther.ai/reward_hack…
We are announcing an opportunity for paid question writers to contribute to a new PhD-level math benchmark. Accepted contributors will be paid per question and will be invited to be authors on the resulting dataset paper. Check out the link below for more information!
EleutherAI retweeted
And w/ @CommonCrawl and @MLCommons we are hosting the 1st Workshop on Multilingual Data Quality Signals. The core purpose of this workshop is to improve our ability to automatically identify high quality data in diverse languages. Join us on Oct 10th! wmdqs.org
EleutherAI retweeted
RWKV v7 was presented yesterday, but if you're interested in RADLADS the poster is today 4:30-6:30 #31. Really great work by @smerkyg @eric_alcaide and @picocreator! arxiv.org/abs/2505.03005
As a final transparency note, we had two papers rejected from CoLM: "Evaluating SAE interpretability without explanations" and "Sparse Autoencoders Trained on the Same Data Learn Different Features." Both papers have preprints on arXiv, and we hope to present them at another conference soon.
RWKV v7 is our latest work on attention-free architectures, featuring a powerful 3B model trained for 3.1T tokens and a proof that the architecture can recognize a broader class of problems than transformers under standard CC assumptions. arxiv.org/abs/2503.14456
We're excited (albeit a little late) to share two papers and one event at #COLM2025! In collaboration with @recursal_AI and @RWKV_AI, we are presenting the newest iteration of the RWKV architecture and RADLADS, a cheap and efficient way to convert transformers into RNNs with minimal loss in performance.
E2LM asks the community to build stage-appropriate early eval signals: 0.5B/1B/3B models with intermediate checkpoints, multiple data mixtures, Colab-friendly baselines, and prizes! Build evals that yield smooth, informative learning curves early on.
While pretraining loss is predictable with scale, early-stage models often look noisy on standard evals. MCQA is unpredictable since it depends on probability mass over incorrect options. Completion-style signals often correlate better. arxiv.org/abs/2406.04391
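A toy illustration (an assumed setup, not the paper's code) of why MCQA accuracy is a noisy early signal while completion-style likelihood is smooth: accuracy only flips when the correct option's log-prob overtakes every distractor, but the probability mass on the correct option moves continuously as the model improves. The checkpoint log-probs below are hypothetical.

```python
import math

def mcqa_accuracy(option_logprobs: list[float], correct: int) -> float:
    """Discrete 0/1 score: did the correct option get the highest log-prob?"""
    best = max(range(len(option_logprobs)), key=lambda i: option_logprobs[i])
    return 1.0 if best == correct else 0.0

def completion_score(option_logprobs: list[float], correct: int) -> float:
    """Continuous score: softmax mass on the correct option."""
    probs = [math.exp(lp) for lp in option_logprobs]
    return probs[correct] / sum(probs)

# Hypothetical log-probs for 4 answer options at two training checkpoints;
# option 0 is correct and gains a little probability between them.
early = [-2.00, -1.90, -2.10, -2.20]
later = [-1.85, -1.90, -2.10, -2.20]

print(mcqa_accuracy(early, 0), mcqa_accuracy(later, 0))        # jumps 0.0 -> 1.0
print(completion_score(early, 0), completion_score(later, 0))  # rises gradually
```

A small shift in log-probs flips the discrete accuracy from 0 to 1, while the continuous completion score only nudges upward, which is why the latter gives smoother learning curves early in training.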
If you're interested in building better eval metrics, definitely check out this NeurIPS competition! Goal: design evals that actually track early LM learning (0–200B tokens) NeurIPS 2025 E2LM Competition e2lmc.github.io/
I’m in Montreal this week for @COLM_conf and @wmdqs! Looking forward to chatting about tokenizers, multilingual data, and more! #COLM2025
"Tokenizer-free" models are all the rage, but are they actually free of tokenization? @linguist_cat digs into why allegedly tokenizer-free models are really "tokenizer-hidden" models and what it looks like to think rigorously about tokenization.
I have a new blog post about the so-called “tokenizer-free” approach to language modeling and why it’s not tokenizer-free at all. I also talk about why people hate tokenizers so much!
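A small sketch of the "tokenizer-hidden" point: a byte-level model has no learned tokenizer, but it still segments text through a fixed scheme (UTF-8) that makes hidden decisions about how characters split into IDs.

```python
# "Tokenizer-free" byte models still map text onto a fixed 256-ID vocab;
# the segmentation policy is just UTF-8 instead of a learned BPE merge table.

def byte_tokenize(text: str) -> list[int]:
    """Tokenize by treating each UTF-8 byte as a token ID in [0, 255]."""
    return list(text.encode("utf-8"))

print(byte_tokenize("cat"))  # one byte-token per ASCII character: [99, 97, 116]
print(byte_tokenize("é"))    # one character, two byte-tokens: [195, 169]
```

The second example is the crux: a single accented character becomes two tokens purely because of how UTF-8 encodes it, a segmentation choice the model never gets to learn.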