Attention is all you need; at least the matrices are, if you want to distill Transformers into alternative architectures, like Mamba, with our new distillation method: MOHAWK! We also release a fully subquadratic, performant 1.5B model distilled from Phi-1.5 with only 3B tokens!

Aug 20, 2024 · 6:01 PM UTC

The key insight underpinning our method is that Attention, Linear Attention, Mamba, etc., are all sequence transformations that operate across the input length dimension. Thus, they all have their respective matrix mixers, e.g., Softmax(QK^T). /3
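To make the matrix-mixer view concrete, here's a minimal PyTorch sketch (ours, with made-up names, not code from the paper): causal softmax attention and causal linear attention both reduce to building an L×L mixing matrix and multiplying it into V.

```python
import torch
import torch.nn.functional as F

def softmax_attention_mixer(Q, K, causal_mask):
    # (L, L) matrix mixer of causal softmax attention: softmax(QK^T / sqrt(d)).
    scores = (Q @ K.T) / Q.shape[-1] ** 0.5
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

def linear_attention_mixer(Q, K, causal_mask):
    # (L, L) matrix mixer of causal linear attention (elu(x)+1 feature map),
    # with the usual row-wise normalization.
    phi_q, phi_k = F.elu(Q) + 1, F.elu(K) + 1
    scores = (phi_q @ phi_k.T) * causal_mask
    return scores / scores.sum(dim=-1, keepdim=True)

L, d = 8, 16
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
causal_mask = torch.tril(torch.ones(L, L, dtype=torch.bool))

# In both cases the sequence transformation is just "mixer @ V" over the length dimension.
out_attn   = softmax_attention_mixer(Q, K, causal_mask) @ V
out_linear = linear_attention_mixer(Q, K, causal_mask) @ V
```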
With this in mind, MOHAWK first matches the student's matrix mixers to the teacher's, then the hidden states at the end of each block, and finally the end-to-end model logits. Matrix Orientation + Hidden-state Alignment + Weight-transfer and Knowledge distillation = MOHAWK /4
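Roughly, the three stages optimize objectives like the following (a simplified sketch of our own; the function and tensor names are placeholders, and details such as per-block optimization and the exact losses follow the paper rather than this snippet).

```python
import torch
import torch.nn.functional as F

# Stage 1 -- Matrix Orientation: match each student block's materialized matrix
# mixer to the teacher's attention matrix, e.g. via a Frobenius-norm distance.
def matrix_orientation_loss(student_mixer, teacher_attn_matrix):
    return torch.linalg.matrix_norm(student_mixer - teacher_attn_matrix).mean()

# Stage 2 -- Hidden-state Alignment: match each block's output hidden states,
# with both blocks fed the teacher's hidden states as input.
def hidden_alignment_loss(student_hidden, teacher_hidden):
    return F.mse_loss(student_hidden, teacher_hidden)

# Stage 3 -- Weight transfer + Knowledge Distillation: copy the remaining
# (non-mixer) weights from the teacher, then distill end-to-end on the logits.
def kd_loss(student_logits, teacher_logits, T=1.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
```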
To validate MOHAWK, we create a Mamba-2-based architecture, dubbed Phi-Mamba, and distill the original Phi-1.5 model into it. Using less than 1% of the usual pretraining data, it achieves stronger performance than many open-source subquadratic models at a similar scale! /5
The exact token split we use for Phi-Mamba stems from training laws we run on the three stages, where we aim to find the ideal token allocation given a fixed budget. We also have quite a few ablations that show each stage is important for downstream model performance. /6
Given the growing prevalence and better performance of hybrid models, we also release Hybrid-Phi-Mamba 1.5B. Distilled with 5B tokens, it performs comparably to other hybrid models trained on comparable datasets at the 1.5B scale, while using fewer Attention layers. /7
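To make "hybrid" concrete, here's a toy sketch of an interleaved layer stack (our own illustration with placeholder modules and arbitrary attention-layer placement, not Hybrid-Phi-Mamba's actual configuration).

```python
import torch.nn as nn

def build_hybrid_stack(n_layers=24, attn_layers=(5, 11, 17, 23), d_model=2048, n_heads=32):
    # Mostly Mamba-2 blocks with a few attention blocks interleaved.
    # nn.Identity() is a stand-in for a real Mamba-2 block implementation.
    blocks = []
    for i in range(n_layers):
        if i in attn_layers:
            blocks.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        else:
            blocks.append(nn.Identity())
    return nn.ModuleList(blocks)
```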
Why Mamba-2? We empirically find that it can approximate the self-attn mixer better than other alternatives like gated convs, linear attn, and RetNet. Although our final model uses Mamba, MOHAWK is a general-purpose distillation method for large families of sequence models! /8
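Part of why such comparisons are possible at all: a Mamba-2-style mixer (scalar-decay SSM) can be materialized as a structured lower-triangular L×L matrix and compared entrywise against softmax(QK^T). A rough sketch of the materialization (our own simplification, not the paper's code):

```python
import torch

def ssm_mixer_matrix(a, B, C):
    # Materialize the (L, L) mixer of a scalar-decay SSM recurrence
    # (Mamba-2 / SSD-style): M[i, j] = C_i^T (a_i * ... * a_{j+1}) B_j for j <= i.
    L = a.shape[0]
    M = torch.zeros(L, L)
    for i in range(L):
        decay = 1.0
        for j in range(i, -1, -1):
            M[i, j] = decay * (C[i] @ B[j])
            if j > 0:
                decay = decay * a[j]  # accumulate the state-transition decay
    return M

L, n = 8, 16
a = torch.rand(L)                          # per-step scalar decays in (0, 1)
B, C = torch.randn(L, n), torch.randn(L, n)
M = ssm_mixer_matrix(a, B, C)              # lower-triangular, comparable to softmax(QK^T)
```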
Of course, all these findings raise more questions: Do unique properties of the matrix mixer influence the distillation process? We used a lower-quality (compared to Phi's) dataset during our distillation; how does the distillation dataset quality affect the student model? 9/9
Thanks @cartesia_ai for the generous compute support that made this project possible. Great things brewing there 👀
Replying to @kevinyli_
Hi Kevin, great work! A quick question: do the student Mamba model and the teacher Phi model use the same tokenizer?
Yes, we use the same tokenizer for both the student and teacher!
Replying to @kevinyli_
Just realized how easily this could scale asynchronously across consumer GPUs. Each person could run a data-parallel layer and sync up at the end!
Replying to @kevinyli_
Great work! Why are the great Chinese/Asian brains still working for US universities when they have great places like @Tsinghua_Uni @PKU1898 @FudanUni? It's time to join and 🚀 with them instead.