Attention is all you need; at least the matrices are, if you want to distill Transformers into alternative architectures, like Mamba, with our new distillation method: MOHAWK! We also release a fully subquadratic, performant 1.5B model distilled from Phi-1.5 with only 3B tokens!

Aug 20, 2024 · 6:01 PM UTC

The key insight underpinning our method is that Attention, Linear Attention, Mamba, etc., are all sequence transformations that operate across the input length dimension. Thus, they all have their respective matrix mixers, e.g., Softmax(QK^T). /3
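To make the matrix-mixer view concrete, here's a minimal PyTorch sketch (ours, with made-up names, not code from the paper): causal softmax attention and causal linear attention both reduce to building an L×L mixing matrix and multiplying it into V.

```python
import torch
import torch.nn.functional as F

def softmax_attention_mixer(Q, K, causal_mask):
    # (L, L) matrix mixer of causal softmax attention: softmax(QK^T / sqrt(d)).
    scores = (Q @ K.T) / Q.shape[-1] ** 0.5
    scores = scores.masked_fill(~causal_mask, float("-inf"))
    return torch.softmax(scores, dim=-1)

def linear_attention_mixer(Q, K, causal_mask):
    # (L, L) matrix mixer of causal linear attention (elu(x)+1 feature map),
    # with the usual row-wise normalization.
    phi_q, phi_k = F.elu(Q) + 1, F.elu(K) + 1
    scores = (phi_q @ phi_k.T) * causal_mask
    return scores / scores.sum(dim=-1, keepdim=True)

L, d = 8, 16
Q, K, V = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
causal_mask = torch.tril(torch.ones(L, L, dtype=torch.bool))

# In both cases the sequence transformation is just "mixer @ V" over the length dimension.
out_attn   = softmax_attention_mixer(Q, K, causal_mask) @ V
out_linear = linear_attention_mixer(Q, K, causal_mask) @ V
```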
With this in mind, MOHAWK first matches the student's matrix mixers to the teacher's, then the hidden states at the end of each block, and finally the end-to-end model logits. Matrix Orientation + Hidden-state Alignment + Weight-transfer and Knowledge distillation = MOHAWK /4
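Roughly, the three stages optimize objectives like the following (a simplified sketch of our own; the function and tensor names are placeholders, and details such as per-block optimization and the exact losses follow the paper rather than this snippet).

```python
import torch
import torch.nn.functional as F

# Stage 1 -- Matrix Orientation: match each student block's materialized matrix
# mixer to the teacher's attention matrix, e.g. via a Frobenius-norm distance.
def matrix_orientation_loss(student_mixer, teacher_attn_matrix):
    return torch.linalg.matrix_norm(student_mixer - teacher_attn_matrix).mean()

# Stage 2 -- Hidden-state Alignment: match each block's output hidden states,
# with both blocks fed the teacher's hidden states as input.
def hidden_alignment_loss(student_hidden, teacher_hidden):
    return F.mse_loss(student_hidden, teacher_hidden)

# Stage 3 -- Weight transfer + Knowledge Distillation: copy the remaining
# (non-mixer) weights from the teacher, then distill end-to-end on the logits.
def kd_loss(student_logits, teacher_logits, T=1.0):
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
```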
To validate MOHAWK, we create a Mamba-2-based architecture, dubbed Phi-Mamba, and distill the original Phi-1.5 model into it. Using less than 1% of the usual pretraining data, it achieves stronger performance than many open-source subquadratic models at a similar scale! /5
The exact token split we use for Phi-Mamba stems from training laws we run on the three stages, where we aim to find the ideal token allocation given a fixed budget. We also have quite a few ablations that show each stage is important for downstream model performance. /6
Given the growing prevalence and better performance of hybrid models, we also release Hybrid-Phi-Mamba 1.5B. Distilled with 5B tokens, it performs comparably to other hybrid models trained on comparable datasets at the 1.5B scale, while using fewer Attention layers. /7
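To make "hybrid" concrete, here's a toy sketch of an interleaved layer stack (our own illustration with placeholder modules and arbitrary attention-layer placement, not Hybrid-Phi-Mamba's actual configuration).

```python
import torch.nn as nn

def build_hybrid_stack(n_layers=24, attn_layers=(5, 11, 17, 23), d_model=2048, n_heads=32):
    # Mostly Mamba-2 blocks with a few attention blocks interleaved.
    # nn.Identity() is a stand-in for a real Mamba-2 block implementation.
    blocks = []
    for i in range(n_layers):
        if i in attn_layers:
            blocks.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        else:
            blocks.append(nn.Identity())
    return nn.ModuleList(blocks)
```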
Why Mamba-2? We empirically find that it can approximate the self-attn mixer better than other alternatives like gated convs, linear attn, and RetNet. Although our final model uses Mamba, MOHAWK is a general-purpose distillation method for large families of sequence models! /8
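Part of why such comparisons are possible at all: a Mamba-2-style mixer (scalar-decay SSM) can be materialized as a structured lower-triangular L×L matrix and compared entrywise against softmax(QK^T). A rough sketch of the materialization (our own simplification, not the paper's code):

```python
import torch

def ssm_mixer_matrix(a, B, C):
    # Materialize the (L, L) mixer of a scalar-decay SSM recurrence
    # (Mamba-2 / SSD-style): M[i, j] = C_i^T (a_i * ... * a_{j+1}) B_j for j <= i.
    L = a.shape[0]
    M = torch.zeros(L, L)
    for i in range(L):
        decay = 1.0
        for j in range(i, -1, -1):
            M[i, j] = decay * (C[i] @ B[j])
            if j > 0:
                decay = decay * a[j]  # accumulate the state-transition decay
    return M

L, n = 8, 16
a = torch.rand(L)                          # per-step scalar decays in (0, 1)
B, C = torch.randn(L, n), torch.randn(L, n)
M = ssm_mixer_matrix(a, B, C)              # lower-triangular, comparable to softmax(QK^T)
```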
Of course, all these findings raise more questions: Do unique properties of the matrix mixer influence the distillation process? We used a lower-quality (compared to Phi's) dataset during our distillation; how does the distillation dataset quality affect the student model? 9/9
Thanks @cartesia_ai for the generous compute support that made this project possible. Great things brewing there 👀
Replying to @kevinyli_
Hi Kevin, great work! A quick question: do the student Mamba model and the teacher Phi model use the same tokenizer?
Yes, we use the same tokenizer for both the student and teacher!
Replying to @kevinyli_
Just realized how easily this could scale asynchronously across consumer GPUs. Each person could run a data-parallel layer and sync up at the end!
Replying to @kevinyli_
Great work! Why are the great Chinese/Asian brains still working for US universities when they have great places like @Tsinghua_Uni @PKU1898 @FudanUni? It's time to join and 🚀 with them instead.