Attention is all you need, or at least its matrices are, if you want to distill Transformers into alternative architectures such as Mamba with our new distillation method: MOHAWK!
We also release a fully subquadratic, performant 1.5B-parameter model distilled from Phi-1.5 using only 3B tokens!