(1/10) Excited to share one of the most elegant projects I’ve worked on: Parallelizing Linear Transformers with the Delta Rule over Sequence Length! 🎉
📄 Published at NeurIPS ‘24
📍 Catch my poster in person:
NeurIPS East Exhibit Hall A-C #2009
🗓️ Fri, Dec 13 | 4:30–7:30 p.m.
(2/10) 📜 Paper: arxiv.org/abs/2406.06484
🤖 Model: huggingface.co/fla-hub
I’ve written a 3-part blog series about DeltaNet!
📖 Part I: The Model
sustcsonglin.github.io/blog/…
📖 Part II: The Algorithm
sustcsonglin.github.io/blog/…
📖 Part III: The Neural Architecture
sustcsonglin.github.io/blog/…
(3/10) Linear attention models are efficient alternatives to softmax Transformers. Yet, they struggle with in-context associative recall—DeltaNet fixes this with an elegant update rule:
1️⃣ Query the memory with the current key
2️⃣ Retrieve the old value stored under that key
3️⃣ Use a gating term β to decide how much of the new value overwrites the old one (sketch below).
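In code, one step of that rule looks roughly like this (my own minimal PyTorch sketch for a single head; tensor names and shapes are illustrative, not the released implementation):

```python
import torch

def delta_rule_step(S, q_t, k_t, v_t, beta_t):
    """One recurrent delta-rule step for a single head (illustrative sketch).
    S is a (d_v x d_k) associative memory mapping keys to values."""
    v_old = S @ k_t                               # 1-2) query the memory with k_t and retrieve the stored value
    v_new = beta_t * v_t + (1 - beta_t) * v_old   # 3) beta decides how much of the new value to keep
    S = S + torch.outer(v_new - v_old, k_t)       # delta-rule write: swap the old value for the mix
    return S, S @ q_t                             # read out with the query
```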
(5/10) We restructured DeltaNet as a linear recurrence built from associative ops (matmul, +); a chunk-wise algorithm then keeps memory and compute costs manageable.
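Written out (my notation; S_t is the key→value memory, o_t the output), the update is exactly such a recurrence:

```latex
S_t = S_{t-1}\left(I - \beta_t\, k_t k_t^{\top}\right) + \beta_t\, v_t k_t^{\top},
\qquad o_t = S_t\, q_t
```

Unrolling it over time gives running products of (I − βᵢ kᵢ kᵢᵀ) factors, which is where the trick in the next tweet comes in.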
(6/10) The core trick? Using the WY representation for products of (generalized) Householder matrices to reduce memory: cumulative products become cumulative sums, which makes the computation amenable to a chunkwise algorithm.
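Here’s a tiny numerical check I wrote (my own toy code, not the paper’s) of the identity behind the trick: the running product of (I − βᵢ kᵢ kᵢᵀ) factors collapses to I − Σᵢ wᵢ kᵢᵀ, where each wᵢ is built from inner products alone:

```python
import torch

torch.manual_seed(0)
C, d = 4, 8                                   # chunk length, head dimension
k = torch.nn.functional.normalize(torch.randn(C, d), dim=-1)
beta = torch.rand(C)

# naive: cumulative *product* of generalized Householder factors
P = torch.eye(d)
for i in range(C):
    P = P @ (torch.eye(d) - beta[i] * torch.outer(k[i], k[i]))

# WY form: the same product is I - sum_i w_i k_i^T, where each w_i only needs
# inner products of keys and previously computed w's (sums, not matrix products)
w = torch.zeros(C, d)
for i in range(C):
    w[i] = beta[i] * (k[i] - sum((k[j] @ k[i]) * w[j] for j in range(i)))
P_wy = torch.eye(d) - sum(torch.outer(w[i], k[i]) for i in range(C))

print(torch.allclose(P, P_wy, atol=1e-5))     # should print True
```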
(7/10) The final chunkwise DeltaNet form mirrors linear attention and is built almost entirely from matrix multiplications, which map well onto tensor-core GPUs and enable fast training.
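To show the pattern being referred to, here is the chunkwise skeleton for plain (unnormalized) linear attention: an intra-chunk masked matmul plus an inter-chunk state, all matrix multiplies. This is my own simplified single-head sketch, not the actual DeltaNet kernel, which additionally folds in the WY terms above:

```python
import torch

def chunkwise_linear_attn(q, k, v, chunk=64):
    """Chunkwise plain linear attention, single head (illustrative only)."""
    T, d_k = q.shape
    mask = torch.tril(torch.ones(chunk, chunk))        # causal mask inside a chunk
    S = q.new_zeros(d_k, v.shape[-1])                  # inter-chunk state (d_k x d_v)
    out = torch.empty_like(v)
    for s in range(0, T, chunk):
        Q, K, V = q[s:s + chunk], k[s:s + chunk], v[s:s + chunk]
        c = Q.shape[0]
        A = (Q @ K.T) * mask[:c, :c]                   # intra-chunk "attention" matmul
        out[s:s + chunk] = Q @ S + A @ V               # state readout + intra-chunk part
        S = S + K.T @ V                                # carry the state to the next chunk
    return out
```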
(8/10) We modernized DeltaNet with enhancements like short convolution, SiLU activation, and QK/output normalization—details in my blog series linked above!
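Stitched together, the layer might look something like this sketch (my own guess at the wiring from the description above; the shapes, the single full-width head, and the naive sequential scan are assumptions, not the released implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeltaNetLayerSketch(nn.Module):
    """Non-authoritative sketch of the modernized token-mixing layer: short
    depthwise causal convs, SiLU, L2-normalized q/k, a naive delta-rule scan,
    and output normalization. Layout and sizes are my assumptions."""
    def __init__(self, d_model, conv_size=4):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.beta_proj = nn.Linear(d_model, 1, bias=False)      # per-token gate
        self.convs = nn.ModuleList([
            nn.Conv1d(d_model, d_model, conv_size, groups=d_model,
                      padding=conv_size - 1)                    # "short convolution"
            for _ in range(3)])
        self.out_norm = nn.LayerNorm(d_model)                   # output normalization
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        B, T, D = x.shape
        def mix(h, conv):                                       # causal short conv + SiLU
            return F.silu(conv(h.transpose(1, 2))[..., :T].transpose(1, 2))
        q = F.normalize(mix(self.q_proj(x), self.convs[0]), dim=-1)  # QK L2 normalization
        k = F.normalize(mix(self.k_proj(x), self.convs[1]), dim=-1)
        v = mix(self.v_proj(x), self.convs[2])
        beta = torch.sigmoid(self.beta_proj(x))                 # (B, T, 1), in (0, 1)
        # naive sequential delta-rule scan, just to make the sketch end to end;
        # the paper's whole point is replacing this loop with the chunkwise form
        S = x.new_zeros(B, D, D)                                # memory: keys -> values
        outs = []
        for t in range(T):
            v_old = torch.einsum('bvk,bk->bv', S, k[:, t])
            S = S + torch.einsum('bv,bk->bvk', beta[:, t] * (v[:, t] - v_old), k[:, t])
            outs.append(torch.einsum('bvk,bk->bv', S, q[:, t]))
        return self.o_proj(self.out_norm(torch.stack(outs, dim=1)))
```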
(9/10) DeltaNet achieves strong performance against other RNN models and excels at tasks requiring in-context recall, though it still trails Transformers there. Hybrid models that combine DeltaNet with attention layers close this gap!
(10/10) We scaled DeltaNet to 3B parameters, trained on 1T tokens. Large-scale hybrid models combining DeltaNet + attention mechanisms are in the works—stay tuned!