Nearly 3 years after our NeurIPS paper, SOTA architectures are now adopting NoPE. Kimi Linear uses NoPE for all full-attention layers (not a RoPE hybrid).
The brilliant Kimi Linear paper.
It's a hybrid attention architecture that beats full attention while cutting the key-value cache by up to 75% and delivering up to 6x faster decoding at 1M-token context.
Full attention is slow at long context because it compares every token with every other token and stores all past keys and values, so compute grows quadratically and the cache grows with length.
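To make that concrete, here's a minimal single-head decoding sketch in numpy (illustrative dimensions, no real model weights): every step scores the query against the whole cache, and the cache keeps growing.

```python
# Minimal sketch: why full-attention decoding cost grows with length.
# Each step compares the query with *every* cached key, and the cache
# gains one (k, v) pair per token -- memory O(T), total compute O(T^2).
import numpy as np

d = 8                      # head dimension (illustrative)
T = 16                     # number of decode steps
rng = np.random.default_rng(0)

K_cache, V_cache = [], []  # grows without bound as the sequence gets longer
for t in range(T):
    q = rng.standard_normal(d)
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    K_cache.append(k)
    V_cache.append(v)

    K = np.stack(K_cache)              # (t+1, d) -- all past keys
    V = np.stack(V_cache)              # (t+1, d) -- all past values
    scores = K @ q / np.sqrt(d)        # compare query against every past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ V                  # (d,)

print("cached pairs after", T, "steps:", len(K_cache))  # == T, keeps growing
```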
Kimi Linear speeds this up by keeping a small fixed memory per head and updating it step by step like a running summary, so compute and memory stop growing with length.
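A minimal sketch of that running-summary idea, in the same toy setup: the state is one fixed d-by-d matrix per head, written with each new key-value pair and read with the query, so per-step cost stays constant.

```python
# Minimal sketch: a linear-attention-style "running summary".
# The state S is a fixed (d x d) matrix per head; each step folds the new
# (key, value) pair into S and reads it out with the query. Memory and
# per-step compute stay constant no matter how long the sequence gets.
import numpy as np

d = 8
T = 16
rng = np.random.default_rng(0)

S = np.zeros((d, d))            # fixed-size state, does NOT grow with T
for t in range(T):
    q = rng.standard_normal(d)
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)

    S = S + np.outer(k, v)      # write: accumulate key -> value association
    out = S.T @ q               # read: query the summary, O(d^2) per step

print("state size is always", S.shape, "regardless of sequence length")
```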
Their new Kimi Delta Attention (KDA) adds a per-channel forget gate, which means each feature can separately decide what to keep and what to fade, so useful details remain and clutter goes away.
They also add a tiny corrective (delta-rule) update on every step, which nudges the memory toward the right mapping between keys and values instead of just piling on more data.
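A hedged sketch of how a per-channel forget gate and a delta-style correction fit together, in the spirit of KDA. The gate values, the scalar write strength `beta`, and the exact update order here are illustrative assumptions, not the paper's exact parameterization.

```python
# Hedged sketch of a gated delta-rule update: per-channel decay of the state,
# then a corrective write that moves the memory toward the right value for
# the current key (instead of blindly accumulating).
import numpy as np

d = 8
T = 16
rng = np.random.default_rng(0)

S = np.zeros((d, d))                      # fixed-size key -> value memory
for t in range(T):
    q = rng.standard_normal(d)
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    a = rng.uniform(0.9, 1.0, size=d)     # per-channel forget gate (learned in the real model)
    beta = 0.5                            # write strength (learned in the real model)

    S = a[:, None] * S                    # each channel fades at its own rate
    v_pred = S.T @ k                      # what the memory currently returns for this key
    S = S + beta * np.outer(k, v - v_pred)  # corrective (delta) update toward the right value
    out = S.T @ q
```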
The model stacks 3 of these fast KDA layers for every 1 full-attention layer, so it still gets occasional global mixing while cutting the key-value cache by roughly 75%.
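A quick back-of-the-envelope on that cache saving, assuming a hypothetical 48-layer stack in the 3:1 pattern (the real layer count is in the paper): only the full-attention layers keep a growing KV cache.

```python
# Back-of-the-envelope sketch of the cache saving from the 3:1 layout.
# Illustrative layer count; the actual model config is in the paper.
layers = 48                      # hypothetical total layer count
full_attn_layers = layers // 4   # 1 full-attention layer per 3 KDA layers
kv_cache_fraction = full_attn_layers / layers
print(f"KV cache kept: {kv_cache_fraction:.0%} -> roughly 75% reduction")
# KDA layers carry only a fixed-size state per head, so they add no growing cache.
```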
The full-attention layers run with no positional encoding (NoPE), and KDA learns order and recency itself, which simplifies the stack and helps at long range.
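A minimal sketch of what NoPE full attention means in practice: score raw q·k under a causal mask, with no rotary or other positional transform applied to the queries or keys (illustrative tensors, single head).

```python
# Minimal sketch: "NoPE" attention scores raw q.k under a causal mask only.
# No RoPE rotation is applied to Q or K; position comes from the causal
# structure and from the surrounding KDA layers.
import numpy as np

T, d = 6, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

scores = Q @ K.T / np.sqrt(d)                      # no positional transform anywhere
mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask only
scores[mask] = -np.inf
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V
```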
Under the hood, a chunkwise algorithm combined with a constrained diagonal-plus-low-rank design removes unstable divisions and drops several big matrix multiplies, so the kernels run much faster on GPUs.
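A hedged sketch of the chunkwise idea, using plain (ungated) linear attention for clarity: dense matmuls inside each chunk, a small carried state between chunks. The paper's actual KDA kernel additionally folds in the per-channel gates via its constrained diagonal-plus-low-rank formulation, which this sketch does not reproduce.

```python
# Hedged sketch of chunkwise processing: do GPU-friendly dense matmuls inside
# each chunk and carry a small recurrent state across chunks. Ungated linear
# attention only; the real KDA kernel is more involved.
import numpy as np

T, d, C = 32, 8, 8                    # sequence length, head dim, chunk size
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

S = np.zeros((d, d))                  # carried state: summary of all past chunks
causal = np.tril(np.ones((C, C)))     # causal mask inside a chunk
out = np.zeros((T, d))

for start in range(0, T, C):
    Qc, Kc, Vc = (M[start:start + C] for M in (Q, K, V))
    inter = Qc @ S                            # contribution from earlier chunks
    intra = ((Qc @ Kc.T) * causal) @ Vc       # contribution from inside this chunk
    out[start:start + C] = inter + intra
    S = S + Kc.T @ Vc                         # fold this chunk into the state

# `out` matches the token-by-token recurrence, but runs as a few big matmuls.
```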
With the same training setup, it scores higher on common benchmarks, long-context retrieval, and math reinforcement learning, while staying fast even at 1M tokens.
It drops into existing systems, saves memory, scales to 1M tokens, and improves accuracy without changes to the serving stack.
----
Paper → arxiv.org/abs/2510.26692
Paper Title: "Kimi Linear: An Expressive, Efficient Attention Architecture"