In the Shazeer 2019 paper "Fast Transformer Decoding: One Write-Head is All You Need", section 2.4 gives the MultiheadSelfAttentionIncremental function.
The calculation of new_K (and new_V) uses M, which isn't defined anywhere in the function. That's a typo for x (i.e. s/M/x/), correct?
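Substituting x seems to be the only reading that type-checks: the new key/value row has to be a projection of the current position's [d]-shaped input, and x is the only such vector in scope. For concreteness, here is a minimal NumPy sketch of the incremental step with that substitution applied, assuming the signature from section 2.4 (x, prev_K, prev_V and the projection tensors P_q, P_k, P_v, P_o); the einsum equations follow the paper's conventions, but the axis handling and explicit softmax are my own, so treat it as an illustration rather than a transcription of the paper's TF pseudocode.

```python
import numpy as np

def multihead_self_attention_incremental(x, prev_K, prev_V, P_q, P_k, P_v, P_o):
    """One decoding step of multi-head self-attention with a K/V cache.

    Assumed shapes (following the paper's notation):
      x:        [d]        current position's input vector
      prev_K:   [h, m, k]  cached keys for the previous m positions
      prev_V:   [h, m, v]  cached values for the previous m positions
      P_q, P_k: [h, d, k]  query/key projection tensors
      P_v, P_o: [h, d, v]  value/output projection tensors
    """
    q = np.einsum("d,hdk->hk", x, P_q)

    # The paper writes einsum("d,hdk->hk", M, P_k) here, but M is never
    # defined; projecting x (the current input) is the only choice whose
    # shape matches, which is why M looks like a typo for x.
    new_K = np.concatenate([prev_K, np.einsum("d,hdk->hk", x, P_k)[:, None, :]], axis=1)
    new_V = np.concatenate([prev_V, np.einsum("d,hdv->hv", x, P_v)[:, None, :]], axis=1)

    # Attend over all m+1 positions (cache plus the new one).
    logits = np.einsum("hk,hmk->hm", q, new_K)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    o = np.einsum("hm,hmv->hv", weights, new_V)
    y = np.einsum("hv,hdv->d", o, P_o)
    return y, new_K, new_V
```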

