In the Shazeer 2019 paper "Fast Transformer Decoding: One Write-Head is All You Need", section 2.4 gives the MultiheadSelfAttentionIncremental function.
The calculation of new_K (and new_V) uses M, which isn't defined anywhere in the function. That's a typo for x (i.e. s/M/x/), correct?
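Substituting x seems to be the only reading that type-checks: the new key/value row has to be a projection of the current position's [d]-shaped input, and x is the only such vector in scope. For concreteness, here is a minimal NumPy sketch of the incremental step with that substitution applied, assuming the signature from section 2.4 (x, prev_K, prev_V and the projection tensors P_q, P_k, P_v, P_o); the einsum equations follow the paper's conventions, but the axis handling and explicit softmax are my own, so treat it as an illustration rather than a transcription of the paper's TF pseudocode.

```python
import numpy as np

def multihead_self_attention_incremental(x, prev_K, prev_V, P_q, P_k, P_v, P_o):
    """One decoding step of multi-head self-attention with a K/V cache.

    Assumed shapes (following the paper's notation):
      x:        [d]        current position's input vector
      prev_K:   [h, m, k]  cached keys for the previous m positions
      prev_V:   [h, m, v]  cached values for the previous m positions
      P_q, P_k: [h, d, k]  query/key projection tensors
      P_v, P_o: [h, d, v]  value/output projection tensors
    """
    q = np.einsum("d,hdk->hk", x, P_q)

    # The paper writes einsum("d,hdk->hk", M, P_k) here, but M is never
    # defined; projecting x (the current input) is the only choice whose
    # shape matches, which is why M looks like a typo for x.
    new_K = np.concatenate([prev_K, np.einsum("d,hdk->hk", x, P_k)[:, None, :]], axis=1)
    new_V = np.concatenate([prev_V, np.einsum("d,hdv->hv", x, P_v)[:, None, :]], axis=1)

    # Attend over all m+1 positions (cache plus the new one).
    logits = np.einsum("hk,hmk->hm", q, new_K)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    o = np.einsum("hm,hmv->hv", weights, new_V)
    y = np.einsum("hv,hdv->d", o, P_o)
    return y, new_K, new_V
```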

