In Section 2.4 of the Shazeer 2019 paper "Fast Transformer Decoding: One Write-Head is All You Need", there's the MultiheadSelfAttentionIncremental function.
The calculation of new_K (and new_V) refers to M, which isn't defined anywhere in the function. Since x is the single new input vector processed at this decoding step, projecting it with P_k / P_v and appending the result to the cached keys/values seems like the only reading that makes the shapes work out. Is this just a typo, i.e. s/M/x/?
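For reference, here is a minimal runnable sketch of how I read those two updates with x substituted for M. The einsum equations and variable names follow the paper's pseudocode as I understand it, but the toy shapes and the concat/expand_dims axis are my own choices to match the stated shapes (prev_K: [h, m, k], prev_V: [h, m, v]), so this is not a verbatim quote:

```python
import tensorflow as tf

# Toy sizes, just to make the snippet runnable:
# h heads, m cached steps, model dim d, key dim k, value dim v.
h, m, d, k, v = 2, 5, 8, 4, 4
x = tf.random.normal([d])            # the new input vector for this step
prev_K = tf.random.normal([h, m, k])  # cached keys
prev_V = tf.random.normal([h, m, v])  # cached values
P_k = tf.random.normal([h, d, k])     # key projection
P_v = tf.random.normal([h, d, v])     # value projection

# The two updates with x in place of the undefined M:
# project the new input to a per-head key/value and append it to the cache.
new_K = tf.concat(
    [prev_K, tf.expand_dims(tf.einsum("d,hdk->hk", x, P_k), axis=1)], axis=1)
new_V = tf.concat(
    [prev_V, tf.expand_dims(tf.einsum("d,hdv->hv", x, P_v), axis=1)], axis=1)

print(new_K.shape, new_V.shape)  # (2, 6, 4) (2, 6, 4), i.e. [h, m+1, k] and [h, m+1, v]
```

With x in place of M, the shapes line up with the documented return values new_K: [h, m+1, k] and new_V: [h, m+1, v], which is what makes me think the substitution is the intended reading.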