Some observations:
- For most modern LLMs, the size of the KV cache becomes comparable to the number of active parameters at around 500k-1M tokens of context (give or take).
- The number of attention FLOPs becomes comparable to FFN FLOPs at around 10-50k tokens.
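A back-of-the-envelope sketch of where both crossovers could come from, assuming a hypothetical dense Llama-70B-like config (80 layers, GQA with 8 KV heads, head dim 128, SwiGLU FFN dim 28672); the numbers are illustrative assumptions, not measurements:

```python
# Back-of-envelope for both crossovers, using an assumed Llama-70B-like dense config.
n_layers   = 80
d_model    = 8192
n_kv_heads = 8        # grouped-query attention
head_dim   = 128
d_ffn      = 28672    # SwiGLU hidden dim
active_params = 70e9  # dense model: all params are "active"

# KV cache stores K and V (2) per layer, per KV head, per head_dim, per token.
kv_elems_per_token = 2 * n_layers * n_kv_heads * head_dim          # ~164k values/token
kv_crossover = active_params / kv_elems_per_token
print(f"KV cache elements ≈ parameter count at ~{kv_crossover/1e6:.2f}M tokens")

# Per-token forward FLOPs, per layer:
#   FFN (SwiGLU): 3 matmuls of d_model x d_ffn        -> 6 * d_model * d_ffn
#   attention over L cached tokens (QK^T + AV only)   -> 4 * d_model * L
ffn_flops = 6 * d_model * d_ffn
attn_crossover = ffn_flops / (4 * d_model)   # L where 4*d_model*L == ffn_flops
print(f"attention FLOPs ≈ FFN FLOPs at ~{attn_crossover/1e3:.0f}k tokens of context")
```

With these assumed numbers the crossovers land at ~0.43M tokens and ~43k tokens, inside the ranges above.
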
To put it in context (pun not intended):
- Ulysses: ~0.35 megatokens
- C++ standard: ~1.5 megatokens
- English Wikipedia: ~20 gigatokens
From @jxmnop's research we know that LLMs compress information to around 3.6 bits/parameter in their FFN layers. In comparison, the KV cache only stores ~1e-4 bits (!!!!!) per parameter. Looks extraordinarily wasteful!
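Rough arithmetic behind the ~1e-4 figure (my own reconstruction, assuming ~1M tokens of context where the KV cache matches the parameter count, and an order-of-magnitude guess for the information content of a token of text):

```python
# Rough reconstruction of the bits-per-stored-value comparison (assumed numbers).
context_tokens      = 1e6    # ~1M-token context, where KV cache ≈ param count
bits_per_text_token = 12     # order-of-magnitude info content of a token of text (assumption)
kv_cache_values     = 70e9   # ≈ number of parameters at this context length (see sketch above)

info_in_context   = context_tokens * bits_per_text_token     # ~1e7 bits in the prompt itself
bits_per_kv_value = info_in_context / kv_cache_values
print(f"~{bits_per_kv_value:.1e} bits per stored KV value")   # ~1e-4

# vs. ~3.6 bits of memorized information per FFN parameter (per @jxmnop's estimate)
print(f"ratio: ~{3.6 / bits_per_kv_value:.0f}x less information-dense")
```
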
So it seems like, for longer context lengths, it may be reasonable to spend a large amount of compute to squeeze the information out of the KV cache and move it somewhere else.
It is purely speculative of course, but it may be a plausible explanation of why @thinkymachines spends so much effort on building LoRA infra, for example. Would love to see some research on LoRA compression efficiency.
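For scale, a hedged comparison of what a LoRA adapter could hold versus the 1M-token KV cache it might absorb (rank and target modules are arbitrary assumptions for illustration, not anything @thinkymachines has published):

```python
# Size comparison: an assumed LoRA adapter vs. a 1M-token KV cache.
n_layers, d_model, rank = 80, 8192, 64
lora_modules_per_layer  = 4          # e.g. attention q/k/v/o projections (assumption)
lora_params = n_layers * lora_modules_per_layer * 2 * d_model * rank   # A and B matrices
print(f"LoRA params: ~{lora_params/1e6:.0f}M")                          # ~335M

kv_values_at_1m_tokens = 2 * n_layers * 8 * 128 * 1e6                   # from the earlier sketch
print(f"KV values:   ~{kv_values_at_1m_tokens/1e9:.0f}B")               # ~164B
print(f"compression needed: ~{kv_values_at_1m_tokens/lora_params:.0f}x")
```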