Some observations:
For most modern LLMs, the number of values in the KV cache becomes comparable to the number of active parameters at around 500k-1m tokens of context (give or take)
The number of attention flops becomes comparable to FFN flops at around 10-50k tokens (rough numbers for both crossovers below)
To put it in context (pun not intended):
Ulysses: ~0.35 megatokens
C++ standard: ~1.5 megatokens
English Wikipedia: ~20 gigatokens
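Napkin math for the two crossover points above, assuming a hypothetical Llama-3-70B-like config (the layer/head/FFN sizes and vocab below are my assumptions, not measured numbers):

```python
# Back-of-the-envelope sketch, hypothetical Llama-3-70B-like config
# (GQA attention, SwiGLU FFN); all sizes here are assumptions.
n_layers, d_head = 80, 128
n_q_heads, n_kv_heads = 64, 8
d_model, d_ffn = 8192, 28672
total_params = 70e9

# KV cache stores K and V (2 tensors) per layer, n_kv_heads * d_head values per token
kv_values_per_token = 2 * n_layers * n_kv_heads * d_head        # ~164k values
print(total_params / kv_values_per_token)                       # ~430k tokens: cache size ~ param count

# FFN flops per token ~ 2 * FFN params (SwiGLU: 3 weight matrices per layer)
ffn_flops_per_token = 2 * (3 * d_model * d_ffn) * n_layers
# quadratic attention flops per token per unit of context: QK^T + AV, 2 flops per multiply-add
attn_flops_per_token_per_ctx = 4 * n_layers * n_q_heads * d_head
print(ffn_flops_per_token / attn_flops_per_token_per_ctx)       # ~43k tokens: attn flops ~ FFN flops
```

Both land inside the ranges above, so the orders of magnitude check out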
From @jxmnop's research we know that LLMs compress information to around 3.6 bits/parameter in their FFN layers
In comparison, the KV cache stores only ~1e-4 bits (!!!!!) per parameter (napkin math below)
Looks extraordinarily wasteful!
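Where the ~1e-4 roughly comes from: a token can carry at most ~log2(vocab) bits of new information, yet the cache spends ~160k values on it. A sketch, reusing the same hypothetical config (vocab size is also an assumption):

```python
import math

# Upper-bound the information a single token can add: log2(vocab_size) bits.
# Real text carries less, so the true ratio is even worse.
vocab_size = 128_000                          # assumed, Llama-3-like
bits_per_token = math.log2(vocab_size)        # ~17 bits
kv_values_per_token = 2 * 80 * 8 * 128        # K + V values per token, config as above

print(bits_per_token / kv_values_per_token)   # ~1e-4 bits per cached value
# compare: ~3.6 bits/parameter for the weights themselves
```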
So it seems like for longer context lengths it may be reasonable to spend a large amount of compute to squeeze the information out of the KV cache and move it somewhere else
This is purely speculative, of course, but it may be a plausible explanation of why @thinkymachines spends so much effort on building LoRA infra, for example
Would love to see some research on LoRA compression efficiency
