Some observations:
For most modern LLMs, the number of values in the KV cache becomes comparable to the number of active parameters at around 500k-1m tokens of context (give or take)
The number of attention flops becomes comparable to FFN flops at around 10-50k tokens (rough numbers for both crossovers below)
To put it in context (pun not intended):
Ulysses: ~0.35 megatokens
C++ standard: ~1.5 megatokens
English Wikipedia: ~20 gigatokens
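Napkin math for the two crossover points above, assuming a hypothetical Llama-3-70B-like config (the layer/head/FFN sizes and vocab below are my assumptions, not measured numbers):

```python
# Back-of-the-envelope sketch, hypothetical Llama-3-70B-like config
# (GQA attention, SwiGLU FFN); all sizes here are assumptions.
n_layers, d_head = 80, 128
n_q_heads, n_kv_heads = 64, 8
d_model, d_ffn = 8192, 28672
total_params = 70e9

# KV cache stores K and V (2 tensors) per layer, n_kv_heads * d_head values per token
kv_values_per_token = 2 * n_layers * n_kv_heads * d_head        # ~164k values
print(total_params / kv_values_per_token)                       # ~430k tokens: cache size ~ param count

# FFN flops per token ~ 2 * FFN params (SwiGLU: 3 weight matrices per layer)
ffn_flops_per_token = 2 * (3 * d_model * d_ffn) * n_layers
# quadratic attention flops per token per unit of context: QK^T + AV, 2 flops per multiply-add
attn_flops_per_token_per_ctx = 4 * n_layers * n_q_heads * d_head
print(ffn_flops_per_token / attn_flops_per_token_per_ctx)       # ~43k tokens: attn flops ~ FFN flops
```

Both land inside the ranges above, so the orders of magnitude check out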
From @jxmnop's research we know that LLMs compress information to around 3.6 bits/parameter in their FFN layers
In comparison, the KV cache stores only ~1e-4 bits (!!!!!) per parameter (napkin math below)
Looks extraordinarily wasteful!
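Where the ~1e-4 roughly comes from: a token can carry at most ~log2(vocab) bits of new information, yet the cache spends ~160k values on it. A sketch, reusing the same hypothetical config (vocab size is also an assumption):

```python
import math

# Upper-bound the information a single token can add: log2(vocab_size) bits.
# Real text carries less, so the true ratio is even worse.
vocab_size = 128_000                          # assumed, Llama-3-like
bits_per_token = math.log2(vocab_size)        # ~17 bits
kv_values_per_token = 2 * 80 * 8 * 128        # K + V values per token, config as above

print(bits_per_token / kv_values_per_token)   # ~1e-4 bits per cached value
# compare: ~3.6 bits/parameter for the weights themselves
```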
So it seems like for longer context lengths it may be reasonable to spend a large amount of compute to squeeze the information out of the KV cache and move it somewhere else
This is purely speculative, of course, but it may be a plausible explanation of why @thinkymachines spends so much effort on building LoRA infra, for example
Would love to see some research on LoRA compression efficiency
