🚀 "Quantization is not a compromise — it's the next paradigm."
After K2-Thinking's release, many developers have been curious about its native INT4 quantization format.
刘少伟, infra engineer at
@Kimi_Moonshot and Zhihu contributor, shares an insider's view on why this choice matters — and why quantization today isn't just about sacrificing precision for speed.
💡 Key idea
In the context of LLMs, quantization is no longer a trade-off.
With the evolution of parameter scaling and test-time scaling, native low-bit quantization will become a standard paradigm for training large models.
🤔 Why Low-bit Quantization Matters
In modern LLM inference, there are two distinct optimization goals:
• High throughput (cost-oriented): maximize GPU utilization via large batch sizes.
• Low latency (user-oriented): minimize per-query response time.
For Kimi K2's MoE structure (roughly 1/48 sparsity), decoding is memory-bound: the smaller the weights, the less memory traffic per token and the faster each decoding step.
FP8 weights (≈1 TB) already push the limit of what a single GPU node with high-speed interconnect can hold.
⚠️ Switching to W4A16 (4-bit weights, 16-bit activations) cuts latency sharply while preserving quality, a perfect fit for low-latency inference.
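To see why weight size dominates decode latency, here is a back-of-the-envelope sketch in Python. The activated-parameter count, per-GPU bandwidth, and node size are illustrative assumptions, not official K2 numbers; the point is simply that per-token time scales with the bytes of weights streamed per step.

```python
# Back-of-the-envelope decode latency for a memory-bound MoE model.
# All constants below are illustrative assumptions, not official K2 figures.

ACTIVATED_PARAMS = 32e9     # parameters read per decoded token (sparse MoE activation, assumed)
HBM_BANDWIDTH    = 3.0e12   # bytes/s of HBM bandwidth per GPU (assumed)
NUM_GPUS         = 8        # GPUs in one high-speed-interconnect node (assumed)

def per_token_latency_ms(bytes_per_param: float) -> float:
    """Lower-bound time to stream the activated weights once per decode step."""
    bytes_to_read = ACTIVATED_PARAMS * bytes_per_param
    return bytes_to_read / (HBM_BANDWIDTH * NUM_GPUS) * 1e3

for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("INT4 (W4A16)", 0.5)]:
    print(f"{name:>13}: ~{per_token_latency_ms(bpp):.2f} ms/token (weight reads only)")
```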
🔍 Why QAT over PTQ
Post-training quantization (PTQ) worked well for shorter generations, but failed in longer reasoning chains:
• Error accumulation during long decoding degraded precision.
• Dependence on calibration data caused "expert distortion" in sparse MoE layers, since rarely activated experts are barely represented in the calibration set.
‼️ K2-Thinking therefore adopted quantization-aware training (QAT) for minimal loss and more stable reasoning over long chains.
🧠 How it works
K2-Thinking uses weight-only QAT with fake quantization plus an STE (straight-through estimator): the forward pass sees INT4-quantized weights, while gradients flow back as if the quantizer were the identity.
The pipeline, from QAT training to INT4 inference to RL rollout, was fully integrated in just days, enabling near-lossless results without extra tokens or retraining.
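For readers who want to see the mechanism, below is a minimal PyTorch sketch of per-group fake quantization with an STE. The symmetric INT4 grid and group size of 32 are assumptions based on the format discussed in this post, not Moonshot's actual training code.

```python
import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Weight-only fake quantization: round weights to a symmetric INT4 grid
    (one scale per group of `group_size` values), dequantize back to the original
    dtype, and let gradients pass straight through (STE). Illustrative sketch only."""
    wg = w.reshape(-1, group_size)                 # assumes numel is divisible by group_size
    scale = wg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (wg / scale).round().clamp(-8, 7)          # INT4 integer grid [-8, 7]
    deq = (q * scale).reshape(w.shape)
    # Straight-through estimator: forward uses `deq`, backward treats the
    # quantizer as the identity so `w` still receives full-precision gradients.
    return w + (deq - w).detach()

# Hypothetical usage inside a linear layer's forward pass:
#   w_q = fake_quant_int4(self.weight)
#   out = x @ w_q.t()
```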
⚡ INT4's hidden advantage in RL
Few people mention this: native INT4 doesn't just speed up inference — it accelerates RL training itself.
Because RL rollouts are often bottlenecked by a "long tail" of very long generations, INT4's low-latency decoding speeds up exactly those stages (see the toy sketch below).
In practice, each RL iteration runs 10-20% faster end-to-end.
Moreover, quantized RL brings stability: the smaller representational space reduces accumulated error, improving learning robustness.
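As a toy illustration of the long-tail point: in a synchronous RL step, the rollout stage waits for the slowest sample, so cutting per-token decode latency shrinks the tail directly. All the numbers below (generation lengths, per-token latencies) are made up for illustration.

```python
import random

random.seed(0)
# Synthetic, heavy-tailed generation lengths (tokens) to mimic long-tail rollouts.
lengths = [min(int(random.paretovariate(1.5) * 2000), 64000) for _ in range(256)]

def rollout_stage_seconds(decode_ms_per_token: float) -> float:
    """A synchronous RL step waits for the slowest rollout, so the tail dominates."""
    return max(n * decode_ms_per_token for n in lengths) / 1000

print(f"FP8  rollout stage: ~{rollout_stage_seconds(1.2):.0f} s")  # assumed ms/token
print(f"INT4 rollout stage: ~{rollout_stage_seconds(0.7):.0f} s")  # assumed ms/token
```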
🔩 Why INT4, not MXFP4
Kimi chose INT4 over "fancier" MXFP4/NVFP4 to better support non-Blackwell GPUs, with strong existing kernel support (e.g., Marlin).
With a quantization group size of 32 (one scale per 32 weights), INT4 is comparable to the FP4 formats in expressiveness while being more hardware-adaptable.
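For intuition on what the 1×32 granularity costs in storage, here is a rough footprint comparison for a ~1T-parameter model, assuming FP16 per-group scales; the numbers are illustrative, not Kimi's exact layout.

```python
# Rough weight-footprint comparison for a ~1T-parameter model (illustrative only).
PARAMS = 1e12
GROUP_SIZE = 32           # one scale per 32 weights, as discussed above
SCALE_BYTES = 2.0         # assume FP16 scales

fp8_bytes  = PARAMS * 1.0
int4_bytes = PARAMS * 0.5 + (PARAMS / GROUP_SIZE) * SCALE_BYTES

print(f"FP8  weights: ~{fp8_bytes / 1e12:.2f} TB")
print(f"INT4 weights: ~{int4_bytes / 1e12:.2f} TB (including per-group scales)")
```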
🧭 Looking forward
W4A16 is just the beginning — W4A8 and even W4A4 are on the horizon.
As new chips roll out with FP4-native operators, Kimi's quantization path will continue evolving.
"In the LLM age, quantization stands alongside SOTA and Frontier.
It's not a patch — it's how we'll reach the frontier faster."
📖 Full article (in Chinese):
zhihu.com/question/196955840…
#KimiK2Thinking #INT4 #Quantization #LLM #Infra #RLHF