Model quantization compresses neural networks by representing weights (and often activations) with lower-precision numbers (e.g., INT8, INT4) instead of FP16/FP32. With proper calibration or training, it preserves accuracy while reducing memory, bandwidth, and latency. For LLMs, quantization enables cheaper serving, larger context windows, and deployment on smaller GPUs/CPUs/NPUs.
What is Model Quantization?
Quantization maps floating-point tensors to integers using scale and zero-point parameters. Post-training quantization (PTQ) uses a small calibration set to estimate ranges; quantization-aware training (QAT) simulates quantization during training to recover accuracy. Weight-only schemes (e.g., GPTQ, AWQ) quantize weights while keeping activations higher precision; activation-aware methods (e.g., SmoothQuant) shift dynamic range to weights for INT8 end-to-end. Designs vary by granularity (per-tensor/per-channel/per-group), symmetry, and rounding, with group-wise INT4/INT8 common in LLMs.
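The affine mapping above can be sketched in a few lines. This is a minimal per-tensor, unsigned INT8 example in pure Python (function names are illustrative, not from any particular library); real kernels vectorize this and often work per-channel or per-group.

```python
# Affine (asymmetric) quantization: q = clamp(round(x / scale) + zero_point).
# Per-tensor, unsigned INT8 for illustration; names here are hypothetical.

def quant_params(xmin, xmax, num_bits=8):
    """Derive scale and zero-point so the integer grid covers [xmin, xmax]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)  # range must include 0.0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale))
    return scale, zero_point

def quantize(xs, scale, zero_point, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    return [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in xs]

def dequantize(qs, scale, zero_point):
    return [(q - zero_point) * scale for q in qs]

xs = [-1.0, -0.5, 0.0, 0.7, 1.5]
scale, zp = quant_params(min(xs), max(xs))
qs = quantize(xs, scale, zp)        # integers in [0, 255]
xhat = dequantize(qs, scale, zp)    # reconstruction error is at most scale/2
```

Because zero maps exactly to `zero_point`, zero-padding and ReLU outputs survive quantization without error, which is one reason the asymmetric form is standard.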
Why it matters and where it’s used
Quantization shrinks memory footprint and memory traffic—the dominant bottlenecks at inference—boosting tokens/sec and lowering cost. It enables on-device/edge use, multi-tenant serving, and longer contexts by fitting KV caches and weights on limited VRAM. It composes with FlashAttention, paged attention, LoRA adapters, and speculative decoding.
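The memory savings are easy to estimate. A back-of-envelope sketch for a hypothetical 7B-parameter model, assuming group-wise scales (group size 128, FP16 scales) as overhead for the quantized variants:

```python
# Rough weight-memory estimate per bitwidth. Group size and scale width are
# assumptions for illustration; real formats add zero-points and packing details.

def weight_gib(n_params, bits, group_size=None, scale_bytes=2):
    total_bytes = n_params * bits / 8
    if group_size:  # one FP16 scale per group of weights
        total_bytes += (n_params / group_size) * scale_bytes
    return total_bytes / 2 ** 30

n = 7_000_000_000
fp16 = weight_gib(n, 16)        # ~13.0 GiB
int8 = weight_gib(n, 8, 128)    # ~6.6 GiB
int4 = weight_gib(n, 4, 128)    # ~3.4 GiB
```

Since decode-time throughput is roughly proportional to bytes read per token, the INT4 variant also implies roughly 4x less weight traffic per forward pass, not just a smaller resident footprint.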
Examples
- INT8 activation + weight quantization for production chat serving at lower latency.
- Weight-only INT4 (GPTQ/AWQ) for 7–70B LLMs on a single high-memory GPU.
- KV-cache quantization to host more concurrent sessions and extended context windows.
- QAT for vision transformers on edge NPUs to recover accuracy under INT8.
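The weight-only INT4 case in the list above hinges on group-wise scaling: one scale per small group of weights, so a single outlier only distorts its own group. A pure-Python sketch of symmetric group-wise quantization in the spirit of GPTQ/AWQ layouts (tiny group size for readability; 64–128 is typical in practice):

```python
# Group-wise symmetric INT4: one scale per group, values clamped to [-8, 7].
# Illustrative only; production kernels pack two 4-bit values per byte.

def quantize_groupwise(w, group_size=4, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scales, q = [], []
    for i in range(0, len(w), group_size):
        group = w[i:i + group_size]
        s = max(abs(x) for x in group) / qmax or 1.0  # avoid 0 for all-zero groups
        scales.append(s)
        q.extend(max(-qmax - 1, min(qmax, round(x / s))) for x in group)
    return q, scales

def dequantize_groupwise(q, scales, group_size=4):
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]

# The large 2.0 outlier only inflates the scale of its own group.
w = [0.1, -0.2, 0.05, 0.4, 2.0, -1.0, 0.5, 0.25]
q, scales = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scales)
```

GPTQ additionally chooses the rounded values to minimize layer output error rather than rounding to nearest, and AWQ rescales salient channels before quantizing, but both produce this kind of grouped integer-plus-scale layout.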
FAQs
- PTQ vs QAT? PTQ calibrates ranges on a small dataset after training; QAT simulates (fake) quantization during training so scales and weights adapt, yielding higher quality at aggressive bitwidths (INT4 and below).
- 8-bit vs 4-bit trade-offs? INT8 usually preserves accuracy broadly; INT4 offers bigger savings but needs careful per-channel/group quant and may degrade quality.
- Hardware support? NVIDIA tensor cores, CPU vector/matrix extensions (e.g., AVX-512 VNNI, AMX), and mobile/edge NPUs accelerate INT8 (and increasingly INT4) compute; PyTorch, ONNX Runtime, and TensorRT expose the corresponding kernels.
- Can I quantize only weights? Yes—weight-only is common for LLMs; add KV-cache quantization to further cut memory.
- Does quantization work with LoRA? Yes; freeze quantized base weights and train small FP16 adapters; deploy merged or as runtime adapters.
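The PTQ-vs-QAT distinction in the FAQ comes down to two small pieces: a range observer that watches calibration data (PTQ), and a fake-quant op that rounds to the integer grid but stays in floating point so training "feels" the quantization error (QAT). A minimal sketch, with class and function names invented for illustration:

```python
# PTQ: observe calibration ranges, then derive scale/zero-point.
class MinMaxObserver:
    """Track running min/max over calibration batches (PTQ-style)."""
    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, xs):
        self.lo = min(self.lo, min(xs))
        self.hi = max(self.hi, max(xs))

    def qparams(self, qmin=-128, qmax=127):
        scale = (self.hi - self.lo) / (qmax - qmin)
        zero_point = int(round(qmin - self.lo / scale))
        return scale, zero_point

# QAT: quantize-then-dequantize in the forward pass; in a real framework the
# gradient passes straight through the round() (straight-through estimator).
def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

obs = MinMaxObserver()
obs.observe([-2.0, 1.0])   # calibration batch 1
obs.observe([0.5, 3.0])    # calibration batch 2
scale, zp = obs.qparams()  # covers [-2.0, 3.0]
```

Production observers usually use a smoothed range (moving averages, percentiles, or MSE-optimal clipping) instead of raw min/max, since a single outlier activation can otherwise waste most of the integer grid.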
