Model Quantization


Model quantization compresses neural networks by representing weights (and often activations) with lower-precision numbers (e.g., INT8, INT4) instead of FP16/FP32. With proper calibration or training, it largely preserves accuracy while reducing memory, bandwidth, and latency. For LLMs, quantization enables cheaper serving, larger context windows, and deployment on smaller GPUs/CPUs/NPUs.

What is Model Quantization?

Quantization maps floating-point tensors to integers using scale and zero-point parameters. Post-training quantization (PTQ) uses a small calibration set to estimate ranges; quantization-aware training (QAT) simulates quantization during training to recover accuracy. Weight-only schemes (e.g., GPTQ, AWQ) quantize weights while keeping activations higher precision; activation-aware methods (e.g., SmoothQuant) shift dynamic range to weights for INT8 end-to-end. Designs vary by granularity (per-tensor/per-channel/per-group), symmetry, and rounding, with group-wise INT4/INT8 common in LLMs.
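
The affine mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's API; the function names are hypothetical:

```python
import numpy as np

def quantize_int8(x):
    """Asymmetric (affine) INT8 quantization: q = round(x / scale) + zero_point."""
    scale = (x.max() - x.min()) / 255.0                  # spread the FP range over 256 levels
    zero_point = int(np.round(-128 - x.min() / scale))   # integer code that represents x = 0
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original tensor."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.random.default_rng(0).normal(size=(4, 8)).astype(np.float32)
q, scale, zp = quantize_int8(x)
x_hat = dequantize(q, scale, zp)
# per-element reconstruction error is bounded by roughly scale / 2
```

PTQ estimates `scale` and `zero_point` from a calibration set instead of a single tensor's min/max; per-channel and per-group variants compute one scale per slice rather than one per tensor.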

Why it matters and where it’s used

Quantization shrinks memory footprint and memory traffic, the dominant bottlenecks at inference, which boosts tokens/sec and lowers cost. It enables on-device/edge use, multi-tenant serving, and longer contexts by fitting KV caches and weights on limited VRAM. It composes with FlashAttention, paged attention, LoRA adapters, and speculative decoding.
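
To make the footprint argument concrete, here is a back-of-the-envelope estimate for a hypothetical 7B-parameter model; the group size and scale precision are illustrative assumptions:

```python
def weight_memory_gb(n_params, bits, group_size=None, scale_bits=16):
    """Approximate weight storage, counting per-group FP16 scales for grouped schemes."""
    weight_bytes = n_params * bits / 8
    scale_bytes = 0 if group_size is None else (n_params / group_size) * scale_bits / 8
    return (weight_bytes + scale_bytes) / 1e9

# hypothetical 7B model: FP16 vs INT8 vs group-wise INT4 (group size 128)
print(f"FP16: {weight_memory_gb(7e9, 16):.1f} GB")        # ~14.0 GB
print(f"INT8: {weight_memory_gb(7e9, 8):.1f} GB")         # ~7.0 GB
print(f"INT4: {weight_memory_gb(7e9, 4, 128):.1f} GB")    # ~3.6 GB incl. scales
```

The same arithmetic applies to the KV cache, which scales with batch size and context length rather than parameter count, which is why KV-cache quantization matters for long contexts.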

Examples

  • INT8 activation + weight quantization for production chat serving at lower latency.
  • Weight-only INT4 (GPTQ/AWQ) for 7–70B LLMs on a single high-memory GPU.
  • KV-cache quantization to host more concurrent sessions and extended context windows.
  • QAT for vision transformers on edge NPUs to recover accuracy under INT8.
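
A minimal sketch of the group-wise, weight-only INT4 scheme the examples mention, using plain symmetric rounding; real GPTQ/AWQ add error-compensating weight updates or activation-aware scaling on top of this:

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric INT4, per-group: one FP scale per group of `group_size` weights."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # symmetric range [-7, 7]
    scales = np.maximum(scales, 1e-8)                          # guard all-zero groups
    q = np.round(groups / scales).astype(np.int8)              # packed two-per-byte in practice
    return q, scales

def dequantize_int4_groupwise(q, scales, shape):
    """Expand INT4 codes back to an FP32 approximation of the weights."""
    return (q.astype(np.float32) * scales).reshape(shape)
```

Because each group carries its own scale, a single outlier weight only inflates the quantization error of its `group_size` neighbors rather than of the whole tensor.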

FAQs

  • PTQ vs QAT? PTQ calibrates scales after training on a small dataset; QAT inserts fake quantization during training so the model learns robust scales and weights, which holds up better at aggressive bitwidths.
  • 8-bit vs 4-bit trade-offs? INT8 usually preserves accuracy across most models; INT4 offers bigger savings but needs careful per-channel/per-group quantization and may degrade quality.
  • Hardware support? CUDA tensor cores, CPU AVX/AMX extensions, and mobile/edge NPUs accelerate INT8/INT4; frameworks expose these kernels via PyTorch, ONNX Runtime, and TensorRT.
  • Can I quantize only weights? Yes; weight-only quantization is common for LLMs, and adding KV-cache quantization cuts memory further.
  • Does quantization work with LoRA? Yes: freeze the quantized base weights and train small FP16 adapters, then deploy them merged or as runtime adapters.
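
The per-channel point in the 4-bit FAQ can be demonstrated directly: a single outlier channel blows up a per-tensor scale, while per-channel scales contain the damage. The data and shapes below are synthetic, chosen only to exhibit the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
w[:, 0] *= 50.0  # one outlier output channel inflates the global range

def int4_error(w, scales):
    """Mean absolute reconstruction error under symmetric INT4 with given scales."""
    q = np.clip(np.round(w / scales), -8, 7)
    return float(np.abs(w - q * scales).mean())

per_tensor = np.abs(w).max() / 7.0                        # one scale for the whole tensor
per_channel = np.abs(w).max(axis=0, keepdims=True) / 7.0  # one scale per column

err_tensor = int4_error(w, per_tensor)
err_channel = int4_error(w, per_channel)
# per-channel error is far smaller: the outlier no longer sets every channel's scale
```

Per-group quantization takes the same idea one step finer, which is why it is the default for INT4 LLM weights.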