Model quantization compresses neural networks by representing weights (and often activations) with lower-precision numbers (e.g., INT8, INT4) instead of FP16/FP32. With proper calibration or training, it preserves accuracy while reducing memory, bandwidth, and latency. For LLMs, quantization enables cheaper serving, larger context windows, and deployment on smaller GPUs/CPUs/NPUs.
What is Model Quantization?
Quantization maps floating-point tensors to integers using scale and zero-point parameters. Post-training quantization (PTQ) uses a small calibration set to estimate ranges; quantization-aware training (QAT) simulates quantization during training to recover accuracy. Weight-only schemes (e.g., GPTQ, AWQ) quantize weights while keeping activations higher precision; activation-aware methods (e.g., SmoothQuant) shift dynamic range to weights for INT8 end-to-end. Designs vary by granularity (per-tensor/per-channel/per-group), symmetry, and rounding, with group-wise INT4/INT8 common in LLMs.
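The affine mapping above can be sketched in a few lines. This is a minimal per-tensor, unsigned INT8 example in pure Python (function names are illustrative, not from any particular library); real kernels vectorize this and often work per-channel or per-group.

```python
# Affine (asymmetric) quantization: q = clamp(round(x / scale) + zero_point).
# Per-tensor, unsigned INT8 for illustration; names here are hypothetical.

def quant_params(xmin, xmax, num_bits=8):
    """Derive scale and zero-point so the integer grid covers [xmin, xmax]."""
    qmin, qmax = 0, 2 ** num_bits - 1
    xmin, xmax = min(xmin, 0.0), max(xmax, 0.0)  # range must include 0.0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale))
    return scale, zero_point

def quantize(xs, scale, zero_point, num_bits=8):
    qmin, qmax = 0, 2 ** num_bits - 1
    return [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in xs]

def dequantize(qs, scale, zero_point):
    return [(q - zero_point) * scale for q in qs]

xs = [-1.0, -0.5, 0.0, 0.7, 1.5]
scale, zp = quant_params(min(xs), max(xs))
qs = quantize(xs, scale, zp)        # integers in [0, 255]
xhat = dequantize(qs, scale, zp)    # reconstruction error is at most scale/2
```

Because zero maps exactly to `zero_point`, zero-padding and ReLU outputs survive quantization without error, which is one reason the asymmetric form is standard.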
Why it matters and where it’s used
Quantization shrinks memory footprint and memory traffic—the dominant bottlenecks at inference—boosting tokens/sec and lowering cost. It enables on-device/edge use, multi-tenant serving, and longer contexts by fitting KV caches and weights on limited VRAM. It composes with FlashAttention, paged attention, LoRA adapters, and speculative decoding.
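The memory savings are easy to estimate. A back-of-envelope sketch for a hypothetical 7B-parameter model, assuming group-wise scales (group size 128, FP16 scales) as overhead for the quantized variants:

```python
# Rough weight-memory estimate per bitwidth. Group size and scale width are
# assumptions for illustration; real formats add zero-points and packing details.

def weight_gib(n_params, bits, group_size=None, scale_bytes=2):
    total_bytes = n_params * bits / 8
    if group_size:  # one FP16 scale per group of weights
        total_bytes += (n_params / group_size) * scale_bytes
    return total_bytes / 2 ** 30

n = 7_000_000_000
fp16 = weight_gib(n, 16)        # ~13.0 GiB
int8 = weight_gib(n, 8, 128)    # ~6.6 GiB
int4 = weight_gib(n, 4, 128)    # ~3.4 GiB
```

Since decode-time throughput is roughly proportional to bytes read per token, the INT4 variant also implies roughly 4x less weight traffic per forward pass, not just a smaller resident footprint.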
Examples
- INT8 activation + weight quantization for production chat serving at lower latency.
- Weight-only INT4 (GPTQ/AWQ) for 7–70B LLMs on a single high-memory GPU.
- KV-cache quantization to host more concurrent sessions and extended context windows.
- QAT for vision transformers on edge NPUs to recover accuracy under INT8.
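The weight-only INT4 case in the list above hinges on group-wise scaling: one scale per small group of weights, so a single outlier only distorts its own group. A pure-Python sketch of symmetric group-wise quantization in the spirit of GPTQ/AWQ layouts (tiny group size for readability; 64–128 is typical in practice):

```python
# Group-wise symmetric INT4: one scale per group, values clamped to [-8, 7].
# Illustrative only; production kernels pack two 4-bit values per byte.

def quantize_groupwise(w, group_size=4, bits=4):
    qmax = 2 ** (bits - 1) - 1
    scales, q = [], []
    for i in range(0, len(w), group_size):
        group = w[i:i + group_size]
        s = max(abs(x) for x in group) / qmax or 1.0  # avoid 0 for all-zero groups
        scales.append(s)
        q.extend(max(-qmax - 1, min(qmax, round(x / s))) for x in group)
    return q, scales

def dequantize_groupwise(q, scales, group_size=4):
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]

# The large 2.0 outlier only inflates the scale of its own group.
w = [0.1, -0.2, 0.05, 0.4, 2.0, -1.0, 0.5, 0.25]
q, scales = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scales)
```

GPTQ additionally chooses the rounded values to minimize layer output error rather than rounding to nearest, and AWQ rescales salient channels before quantizing, but both produce this kind of grouped integer-plus-scale layout.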
FAQs
- PTQ vs QAT? PTQ calibrates ranges on a small dataset after training; QAT simulates (fake) quantization during training so scales and weights adapt, yielding higher quality at aggressive bitwidths (INT4 and below).
- 8-bit vs 4-bit trade-offs? INT8 usually preserves accuracy broadly; INT4 offers bigger savings but needs careful per-channel/group quant and may degrade quality.
- Hardware support? NVIDIA tensor cores, CPU vector/matrix extensions (e.g., AVX-512 VNNI, AMX), and mobile/edge NPUs accelerate INT8 (and increasingly INT4) compute; PyTorch, ONNX Runtime, and TensorRT expose the corresponding kernels.
- Can I quantize only weights? Yes—weight-only is common for LLMs; add KV-cache quantization to further cut memory.
- Does quantization work with LoRA? Yes; freeze quantized base weights and train small FP16 adapters; deploy merged or as runtime adapters.
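The PTQ-vs-QAT distinction in the FAQ comes down to two small pieces: a range observer that watches calibration data (PTQ), and a fake-quant op that rounds to the integer grid but stays in floating point so training "feels" the quantization error (QAT). A minimal sketch, with class and function names invented for illustration:

```python
# PTQ: observe calibration ranges, then derive scale/zero-point.
class MinMaxObserver:
    """Track running min/max over calibration batches (PTQ-style)."""
    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, xs):
        self.lo = min(self.lo, min(xs))
        self.hi = max(self.hi, max(xs))

    def qparams(self, qmin=-128, qmax=127):
        scale = (self.hi - self.lo) / (qmax - qmin)
        zero_point = int(round(qmin - self.lo / scale))
        return scale, zero_point

# QAT: quantize-then-dequantize in the forward pass; in a real framework the
# gradient passes straight through the round() (straight-through estimator).
def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale

obs = MinMaxObserver()
obs.observe([-2.0, 1.0])   # calibration batch 1
obs.observe([0.5, 3.0])    # calibration batch 2
scale, zp = obs.qparams()  # covers [-2.0, 3.0]
```

Production observers usually use a smoothed range (moving averages, percentiles, or MSE-optimal clipping) instead of raw min/max, since a single outlier activation can otherwise waste most of the integer grid.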
