Grouped-Query Attention (GQA)


Grouped-Query Attention (GQA) is an attention variant for Transformers in which multiple query heads share a smaller set of key/value (K/V) heads. By grouping query heads to reuse K/V projections, GQA reduces the memory footprint and bandwidth demands of the K/V cache, improving throughput and enabling longer context windows with minimal quality loss relative to standard multi-head attention (MHA). Multi-Query Attention (MQA) is the extreme case, with a single K/V head shared by all query heads.

What is Grouped-Query Attention (GQA)?

GQA balances the trade-off between multi-head attention (a distinct K/V pair per head) and MQA (one K/V pair shared by all heads). In a layer with H query heads and G groups (G ≪ H), the query heads are partitioned into groups; each group attends over one shared K and V projection while every head keeps its own query projection. This shrinks the K/V tensors, and therefore the KV cache, by a factor of roughly H/G, directly reducing memory traffic during autoregressive decoding. In practice, GQA pairs well with rotary or ALiBi position encodings, FlashAttention kernels, paged KV allocators, and quantization. Quality typically remains close to MHA, while latency and VRAM usage approach MQA. Many modern LLMs train with GQA from scratch; post-hoc conversion of an MHA checkpoint requires care because the K/V weights change shape and distribution.
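The grouping described above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation for a single sequence (no batching, masking, or position encodings); the function name and shapes are choices for this example, not a specific library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """Grouped-query attention over one sequence.

    q: (H, T, d)  -- H query heads, T tokens, head dim d
    k, v: (G, T, d) -- G shared K/V heads, where G divides H
    """
    H, T, d = q.shape
    G = k.shape[0]
    assert H % G == 0, "query heads must partition evenly into groups"
    heads_per_group = H // G
    # Query head h uses the K/V of group h // heads_per_group.
    # Broadcasting by repetition keeps the math identical to MHA,
    # but only G (not H) K/V heads are stored and read.
    k_rep = np.repeat(k, heads_per_group, axis=0)  # (H, T, d)
    v_rep = np.repeat(v, heads_per_group, axis=0)  # (H, T, d)
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)  # (H, T, T)
    return softmax(scores) @ v_rep  # (H, T, d)
```

Setting G = H recovers standard MHA, and G = 1 recovers MQA; the attention computation itself is unchanged, only how many K/V projections back it.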

Why it matters and where it’s used

Serving LLMs is bandwidth-bound at long contexts. GQA shrinks per-token reads/writes and KV cache size, enabling longer windows, higher batch throughput, and multi-tenant concurrency. It’s widely used in production chat, RAG assistants, and code copilots where steady latency and cost matter.

Examples

  • 32-head MHA → GQA with 8 K/V groups: ~4× smaller KV cache, near-MQA latency with better quality retention.
  • Long-context tier: combine GQA with paged attention to keep hot KV pages on GPU and spill cold spans.
  • Edge/SLM serving: INT8/INT4 weight + KV quantization plus GQA to fit 7B models on small GPUs.
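The first example's ~4× figure is simple arithmetic on the KV cache size. The sketch below uses an illustrative 32-head, 32-layer configuration with fp16 cache entries (hypothetical numbers chosen for round results, not a specific model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Factor of 2 covers K and V; each layer caches a
    # (batch, n_kv_heads, seq_len, head_dim) tensor for each.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical config: 32 layers, head_dim 128, 8192-token context, fp16.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192, batch=1)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=8192, batch=1)
print(f"MHA: {mha / 2**30:.1f} GiB, GQA (8 groups): {gqa / 2**30:.1f} GiB")
```

Only n_kv_heads changes between the two calls, so the cache shrinks by exactly H/G = 32/8 = 4×; the freed memory can go to longer contexts or larger batches.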

FAQs

  • How is GQA different from MQA? GQA shares K/V across groups (e.g., 8 groups), while MQA shares a single K/V across all heads; GQA usually preserves quality better.
  • Do I need to retrain? Ideally yes—models are trained with GQA. Converting an MHA model to GQA without fine-tuning can degrade quality.
  • Does GQA change math/outputs? It changes parameterization but not the attention mechanism; reductions come from fewer K/V projections.
  • Is GQA compatible with speculative decoding and continuous batching? Yes—GQA reduces KV bandwidth, which complements these serving optimizations.
  • Any pitfalls? Mismatch between trained positional encodings and grouping, imbalanced head-to-group mapping, and naive conversions without calibration.