Key-Value Cache (KV Cache)


A Key-Value (KV) Cache is a Transformer inference optimization that stores the projected attention keys and values for previously processed tokens so subsequent decoding steps attend only to cached history rather than recomputing it. KV caching dramatically reduces per-token compute and memory traffic, enabling low latency, high throughput, and longer contexts in production LLM serving.

What is Key-Value Cache (KV Cache)?

In decoder-only Transformers, each layer/head forms K and V tensors from token embeddings. During autoregressive generation at step t, the model needs K,V for tokens 1…t to compute attention from the new query Qt. The KV Cache persists those tensors across steps, eliminating repeated forward passes over the prompt. Practical servers shard caches by layer/head, pack variable-length sequences efficiently, and track strides/masks for causal attention. Advanced systems support multi-query/grouped-query attention (to reduce KV size), quantize KV (8/4-bit), and spill or page cold segments between GPU and CPU while keeping hot spans on device.
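The mechanism above can be sketched in a few lines. This is a toy single-head example (random projection matrices, no real model): at each decode step only the new token's K and V are computed and appended, while attention runs over the full cached history.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8  # toy head dimension
# Toy projection matrices for one attention head (assumed, not a real model).
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one entry per decoded token

def decode_step(x_t):
    """Attend the new token's query to all cached keys/values."""
    q = x_t @ Wq
    k_cache.append(x_t @ Wk)  # only the NEW token's K/V are computed
    v_cache.append(x_t @ Wv)
    K = np.stack(k_cache)  # (t, d): cached history plus current token
    V = np.stack(v_cache)
    scores = softmax(q @ K.T / np.sqrt(d))
    return scores @ V  # attention output for step t

# Feed three tokens one at a time; K/V for earlier tokens are reused, not recomputed.
outs = [decode_step(rng.standard_normal(d)) for _ in range(3)]
```

A real server does this per layer and per head, with batching, masking, and device placement on top, but the reuse pattern is the same.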

Why it matters and where it’s used

KV caching lifts tokens-per-second and caps p95 latency by avoiding redundant work. It enables continuous batching (admitting new requests each step), long-context tiers, and multi-turn chats where histories remain resident. It is foundational for RAG assistants, code copilots, and agents that loop over tool calls.

Examples

  • Streamed chat: prefill once, then decode each token with only incremental work per layer (the new token’s projections plus attention over cached K,V) instead of re-running the prompt.
  • Continuous batching: allocate reusable cache slots to serve heterogeneous sequence lengths.
  • Long-context: keep active KV on GPU; page older spans to CPU/NVMe to extend window.
  • Multimodal: retain visual token KV across turns for image-grounded dialog.
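The "cache slots" and "paging" ideas in the examples above can be sketched with a tiny page allocator. The names (`PAGE_TOKENS`, `PagedAllocator`) are hypothetical; production systems such as vLLM use more elaborate block tables, but the core idea is the same: fixed-size pages mean any free page fits any sequence, so mixed lengths do not fragment memory.

```python
PAGE_TOKENS = 16  # tokens of KV stored per fixed-size page (assumed value)

class PagedAllocator:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))  # pool of free page ids
        self.tables = {}                    # seq_id -> list of page ids

    def ensure_capacity(self, seq_id, seq_len):
        """Allocate pages so seq_len tokens of KV fit; reuse existing pages."""
        table = self.tables.setdefault(seq_id, [])
        needed = -(-seq_len // PAGE_TOKENS)  # ceiling division
        while len(table) < needed:
            table.append(self.free.pop())    # any free page works: no fragmentation
        return table

    def release(self, seq_id):
        """Return a finished sequence's pages to the pool for new requests."""
        self.free.extend(self.tables.pop(seq_id, []))

alloc = PagedAllocator(num_pages=8)
alloc.ensure_capacity("req-A", 20)  # 20 tokens -> 2 pages
alloc.ensure_capacity("req-B", 5)   # 5 tokens  -> 1 page
alloc.release("req-A")              # pages immediately reusable by admitted requests
```

This is what makes continuous batching practical: a request finishing mid-batch frees pages that the next admitted request can claim on the very next step.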

FAQs

  • Does KV caching change outputs? No—math is unchanged; it reuses computed K,V tensors.
  • How large is the cache? Roughly: 2 (K and V) × layers × KV heads × head_dim × seq_len × bytes per element, per sequence; long contexts dominate memory.
  • Can I quantize the cache? Yes—8/4-bit KV often preserves quality with big memory savings; monitor latency/accuracy.
  • KV vs paged attention? Paged attention organizes KV into fixed-size pages to reduce fragmentation and enable continuous batching.
  • Any pitfalls? Fragmentation, poor packing of mixed lengths, and spill stalls. Use page allocators, telemetry, and careful packing.
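The sizing formula from the FAQ can be made concrete. The configuration below is illustrative of a 7B-class model (32 layers, 32 KV heads, head_dim 128, fp16), not any specific model’s published spec:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    # Factor of 2 accounts for storing both the K and the V tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 7B-class config (assumed values), fp16 = 2 bytes per element.
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                      seq_len=4096, bytes_per_elem=2)
print(size / 2**30)  # 2.0 GiB per sequence at a 4096-token context
```

Note how every term is linear: halving KV heads via grouped-query attention, or halving precision via 8-bit KV, each halves the footprint.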