Paged Attention is a memory-management technique for serving large language models that stores key–value (KV) caches in fixed-size "pages" and schedules them dynamically across GPU and CPU memory. By paging, pinning, and reusing cache blocks, it reduces fragmentation, enables efficient continuous batching, and supports longer contexts with stable latency.
What is Paged Attention?
Paged Attention treats the KV cache as a virtual memory space. During prefill and decode, the model writes and reads KV tensors in page-aligned blocks that can be swapped, pinned, and reclaimed without copying entire sequences. A runtime allocator tracks per-request pages and compacts free space to avoid fragmentation as requests join and leave the batch. Combined with block-aware attention kernels that gather keys and values through a page table, this enables high-throughput serving with heterogeneous sequence lengths, speculative decoding, and tensor-parallel sharding. Compared with naive contiguous buffers, paging improves memory utilization and avoids out-of-memory failures when many variable-length sequences are in flight.
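The allocator described above can be sketched in a few lines. This is a minimal illustration, not any framework's actual API: the class name, method names, and the flat list of physical block IDs are all assumptions made for clarity.

```python
# Minimal sketch of a paged KV-cache allocator: a fixed pool of physical
# blocks plus a per-request page table mapping logical pages to physical
# ones. Names are illustrative, not taken from any real serving framework.

class PagedKVAllocator:
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size                 # tokens per page
        self.free_blocks = list(range(num_blocks))   # free physical block IDs
        self.page_tables = {}                        # request_id -> [block IDs]

    def append_token(self, request_id: str, token_index: int) -> int:
        """Return the physical block holding this token,
        allocating a fresh page only at page boundaries."""
        table = self.page_tables.setdefault(request_id, [])
        if token_index % self.block_size == 0:       # first token of a new page
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; preempt or swap")
            table.append(self.free_blocks.pop())
        return table[token_index // self.block_size]

    def release(self, request_id: str) -> None:
        """Reclaim all pages when a request finishes; no copying needed."""
        self.free_blocks.extend(self.page_tables.pop(request_id, []))
```

Because pages are reclaimed as whole blocks on release, freed memory is immediately reusable by any other request, which is what eliminates the fragmentation that contiguous per-sequence buffers suffer from.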
Why it matters and where it’s used
Production LLM endpoints face bursty, mixed-length traffic. Paged Attention increases tokens-per-second, raises capacity under SLAs, and lets providers offer long-context tiers without massive GPU overhead. It powers inference frameworks for chat, RAG, code assistants, and agents running multi-turn sessions.
Examples
- Continuous batching: admit new requests every step by assigning free pages instead of waiting for full batch resets.
- Long-context support: page inactive KV segments to CPU while keeping hot pages on GPU.
- Tenant isolation: per-session page quotas to cap memory; reclaim on timeout.
- Interop: combine with FlashAttention, quantized KV, and speculative decoding for compound gains.
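The continuous-batching example above boils down to a simple admission loop: each decode step, pull waiting requests into the running batch as long as the allocator can cover their KV pages. The function below is a hedged sketch under that assumption; `waiting`, `running`, and `pages_needed` are illustrative names, not a real scheduler interface.

```python
# Sketch of a continuous-batching admission step: admit waiting requests
# page-by-page instead of waiting for a full batch reset. All names here
# are illustrative assumptions.

def admission_step(waiting, running, free_pages, pages_needed):
    """Move requests from the wait queue into the running batch
    while free KV pages can cover the next request's needs."""
    while waiting and free_pages >= pages_needed(waiting[0]):
        req = waiting.pop(0)
        free_pages -= pages_needed(req)
        running.append(req)
    return free_pages
```

A real scheduler would also handle preemption and swapping when `free_pages` runs out, but the core idea is the same: admission is gated by page availability, not by batch boundaries.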
FAQs
- Is it an approximation? No. Paging reorganizes memory layout; attention math is unchanged.
- Does it require specific hardware? No. It runs on any CUDA-capable GPU; gains improve with fast PCIe or NVLink interconnects for spill and swap.
- How does it compare to contiguous KV buffers? It avoids fragmentation and copy overheads, improving throughput at scale.
- Can it degrade latency? Poor paging policies or oversubscription can cause stalls; telemetry-driven tuning is essential.
- Does it help training? Primarily an inference optimization; training uses different memory patterns.
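The "not an approximation" answer above can be checked numerically: scattering KV blocks across non-adjacent physical pages and gathering them through a page table yields bit-identical inputs to the attention computation. The snippet below assumes nothing beyond NumPy; the page-table layout is invented for illustration.

```python
# Numerical check that paging changes memory layout, not attention math:
# attention over KV gathered from scattered pages equals attention over
# the original contiguous buffers.
import numpy as np

def attention(q, K, V):
    """Single-query softmax attention (numerically stabilized)."""
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

rng = np.random.default_rng(0)
d, n, block = 8, 6, 2
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Scatter KV into non-adjacent physical pages; keep a page table.
pool_K = np.zeros((10, block, d))
pool_V = np.zeros((10, block, d))
page_table = [7, 2, 5]                       # logical page i -> physical page
for i, p in enumerate(page_table):
    pool_K[p] = K[i * block:(i + 1) * block]
    pool_V[p] = V[i * block:(i + 1) * block]

# Gather via the page table, then run the unchanged attention math.
K_g = pool_K[page_table].reshape(n, d)
V_g = pool_V[page_table].reshape(n, d)
assert np.allclose(attention(q, K, V), attention(q, K_g, V_g))
```

Production kernels avoid the explicit gather by indexing pages inside the attention kernel itself, but the result is the same.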
