FlashAttention is an IO-aware attention algorithm and set of fused GPU kernels that compute exact softmax attention while minimizing high-cost reads and writes to GPU high-bandwidth memory (HBM). By tiling queries, keys, and values into on-chip SRAM and performing an online, numerically stable softmax, it avoids materializing the full n×n attention matrix, delivering major speed and memory gains for long sequences in training and inference.
What is FlashAttention?
FlashAttention restructures attention into blockwise computation held in shared memory. It streams K/V tiles, computes QK^T in chunks, updates running softmax statistics for stability, applies masks, and multiplies by V—all inside a fused kernel that reduces memory traffic. Successive versions expand support for causal/padding masks, variable-length packing, and mixed precision (BF16/FP16). Because it is exact (unlike sparse/low-rank approximations), model quality is preserved while throughput and attainable context length increase substantially.
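The blockwise schedule above can be sketched in NumPy. This is an illustrative single-head, unmasked sketch of the online-softmax recurrence, not the fused GPU kernel itself: it streams K/V in tiles, keeps a running row max and denominator, and never forms the full n×n matrix. The function names and block size are ours, not from any library.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference implementation: materializes the full n x n score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block=16):
    """Tiled attention with an online (running) softmax.

    Streams K/V tiles and rescales previously accumulated output whenever
    the running row max changes, so the result is exact, not approximate.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running row max (for numerical stability)
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # n x block tile of scores
        m_new = np.maximum(m, S.max(axis=-1))  # updated row max
        alpha = np.exp(m - m_new)              # rescale factor for old stats
        P = np.exp(S - m_new[:, None])         # unnormalized tile probabilities
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

Because the rescaling is exact, the tiled result matches the naive computation to floating-point precision; the real kernels add causal/padding masks and fuse these steps on-chip.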
Where it’s used and why it matters
Attention is bandwidth-bound at long contexts. FlashAttention boosts tokens-per-second, lowers activation memory, and enables larger batches or longer windows on the same hardware. It is widely adopted across pretraining, fine-tuning, and inference, and composes with paged attention, KV-cache quantization, tensor parallelism, and speculative decoding for cumulative gains.
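KV caching, one of the orthogonal optimizations mentioned above, can be sketched to show why it composes cleanly with a fast attention kernel: the cache changes what attention is computed over (the growing prefix), while FlashAttention changes how each attention call is executed. The class and helper below are hypothetical names for illustration only.

```python
import numpy as np

def attend(q, K, V):
    """Single-query softmax attention over cached keys/values (illustrative helper)."""
    s = (K @ q) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V

class KVCache:
    """Minimal decode-time KV cache (hypothetical sketch).

    Each step appends one key/value row and attends the new query over the
    cached prefix, so per-token cost is O(t*d) instead of recomputing O(t^2*d).
    """
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, q, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        return attend(q, self.K, self.V)
```

In a real stack, the `attend` call here is exactly where a FlashAttention kernel slots in, which is why the two optimizations stack rather than conflict.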
Examples
- Long-context serving: support 32K–200K token windows with manageable latency and memory.
- Training: raise effective batch size or sequence length without OOM by cutting activation memory.
- Fine-tuning: run instruction/RAG tuning at longer windows on fewer GPUs.
- Multimodal: accelerate attention over image patches or audio frames in transformers.
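A back-of-envelope calculation makes the memory stakes in these examples concrete. The sizes below are assumptions chosen for illustration (32K context, fp16, a single head); the point is that the n×n matrix FlashAttention avoids materializing grows quadratically.

```python
# Memory to materialize one fp16 n x n attention score matrix
# (assumed sizes; illustrative only).
n_tokens = 32_768            # 32K-token context window
bytes_fp16 = 2               # fp16 element size in bytes
per_head = n_tokens ** 2 * bytes_fp16   # one n x n score matrix, one head
gib = per_head / 2 ** 30
print(f"{gib:.0f} GiB per head, per layer")  # prints "2 GiB per head, per layer"
```

Multiplied across heads and layers, this is why naive attention hits OOM at long contexts while the tiled, non-materializing schedule does not.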
FAQs
- Is it approximate? No—FlashAttention computes exact softmax attention via an IO-efficient schedule.
- Hardware requirements? CUDA GPUs with fast on-chip memory; exposed via PyTorch/xFormers and vendor kernels.
- Does it help short prompts? Benefits grow with sequence length; short contexts see modest speedups.
- Can it combine with other optimizations? Yes—KV caching, quantization, parallelism, and speculative decoding are orthogonal.
- Any caveats? Kernel availability varies by framework/GPU; poor sequence packing/padding reduces gains.
