Speculative decoding is an inference-time acceleration technique where a fast “draft” model proposes multiple next tokens and a larger “target” model verifies them in a single step, accepting the longest matching prefix and falling back when predictions diverge. It reduces latency and cost while preserving the target model’s output distribution when configured correctly.
What is Speculative Decoding?
A smaller draft model (or an early-exit head of the same model) generates a k-token proposal autoregressively. The target model then scores all k proposed positions in a single forward pass, compares its probabilities against the draft's, and accepts the longest prefix that passes a probabilistic acceptance test; at the first rejection it emits a corrected token from its own distribution and drafting resumes from there. Variants include token-wise, block-wise, and tree-speculative approaches that verify several candidate branches at once. The method is orthogonal to KV caching, batching, quantization, and optimized attention kernels, and typically yields 1.3–3× speedups depending on draft–target agreement, sequence length, sampling settings, and hardware.
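The draft–verify–accept loop above can be sketched for the greedy case. This is a minimal illustration with toy deterministic next-token functions standing in for real models; `speculative_decode`, `draft_next`, and `target_next` are hypothetical names for this sketch, not any library's API.

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_new=12):
    """Greedy speculative decoding: the draft proposes k tokens; the
    target 'verifies' them (here, by checking its own greedy choice at
    each position) and keeps the longest matching prefix."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft proposes k tokens autoregressively (cheap model).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target checks each proposed position; a real system does
        #    this in one batched forward pass over all k positions.
        accepted = 0
        for i, t in enumerate(proposal):
            if target_next(seq + proposal[:i]) == t:
                accepted += 1
            else:
                break
        seq += proposal[:accepted]
        # 3) On the first mismatch, fall back to the target's own token.
        if accepted < k:
            seq.append(target_next(seq))
    return seq[len(prompt):len(prompt) + max_new]

# Toy deterministic "models": next token is (last + 1) mod 10; the
# draft diverges from the target whenever the last token is 6.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 6 else (ctx[-1] + 1) % 10
```

Because acceptance stops at the first divergence and the target supplies the corrected token, the output is identical to decoding the target alone, token for token.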
FAQs
- Does it change quality? No. A proper acceptance rule (a rejection-sampling correction against the target's probabilities) makes the output distribution exactly the target model's, for greedy decoding and for temperature/nucleus sampling alike, so quality is preserved.
- Which settings work? Greedy, top-k, nucleus (top-p) sampling, and repetition penalties can all be supported, provided the verifier applies the same decoding policy it would use in standard decoding.
- How do you pick the draft model? Choose a small, fast model whose token-level predictions agree closely with the target's (e.g. one trained on similar data or distilled from the target); the acceptance rate governs the accuracy–speed trade-off.
- Can it run on one GPU? Yes, via early-exit layers as the draft (“self-speculative”) or by hosting the draft and target on the same device with careful scheduling.
- What are common pitfalls? Low acceptance rates from poor draft–target alignment, communication overhead when the draft and target sit on different devices, and mismatched sampling configurations between draft and verifier.
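The acceptance rule behind the "identical distribution" guarantee in the first FAQ is a small piece of rejection sampling. The sketch below assumes the target and draft distributions are given as plain probability lists `p` and `q` over the same vocabulary; `accept_or_resample` is an illustrative name, not a real API.

```python
import random

def accept_or_resample(p, q, rng):
    """One verification step of speculative sampling: draw x ~ q (the
    draft), accept it with probability min(1, p[x]/q[x]); otherwise
    resample from the normalized residual max(p - q, 0). The emitted
    token is then distributed exactly according to p.
    Assumes q[x] > 0 for any token the draft can actually propose."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]
```

Running many steps with, say, p = [0.7, 0.2, 0.1] and q = [0.4, 0.4, 0.2] recovers p to within sampling noise, even though every candidate token was first drawn from q.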
Why it matters and where it’s used
Speculative decoding cuts time-to-first-token and overall latency for chat assistants, copilots, and streaming endpoints. It boosts tokens-per-second without retraining, enabling lower p95 latency, higher throughput, and cost efficiency in production inference stacks. It's especially beneficial for long generations and for latency-sensitive serving where decoding is memory-bandwidth-bound.
Examples
- 7B draft + 70B target: accept long prefixes to achieve ~2× end-to-end speedup on standard chat workloads.
- Self-speculative: use early-layer predictions of the same model as the draft to avoid hosting two models.
- Tree-speculative: branch proposals to increase acceptance probability for creative writing while preserving target distribution.
- Combined optimizations: pair with KV-caching, FlashAttention, and quantization for cumulative gains.
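A back-of-envelope model helps reason about gains like the ~2× figure above. Under the common simplifying assumption that each drafted token is accepted independently with probability α, the expected tokens produced per target verification is (1 − α^(k+1)) / (1 − α). The function names, and the values of α, k, and the draft cost ratio c below, are illustrative.

```python
def expected_tokens_per_target_pass(alpha, k):
    """Expected tokens emitted per target verification (accepted
    drafted tokens plus the fallback/bonus token), assuming each
    drafted token is accepted i.i.d. with probability alpha:
        1 + alpha + alpha**2 + ... + alpha**k
      = (1 - alpha**(k + 1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, c):
    """Rough speedup vs. plain decoding: tokens per iteration divided
    by iteration cost (k draft steps at relative cost c plus one
    target pass). Ignores verification and scheduling overheads."""
    return expected_tokens_per_target_pass(alpha, k) / (k * c + 1)
```

With α = 0.8, k = 4, and a draft costing c = 5% of a target pass, this predicts roughly 2.8×, in the same ballpark as the ~2× figure above once real-world overheads are accounted for.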
