Back to Blog

KV Cache Optimization

· 6 min read

During autoregressive decoding, each new token attends to all previous tokens. Without caching, keys and values for the entire prefix must be recomputed at every step: step 1 costs O(1)O(1), step 2 costs O(2)O(2), and so on, summing to O(n2)O(n^2) total for a sequence of length nn.

Caching helps: once kt=htWKk_t = h_t W_K and vt=htWVv_t = h_t W_V have been computed for a token, they can be stored and future steps simply append to the cache and attend over it. The total cost is still O(n2)O(n^2) since each step attends to more tokens, but recomputation from scratch is avoided.

Storing the cache can get memory-hungry. For Llama 3 70B with 80 layers, 64 heads, dk=128d_k = 128, a batch of 8 sequences at length 1000 in fp16 needs about 21GB just for the KV cache:

KV cache=2×L×B×S×H×dk×bytes=2×80×8×1000×64×128×221GB\text{KV cache} = 2 \times L \times B \times S \times H \times d_k \times \text{bytes} = 2 \times 80 \times 8 \times 1000 \times 64 \times 128 \times 2 \approx 21\text{GB}

This scales linearly with sequence length and batch size, and quickly becomes the bottleneck.

A substantial body of work has emerged on making KV caches smaller. This post covers 11 papers, roughly organized into three approaches: pruning tokens from the cache, compressing the cache post-hoc, and redesigning the attention mechanism itself.


Token selection

Not all cached tokens matter equally. Identifying which ones are important allows the rest to be evicted, keeping the cache bounded.

H2O observes that attention scores follow a power-law: a small set of "heavy-hitter" tokens absorb most of the attention mass. Eviction is framed as submodular maximization over accumulated attention scores:

St=argmaxSSt1{xt},SkiSAiS_t = \text{argmax}_{S \subseteq S_{t-1} \cup \{x_t\}, |S| \leq k} \sum_{i \in S} A_i

This gets 5× cache reduction with minimal quality loss.

StreamingLLM tackles infinite-length sequences. It turns out initial tokens act as "attention sinks": they accumulate attention regardless of content, and removing them breaks the model even with sliding window attention. The fix is to keep a few initial sink tokens plus a rolling window of recent tokens. This yields constant memory at arbitrary length.

VATP refines the importance metric. H2O only looks at attention scores, but what actually matters for the output is attention × value. VATP scores tokens by Ikt=Sktvk1I_k^t = S_k^t \cdot \|v_k\|_1, combining accumulated attention with value magnitude. This beats H2O on most LongBench tasks at 50% compression.

MethodCache ReductionOverheadKey Insight
H2OLowHeavy-hitters follow power-law
StreamingLLMFixed budgetMinimalInitial tokens are attention sinks
VATPLowAttention × value norm

Post-hoc compression

These methods compress existing caches without changing the architecture, which is useful when retraining is not an option.

FastGen observes that different attention heads exhibit different patterns: some are local, some global, some focus on special tokens. During prefill, FastGen profiles each head and picks a compression policy (keep special tokens, sliding window, top-k by attention, or combinations). No training is needed, just runtime adaptation.

DMC learns to merge adjacent cache entries. At each step, the model predicts whether to merge with the previous entry (αt\alpha_t) and with what weight (ωt\omega_t), both derived from the first dimension of k and q. When merging:

k=ωtkt+zt1kt1ωt+zt1,v=ωtvt+zt1vt1ωt+zt1k' = \frac{\omega_t k_t + z_{t-1} k_{t-1}}{\omega_t + z_{t-1}}, \quad v' = \frac{\omega_t v_t + z_{t-1} v_{t-1}}{\omega_t + z_{t-1}}

Training uses Gumbel-Sigmoid relaxation. Gets up to 8× compression but requires fine-tuning.

L2L_2 Norm-Based Compression is the simplest approach in this group. The observation is that keys with low L2L_2 norm tend to receive high attention (softmax pushes probability toward smaller dot products when queries are normalized). The strategy reduces to keeping the keys with lowest norms—no statistics tracking required, and it remains compatible with FlashAttention.

MethodCache ReductionTraining RequiredKey Insight
FastGenConfigurableNoPer-head attention profiling
DMCUp to 8×YesLearned merging
L2 NormConfigurableNoLow norm ↔ high attention

Architectural changes

These require training from scratch (or significant fine-tuning).

MQA is the classic: all query heads share a single key-value head. This yields an H×H\times cache reduction at the cost of expressiveness; quality drops on tasks that need diverse attention patterns.

GQA interpolates between MQA and MHA: query heads are grouped, with each group sharing a KV head. With GG groups the reduction is H/GH/G. A practical benefit is that existing MHA models can be converted by mean-pooling KV heads within groups, then finetuning for a small fraction of the original compute.

MLA (from DeepSeek-V2) takes a different approach. Instead of sharing heads, it compresses keys and values into a low-rank latent space:

cKV,t=WDKVht,kC=WUKcKV,t,vC=WUVcKV,tc_{\text{KV}, t} = W_{\text{DKV}} h_t, \quad k_C = W_{\text{UK}} c_{\text{KV}, t}, \quad v_C = W_{\text{UV}} c_{\text{KV}, t}

Only cKVc_{\text{KV}} is cached (plus a small RoPE component for position), and per-head keys and values are reconstructed during attention. Per-head expressiveness is preserved while storing far less.

SnapKV targets prompt compression. During prefill, it examines attention patterns in an "observation window" at the end of the prompt to identify which earlier positions matter, then retains only those. The method plugs into existing models without retraining.

YOCO rethinks the whole architecture. Standard transformers have KV caches scaling as O(N×L)O(N \times L), which is sequence length times layers. YOCO uses a decoder-decoder design where a "self-decoder" with efficient attention (sliding window or gated retention) produces a single global KV cache that a "cross-decoder" attends to. This drops memory to O(N+L)O(N + L).

MethodCache ReductionQuality Trade-offKey Insight
MQANotableSingle shared KV head
GQAH/G×MinimalGrouped sharing
MLA~5×PreservedLatent compression
SnapKVConfigurableMinimalObservation window
YOCOArch-dependentSingle global cache

Production systems often stack these: GQA at the architecture level, plus something like H2O or SnapKV at runtime for additional savings.