KV Cache Optimization
During autoregressive decoding, each new token attends to all previous tokens. Without caching, you'd recompute keys and values for the entire prefix at every step: step 1 costs $O(1)$, step 2 costs $O(2)$, and so on, summing to $O(n^2)$ total for a sequence of length $n$.
The fix is obvious: cache the keys and values. Once you've computed $k_t$ and $v_t$ for a token, store them. Future steps just append to the cache and attend over it. You still pay $O(n^2)$ total attention cost (you're attending to more tokens each step), but you avoid recomputing everything from scratch.
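A minimal sketch of a cached decode step in PyTorch (single head, no batching; the random projection matrices and dimensions are stand-ins, not any particular model's):

```python
import torch

d = 64                                        # head dimension (illustrative)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
K_cache = torch.empty(0, d)                   # grows by one row per decoded token
V_cache = torch.empty(0, d)

def decode_step(x):
    """One autoregressive step: project only the new token, append to the
    cache, then attend over everything cached so far."""
    global K_cache, V_cache
    q, k, v = x @ W_q, x @ W_k, x @ W_v       # O(1) projections per step, not O(t)
    K_cache = torch.cat([K_cache, k.unsqueeze(0)])
    V_cache = torch.cat([V_cache, v.unsqueeze(0)])
    attn = torch.softmax(q @ K_cache.T / d ** 0.5, dim=-1)   # attend over t cached tokens
    return attn @ V_cache                     # context vector for the new token

for _ in range(5):                            # cache length grows: 1, 2, 3, 4, 5
    out = decode_step(torch.randn(d))
```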
The catch is memory. For Llama 3 70B's dimensions (80 layers, 64 heads, $d_{\text{head}} = 128$) with full multi-head attention, a batch of 8 sequences at length 1000 in fp16 needs about 21 GB just for the KV cache. This scales linearly with sequence length and batch size, and quickly becomes the bottleneck.
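The arithmetic behind that number, as a quick sanity check (2 bytes per fp16 element, one K and one V tensor per layer):

```python
layers, heads, d_head = 80, 64, 128
batch, seq_len = 8, 1000
bytes_per_el = 2                              # fp16
kv = 2                                        # one K tensor and one V tensor per layer

cache_bytes = kv * layers * heads * d_head * batch * seq_len * bytes_per_el
print(f"{cache_bytes / 1e9:.1f} GB")          # ~21.0 GB
```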
So there's been a lot of work on making KV caches smaller. I'll cover 11 papers here, roughly organized into three approaches: pruning tokens from the cache, compressing the cache post-hoc, and redesigning the attention mechanism itself.
Token selection
The idea here is simple: not all cached tokens matter equally. If you can identify which ones are important, you can evict the rest and keep a bounded cache.
H2O observes that attention scores follow a power-law: a small set of "heavy-hitter" tokens gets most of the attention mass. They frame eviction as greedy submodular maximization over accumulated attention scores, keeping the tokens with the largest running attention mass and evicting the smallest:

$$\text{score}(i) = \sum_{s \le t} A_{s,i}, \qquad \text{evict } \arg\min_{i \,\in\, \text{cache}} \text{score}(i)$$
This gets 5× cache reduction with minimal quality loss.
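A sketch of the greedy heavy-hitter policy under those definitions (not the paper's implementation; `budget` and the running `scores` tensor are illustrative):

```python
import torch

def h2o_step(K, V, scores, attn_row, budget):
    """One eviction step. K, V: (t, d) cached keys/values;
    scores: (t,) accumulated attention mass per cached token;
    attn_row: (t,) attention the newest query paid to each cached token."""
    scores = scores + attn_row                  # accumulate the heavy-hitter statistic
    if K.shape[0] <= budget:
        return K, V, scores
    keep = torch.topk(scores, budget).indices.sort().values   # drop the lowest-scoring tokens
    return K[keep], V[keep], scores[keep]
```

The paper's recipe also reserves part of the budget for the most recent tokens, which this sketch omits.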

StreamingLLM tackles infinite-length sequences. The key observation is that initial tokens act as "attention sinks"—they accumulate attention regardless of content, and removing them breaks the model even with sliding window attention. The fix is straightforward: keep a few initial sink tokens plus a rolling window of recent tokens. This enables constant-memory generation over arbitrarily long sequences.
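A sketch of the cache policy (the sink count and window size here are illustrative defaults, not tuned values):

```python
import torch

def streaming_keep(K, V, n_sink=4, window=1020):
    """Keep a few initial 'attention sink' tokens plus a rolling window of
    the most recent tokens; everything in between is dropped."""
    t = K.shape[0]
    if t <= n_sink + window:
        return K, V
    idx = torch.cat([torch.arange(n_sink), torch.arange(t - window, t)])
    return K[idx], V[idx]

K, V = torch.randn(5000, 128), torch.randn(5000, 128)
K, V = streaming_keep(K, V)        # cache size is now fixed at n_sink + window
```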

VATP refines the importance metric. H2O only looks at attention scores, but what matters for the output is attention × value. So they score tokens by $\sum_{s} A_{s,i} \cdot \lVert v_i \rVert_1$, combining accumulated attention with value magnitude. This beats H2O on most LongBench tasks at 50% compression.
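Relative to H2O, the scoring change is essentially one line (a sketch using the ℓ1 norm of each value vector as the magnitude term):

```python
import torch

def vatp_scores(accum_attn, V):
    """accum_attn: (t,) accumulated attention per cached token (as in H2O);
    V: (t, d) cached values. Importance = attention mass x value magnitude."""
    return accum_attn * V.abs().sum(dim=-1)    # l1 norm of each value vector
```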

| Method | Cache Reduction | Overhead | Key Insight |
|---|---|---|---|
| H2O | 5× | Low | Heavy-hitters follow power-law |
| StreamingLLM | Fixed budget | Minimal | Initial tokens are attention sinks |
| VATP | 2× | Low | Attention × value norm |
Post-hoc compression
These methods compress existing caches without changing the architecture. Useful when you can't retrain.
FastGen notices that different attention heads have different patterns—some are local, some global, some focus on special tokens. During prefill, it profiles each head and picks a compression policy (keep special tokens, sliding window, top-k by attention, or combinations). No training required, just runtime adaptation.
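Roughly what the prefill-time profiling could look like (a simplified sketch, not FastGen's actual policy set or API; the policy names, window size, and recovery threshold are illustrative):

```python
import torch

def pick_policy(attn, special_mask, window=64, threshold=0.95):
    """attn: (q, t) prefill attention of one head (rows sum to 1).
    special_mask: (t,) bool, True at special/punctuation tokens.
    Return the cheapest policy that recovers >= `threshold` of the attention mass."""
    t = attn.shape[1]
    mass = attn.mean(dim=0)                    # average attention this head gives each position
    local = torch.zeros(t, dtype=torch.bool)
    local[-window:] = True
    policies = [                               # ordered roughly cheapest-first
        ("special_only", special_mask),
        ("local_window", local),
        ("special_plus_local", special_mask | local),
        ("full_cache", torch.ones(t, dtype=torch.bool)),
    ]
    for name, keep in policies:
        if mass[keep].sum() >= threshold:
            return name, keep
    return policies[-1]
```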

DMC learns to merge adjacent cache entries. At each step, the model predicts whether to merge with the previous entry (a decision $\alpha_t$) and with what weight ($\omega_t$), both derived from the first dimension of $k_t$ and $q_t$. When merging, the last cached entry becomes a weighted running average:

$$\bar{k} \leftarrow \frac{z\,\bar{k} + \omega_t k_t}{z + \omega_t}, \qquad \bar{v} \leftarrow \frac{z\,\bar{v} + \omega_t v_t}{z + \omega_t}, \qquad z \leftarrow z + \omega_t$$

where $z$ is the running sum of merge weights for that entry.
Training uses Gumbel-Sigmoid relaxation. Gets up to 8× compression but requires fine-tuning.
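The same update in cache form (a sketch with hard 0/1 merge decisions, as used at inference time; `z` tracks the running weight sum per compressed entry):

```python
import torch

def dmc_update(K, V, z, k_t, v_t, alpha_t, omega_t):
    """K, V: (m, d) compressed cache; z: (m,) running weight sums.
    alpha_t: 0/1 merge decision; omega_t: importance weight of the new token."""
    if alpha_t == 1 and K.shape[0] > 0:          # merge into the last cached entry
        K[-1] = (z[-1] * K[-1] + omega_t * k_t) / (z[-1] + omega_t)
        V[-1] = (z[-1] * V[-1] + omega_t * v_t) / (z[-1] + omega_t)
        z[-1] = z[-1] + omega_t
    else:                                        # start a new entry
        K = torch.cat([K, k_t.unsqueeze(0)])
        V = torch.cat([V, v_t.unsqueeze(0)])
        z = torch.cat([z, torch.tensor([omega_t])])
    return K, V, z
```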

Norm-Based Compression is the simplest approach here. They observe that keys with a low L2 norm tend to receive high attention (an empirical correlation rather than a theoretical guarantee). So just keep the keys with the lowest norms. No attention-score tracking is needed, which is why it composes with FlashAttention, which never materializes the attention matrix.
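A sketch of the eviction rule (the keep ratio is illustrative):

```python
import torch

def l2_compress(K, V, keep_ratio=0.5):
    """Keep the cached keys (and their values) with the smallest L2 norm."""
    budget = max(1, int(K.shape[0] * keep_ratio))
    keep = torch.topk(K.norm(dim=-1), budget, largest=False).indices.sort().values
    return K[keep], V[keep]
```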

| Method | Cache Reduction | Training Required | Key Insight |
|---|---|---|---|
| FastGen | Configurable | No | Per-head attention profiling |
| DMC | Up to 8× | Yes | Learned merging |
| L2 Norm | Configurable | No | Low norm ↔ high attention |
Architectural changes
These require training from scratch (or significant fine-tuning) but can get bigger wins.
MQA is the classic: all query heads share a single key-value head. You get an $H\times$ cache reduction (for $H$ query heads) but lose expressiveness; quality drops on tasks needing diverse attention patterns.

GQA interpolates: group the query heads and have each group share a KV head. With $G$ groups you get an $H/G\times$ reduction. The nice thing is you can convert existing MHA models by mean-pooling the KV heads within each group, then "uptrain" for ~5% of the original compute.
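Shape-wise, grouping just repeats each KV head across its group of query heads at attention time (a toy sketch with $H = 8$ query heads and $G = 2$ KV heads; MQA is the $G = 1$ case):

```python
import torch

H, G, d, t = 8, 2, 64, 16          # 8 query heads, 2 KV heads; MQA is the G = 1 case

q = torch.randn(H, 1, d)           # one new query per head
K = torch.randn(G, t, d)           # the cache stores only G KV heads -> H/G x smaller
V = torch.randn(G, t, d)

# Each group of H // G query heads shares one KV head.
K_full = K.repeat_interleave(H // G, dim=0)                            # (H, t, d)
V_full = V.repeat_interleave(H // G, dim=0)
attn = torch.softmax(q @ K_full.transpose(-1, -2) / d ** 0.5, dim=-1)  # (H, 1, t)
out = attn @ V_full                                                    # (H, 1, d)
```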

MLA (from DeepSeek-V2) takes a different approach. Instead of sharing heads, it compresses keys and values into a low-rank latent space:

$$c_t^{KV} = W^{DKV} h_t, \qquad k_t^C = W^{UK} c_t^{KV}, \qquad v_t^C = W^{UV} c_t^{KV}$$

You cache only $c_t^{KV}$ (plus a small RoPE component for position), then reconstruct per-head keys and values during attention. This preserves per-head expressiveness while compressing storage.
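A sketch of the compress-then-reconstruct path (dimensions are illustrative, and the decoupled RoPE key that MLA adds for position is omitted):

```python
import torch

d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128
W_dkv = torch.randn(d_latent, d_model)            # down-projection (the cached side)
W_uk = torch.randn(n_heads * d_head, d_latent)    # up-projection to per-head keys
W_uv = torch.randn(n_heads * d_head, d_latent)    # up-projection to per-head values

h_t = torch.randn(d_model)
c_t = W_dkv @ h_t                                 # cache THIS: d_latent floats per token,
                                                  # vs. 2 * n_heads * d_head for full MHA
k_t = (W_uk @ c_t).view(n_heads, d_head)          # reconstructed at attention time
v_t = (W_uv @ c_t).view(n_heads, d_head)
```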

SnapKV targets prompt compression. During prefill, it looks at attention patterns in an "observation window" at the end of the prompt to identify which earlier positions matter, then keeps only those. Integrates with existing models without retraining.
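A sketch of the selection step (the window size, budget, and pooling kernel are illustrative; the real method also always keeps the observation window itself):

```python
import torch
import torch.nn.functional as F

def snapkv_select(attn, obs_window=32, budget=1024, pool=7):
    """attn: (q, t) prefill attention. The last `obs_window` query rows vote on
    which prefix positions to keep; max-pooling keeps small local clusters together."""
    votes = attn[-obs_window:].sum(dim=0)                         # (t,)
    votes = F.max_pool1d(votes.view(1, 1, -1), pool, stride=1, padding=pool // 2).view(-1)
    keep = torch.topk(votes, min(budget, votes.numel())).indices.sort().values
    return keep                                                   # cache only these positions
```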

YOCO rethinks the whole architecture. Standard transformers have KV caches scaling as $O(N \cdot L)$ (sequence length times layers). YOCO uses a decoder-decoder design where a "self-decoder" with efficient attention (sliding window or gated retention) produces a single global KV cache that a "cross-decoder" attends to. This drops memory to $O(N)$.
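Schematically, with causal masking omitted and a toy MLP standing in for the self-decoder (a sketch of the data flow, not the paper's implementation; the stand-in is not sliding-window attention or gated retention):

```python
import torch
import torch.nn as nn

class CrossDecoderLayer(nn.Module):
    """Cross-attends to the single shared K, V produced once by the self-decoder."""
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.out = nn.Linear(d, d)

    def forward(self, x, K, V):
        q = self.q_proj(x)
        attn = torch.softmax(q @ K.transpose(-1, -2) / K.shape[-1] ** 0.5, dim=-1)
        return x + self.out(attn @ V)

d, n_cross, seq = 64, 4, 128
x = torch.randn(seq, d)

# Self-decoder stand-in. In YOCO this is sliding-window attention or gated
# retention, whose per-layer state is constant-size rather than O(N).
self_decoder = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
h = self_decoder(x)

# The ONE global KV cache: produced once, reused by every cross-decoder layer.
k_proj, v_proj = nn.Linear(d, d), nn.Linear(d, d)
K, V = k_proj(h), v_proj(h)                     # O(N) memory, not O(N * L)

y = h
for layer in [CrossDecoderLayer(d) for _ in range(n_cross)]:
    y = layer(y, K, V)                          # cross-decoder layers add no new KV cache
```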

| Method | Cache Reduction | Quality Trade-off | Key Insight |
|---|---|---|---|
| MQA | H× | Notable | Single shared KV head |
| GQA | H/G× | Minimal | Grouped sharing |
| MLA | ~5× | Preserved | Latent compression |
| SnapKV | Configurable | Minimal | Observation window |
| YOCO | L× | Arch-dependent | Single global cache |
In practice, production systems often combine approaches—GQA at the architecture level, plus something like H2O or SnapKV at runtime for additional savings.