Causes of entropy collapse

Entropy collapse in LLM RL is the failure mode where a policy's output distribution progressively narrows over training. Its telltale signature is pass@1 climbing while pass@k stagnates.
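
To see the signature in numbers, here is a minimal sketch using the standard unbiased pass@k estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$; the per-problem counts are invented for illustration, not taken from any paper.

```python
from math import comb
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical correct counts per problem, out of n = 16 samples each.
# "Early" spreads correctness thinly across problems; "late" concentrates
# it on a few, the pattern a narrowing policy produces.
n = 16
runs = {"early": [2, 2, 2, 2], "late": [6, 6, 1, 1]}
for name, counts in runs.items():
    p1 = np.mean([pass_at_k(n, c, 1) for c in counts])
    p8 = np.mean([pass_at_k(n, c, 8) for c in counts])
    print(f"{name}: pass@1 = {p1:.3f}  pass@8 = {p8:.3f}")
# early: pass@1 = 0.125  pass@8 = 0.767
# late:  pass@1 = 0.219  pass@8 = 0.748  (pass@1 up, pass@k flat)
```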

The literature points to the following causes:

  • The gradient depletes entropy by construction. For softmax policies, $\Delta H \approx -\mathbb{E}[\mathrm{Cov}(\log \pi, A)]$: each PG step mechanically reinforces already-confident, high-advantage tokens, and the rare-but-good tokens that would push entropy back up are too scarce to compensate. Off-policy variants (PPO, GRPO) take multiple updates per trajectory batch and amplify this; strictly on-policy RLOO loses far less entropy under the same setup. Delightful Policy Gradient targets this asymmetry by gating each update with a sigmoid of advantage × action surprisal $-\log \pi$, amplifying those rare-but-good tokens (breakthroughs); see the first sketch after this list. [1, 8, 10, 11]
  • PPO clipping is asymmetric. With $\epsilon = 0.2$, an exploit token at $p = 0.9$ has a clipped ceiling of $0.9 \times 1.2 = 1.08$, so the clip never binds; an explore token at $p = 0.01$ can rise only to $0.012$. The surrogate's geometry caps the upward motion of low-probability tokens (second sketch below). [2, 10]
  • Numerical precision quietly biases entropy downward. Computing the importance ratio $r = \pi_\theta^{\mathrm{new}} / \pi_\theta^{\mathrm{old}}$ in BF16 gives an upward multiplicative bias proportional to $r$, so the upper clip triggers more often than the lower one, the opposite of DAPO's intended asymmetry. Combined with the vLLM-vs-training forward-pass mismatch, this can flip an algorithm from preserving entropy to collapsing it; switching to FP16 and fixing the cast is reportedly enough to undo it (third sketch below). [10]
  • The objective's optimum is already collapsed. With KL regularization to $\pi_{\mathrm{ref}}$ and verifiable rewards, the global optimum is $\pi^* \propto \pi_{\mathrm{ref}} \cdot \exp(R/\beta)$, a single sharp peak at any sane $\beta$. Pass@1 reward is unimodal by design too. No optimizer fixes this without changing the objective; standard PG is just the first-order term of a diversity-preserving maximum-likelihood objective (fourth sketch below). [4, 5, 6]
  • RL only reweights within the base model's support. On-policy gradients reinforce what the policy already samples, so mass moves between existing modes rather than into new ones. Post-RLVR responses turn out to be a high-likelihood subset of the base model's outputs. [3, 7]
  • A fixed prompt set turns into memorization. With limited problems and verifiable rewards, the policy converges onto a few memorized correct trajectories per prompt: reward-hacking via the data, upstream of the algorithm. [9]
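
Four sketches to ground the bullets above. First, the gradient mechanism: the covariance term from [1] as a drift diagnostic, plus the sigmoid gate described for [11]. The log-probs and advantages are synthetic placeholders, and the gate is my reading of the formula as stated above, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-token stats standing in for one rollout batch:
# log-probs correlated with advantages, the regime where entropy drains.
logp = rng.uniform(-8.0, -0.1, size=4096)            # log pi(token)
adv = 0.4 * (logp - logp.mean()) + rng.normal(0.0, 1.0, size=4096)

# Diagnostic from [1]: per-step entropy drift is roughly -Cov(log pi, A),
# so logging this covariance alongside the loss gives early warning.
cov = np.cov(logp, adv)[0, 1]
print(f"Cov(log pi, A) = {cov:+.3f}  ->  predicted dH ~ {-cov:+.3f}")

# Gate from [11] as stated above: sigmoid(advantage * surprisal).
# Two tokens with the same advantage A = 1:
for p in (0.6, 0.005):                               # confident vs rare
    x = 1.0 * -np.log(p)                             # A times surprisal
    gate = 1.0 / (1.0 + np.exp(-x))
    print(f"pi = {p}: gate = {gate:.3f}")
# pi = 0.6:   gate ~ 0.63 (confident token, mild weight)
# pi = 0.005: gate ~ 0.99 (rare-but-good token, near-full weight)
```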
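
Second, the clip geometry, which is plain arithmetic:

```python
# PPO clips the ratio r = pi_new / pi_old to [1 - eps, 1 + eps], so one
# step can raise a token's probability by at most eps * p in absolute terms.
eps = 0.2
for p in (0.9, 0.01):
    ceiling = min(1.0, p * (1 + eps))
    print(f"p = {p}: per-step ceiling {ceiling:.3f} (+{ceiling - p:.3f})")
# p = 0.9:  ceiling 1.000 (+0.100), the 1.08 ratio bound never binds
# p = 0.01: ceiling 0.012 (+0.002), exploration inches upward
```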
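
Third, a probe for the precision issue. This is not [10]'s code; it is a self-contained PyTorch snippet of my own construction that compares clip-boundary hit rates when the ratio is computed in BF16 versus FP32, on synthetic log-probs. To audit a live run you would substitute the trainer's and sampler's real per-token log-probs.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-ins for old/new per-token log-probs.
logp_old = torch.empty(1_000_000).uniform_(-8.0, -0.1)
logp_new = logp_old + torch.empty_like(logp_old).uniform_(-0.3, 0.3)

def clip_rates(dtype: torch.dtype, eps: float = 0.2):
    # Compute r = exp(logp_new - logp_old) at the given precision,
    # then count how often each clip boundary fires.
    r = torch.exp(logp_new.to(dtype) - logp_old.to(dtype)).float()
    upper = (r > 1 + eps).float().mean().item()
    lower = (r < 1 - eps).float().mean().item()
    return upper, lower

for dt in (torch.float32, torch.bfloat16):
    upper, lower = clip_rates(dt)
    print(f"{dt}: upper-clip rate {upper:.4f}, lower-clip rate {lower:.4f}")
# If the bias [10] describes is present in your stack, the BF16
# upper-clip rate drifts above the FP32 one.
```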
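
Last, the objective-level point, checked on a five-answer toy problem with invented distributions: the entropy of $\pi^* \propto \pi_{\mathrm{ref}} \cdot \exp(R/\beta)$ is already far below the reference's before any optimizer dynamics enter.

```python
import numpy as np

# Five candidate answers; the last two are verifiably correct (R = 1).
pi_ref = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
R = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

def optimum(beta: float) -> np.ndarray:
    """Global optimum of KL-regularized RL: pi* ∝ pi_ref * exp(R / beta)."""
    p = pi_ref * np.exp(R / beta)
    return p / p.sum()

def entropy(p: np.ndarray) -> float:
    return float(-(p * np.log(p + 1e-12)).sum())

print(f"reference: H = {entropy(pi_ref):.3f}")
for beta in (1.0, 0.3, 0.1, 0.03):
    p = optimum(beta)
    print(f"beta = {beta}: H = {entropy(p):.3f}, p = {np.round(p, 3)}")
# As beta shrinks, all mass lands on the correct answers in pi_ref's 2:1
# proportions (H -> ~0.64 vs. 1.39 for the reference); any reward that
# differentiates among correct answers sharpens this to a single peak.
```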

References

  1. The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (Cui et al., 2025)
  2. DAPO: An Open-Source LLM Reinforcement Learning System at Scale (Yu et al., 2025)
  3. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (Yue et al., 2025)
  4. KL-Regularized Reinforcement Learning is Designed to Mode Collapse (GX-Chen et al., 2025)
  5. Maximum Likelihood Reinforcement Learning (Tajwar et al., 2026)
  6. Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models (Chen et al., 2025)
  7. Understanding the Effects of RLHF on LLM Generalisation and Diversity (Kirk et al., 2024)
  8. Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs (Wen et al., 2025)
  9. Beyond Pass@1: Self-play with Variational Problem Synthesis Sustains RLVR (Liang et al., 2025)
  10. Entropy-preserving Reinforcement Learning (Petrenko et al., 2026)
  11. Delightful Policy Gradient (Osband, 2026)