Causes of entropy collapse
Entropy collapse in LLM RL is the progressive narrowing of a policy's output distribution over training. In practice it shows up as pass@1 climbing while pass@k stagnates.
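For concreteness, pass@k is typically estimated from n samples per problem with the standard unbiased estimator $1 - \binom{n-c}{k}/\binom{n}{k}$, where c of the n samples are correct. A minimal sketch; the helper name and sample counts are illustrative:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: chance that at least one of k samples
    drawn without replacement from n generations is correct, given that
    c of the n are correct.  Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Entropy-collapse signature: c/n rises on problems the model already solves
# (pass@1 up), while the set of problems with any correct sample stops
# growing (pass@k flat).
print(pass_at_k(n=64, c=8, k=1))   # 0.125
print(pass_at_k(n=64, c=8, k=16))  # ~0.91
```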
The literature points to the following causes:
- The gradient depletes entropy by construction. For softmax policies, the per-step entropy change satisfies $\Delta\mathcal{H} \approx -\eta\,\mathrm{Cov}_{a\sim\pi_\theta(\cdot\mid s)}\!\big(\log \pi_\theta(a\mid s),\, A(s,a)\big)$. Each PG step mechanically reinforces already-confident, high-advantage tokens; rare-but-good tokens that would push entropy back up are too scarce to compensate. Off-policy variants (PPO, GRPO) take multiple updates per trajectory batch and amplify this; strictly on-policy RLOO loses far less entropy under the same setup. Delightful Policy Gradient (DG) targets this asymmetry by gating each update with a sigmoid of advantage × action surprisal, amplifying those rare-but-good tokens (breakthroughs). A toy numerical check of the covariance identity follows the list. [1, 8, 10, 11]
- PPO clipping is asymmetric. With $\epsilon = 0.2$, an exploit token at $\pi_{\text{old}} = 0.9$ can rise to $0.9 \times 1.2 = 1.08$ (clipped); an explore token at $\pi_{\text{old}} = 0.01$ can rise only to $0.012$. The surrogate's geometry caps upward motion of low-probability tokens; a numeric sketch of these ceilings also follows the list. [2, 10]
- Numerical precision quietly biases entropy downward. Computing the importance ratio $\pi_\theta / \pi_{\text{old}}$ in BF16 introduces an upward multiplicative bias, so the upper clip triggers more often than the lower one, the opposite of DAPO's intended asymmetry. Combined with the vLLM-vs-training forward-pass mismatch, this can flip an algorithm from preserving entropy to collapsing it; switching to FP16 and fixing the cast is reportedly enough to undo it (a toy diagnostic follows the list). [10]
- The objective's optimum is already collapsed. With KL regularization to $\pi_{\text{ref}}$ and verifiable rewards, the global optimum is $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big)$, a sharply peaked distribution at any sane $\beta$ (sketched on a toy example after the list). Pass@1 reward is unimodal-by-design too. No optimizer fixes this without changing the objective; standard PG is just the first-order term of a diversity-preserving maximum-likelihood objective. [4, 5, 6]
- RL only reweights within the base model's support. On-policy gradients reinforce what the policy already samples, so mass moves between existing modes rather than into new ones. Post-RLVR responses turn out to be a high-likelihood subset of the base model's outputs. [3, 7]
- A fixed prompt set turns into memorization. With limited problems and verifiable rewards, the policy converges onto a few memorized correct trajectories per prompt. Reward-hacking via the data, upstream of the algorithm. [9]
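On the gradient-depletion point, the covariance identity can be checked numerically on a single softmax step. A minimal sketch, assuming a natural-PG-style logit update $\Delta z_a = \eta A(a)$ as in the entropy analysis; the vocabulary size, step size, and advantages are all made up:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p: np.ndarray) -> float:
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
eta, vocab = 0.05, 50                  # step size, toy vocabulary size

logits = rng.normal(size=vocab)
pi = softmax(logits)

# Advantages correlated with confidence: tokens the policy already likes score well.
adv = 2.0 * (np.log(pi) - np.log(pi).mean()) + rng.normal(scale=0.1, size=vocab)

# One natural-PG-style step on the logits: delta_z = eta * A.
new_pi = softmax(logits + eta * adv)

# Cov_{a ~ pi}(log pi(a), A(a))
cov = np.sum(pi * np.log(pi) * adv) - np.sum(pi * np.log(pi)) * np.sum(pi * adv)

print("actual dH  :", entropy(new_pi) - entropy(pi))  # negative: entropy shrinks
print("-eta * Cov :", -eta * cov)                     # first-order prediction; agreement improves as eta -> 0
```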
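The clipping asymmetry is just arithmetic. A tiny sketch of the ceiling the clipped ratio places on each token's updated probability; the helper name is illustrative and the probabilities mirror the example above:

```python
EPS = 0.2  # standard PPO clip range

def max_new_prob(p_old: float, eps: float = EPS) -> float:
    """Ceiling the clipped surrogate places on a token's updated probability:
    the importance ratio pi_new / pi_old is cut off at 1 + eps."""
    return min(1.0, p_old * (1.0 + eps))

for p_old in (0.9, 0.1, 0.01):
    print(f"pi_old={p_old:<5} -> max pi_new={max_new_prob(p_old):.3f}")
# pi_old=0.9   -> max pi_new=1.000  (upper clip barely binds)
# pi_old=0.1   -> max pi_new=0.120
# pi_old=0.01  -> max pi_new=0.012  (explore tokens can barely move up)
```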
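For the precision point, a quick way to look for the bias is to compare the same token log-probs before and after a BF16 round-trip. This is a toy diagnostic only: the real mismatch involves full BF16 forward passes in the sampler versus the trainer, which this stand-in merely caricatures, and `ratio_bias_report` is an illustrative helper rather than anything from the cited paper:

```python
import torch

def ratio_bias_report(logp_train: torch.Tensor, logp_sampler: torch.Tensor) -> None:
    """With no parameter update in between, the importance ratio pi_train/pi_sampler
    should be exactly 1; systematic drift above 1 means the upper PPO clip fires
    more often than the lower one."""
    ratio = torch.exp(logp_train.double() - logp_sampler.double())
    print(f"mean ratio     : {ratio.mean().item():.6f}")
    print(f"frac ratio > 1 : {(ratio > 1).float().mean().item():.3f}")

# Toy stand-in for the two passes: FP32 token log-probs vs the same values after
# a BF16 round-trip.  In a real run these would be trainer vs vLLM log-probs.
torch.manual_seed(0)
logp_fp32 = -10.0 + 0.5 * torch.randn(100_000)
logp_bf16 = logp_fp32.to(torch.bfloat16).float()
ratio_bias_report(logp_train=logp_bf16, logp_sampler=logp_fp32)
```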
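Finally, the tilted closed form of the KL-regularized optimum makes the collapse easy to see on a toy categorical example. A sketch assuming a binary verifiable reward over a handful of candidate responses; all probabilities are made up:

```python
import numpy as np

def kl_reg_optimum(pi_ref: np.ndarray, reward: np.ndarray, beta: float) -> np.ndarray:
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref):
    pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    w = pi_ref * np.exp(reward / beta)
    return w / w.sum()

def entropy(p: np.ndarray) -> float:
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Toy setup: 6 candidate responses, 3 correct (reward 1) and 3 wrong (reward 0).
pi_ref = np.array([0.30, 0.25, 0.20, 0.15, 0.07, 0.03])
reward = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])

print(f"reference entropy: {entropy(pi_ref):.3f}")
for beta in (1.0, 0.1, 0.01):
    p = kl_reg_optimum(pi_ref, reward, beta)
    print(f"beta={beta:<4}  optimum entropy={entropy(p):.3f}  top prob={p.max():.3f}")
# As beta shrinks, the optimum sheds all mass on incorrect responses and its
# entropy drops well below the reference's: the optimization target is low-entropy.
```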
References
- [1] The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (Cui et al., 2025)
- [2] DAPO: An Open-Source LLM Reinforcement Learning System at Scale (Yu et al., 2025)
- [3] Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (Yue et al., 2025)
- [4] KL-Regularized Reinforcement Learning is Designed to Mode Collapse (GX-Chen et al., 2025)
- [5] Maximum Likelihood Reinforcement Learning (Tajwar et al., 2026)
- [6] Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models (Chen et al., 2025)
- [7] Understanding the Effects of RLHF on LLM Generalisation and Diversity (Kirk et al., 2024)
- [8] Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs (Wen et al., 2025)
- [9] Beyond Pass@1: Self-play with Variational Problem Synthesis Sustains RLVR (Liang et al., 2025)
- [10] Entropy-preserving Reinforcement Learning (Petrenko et al., 2026)
- [11] Delightful Policy Gradient (Osband, 2026)