
You Can't Learn From Zero Rewards

1 min read

In RL, GRPO trains models by comparing responses to the same prompt. It computes an advantage: how much better or worse each response is compared to the group's average reward.

If all responses get the same reward, the advantage is zero. Zero advantage means zero gradient. Zero gradient means no learning.
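To make that concrete, here's a minimal sketch of a group-relative advantage computation. It's a simplified illustration under my own naming, not any particular library's implementation, but it shows the failure mode: identical rewards produce all-zero advantages.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to the group mean,
    normalized by the group's standard deviation."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed outcomes -> nonzero advantages
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # all succeed -> all zeros, nothing to learn
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # all fail -> all zeros, nothing to learn
```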

All success teaches you nothing. You can't tell luck from skill. You don't know what matters. No gradient to follow.

Sadder still, all failure teaches you nothing either. You have no signal about which failures came closer to success, and the gradient is still zero. Stories from failed projects or experiments offer very limited information on their own.

In RL, DAPO solves this by filtering out prompts where everything succeeded or everything failed. It trains only on the prompts with reward variance.
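A rough sketch of that filtering idea, assuming a hypothetical prompt_groups mapping from prompt to sampled rewards (this illustrates the principle, not DAPO's actual code): keep only the prompts whose responses don't all share the same reward.

```python
def has_signal(rewards):
    # A prompt carries gradient signal only if its rewards vary.
    return max(rewards) != min(rewards)

prompt_groups = {
    "p1": [1.0, 1.0, 1.0, 1.0],  # all correct -> filtered out
    "p2": [0.0, 0.0, 0.0, 0.0],  # all wrong -> filtered out
    "p3": [1.0, 0.0, 1.0, 0.0],  # mixed -> kept for training
}

trainable = {p: r for p, r in prompt_groups.items() if has_signal(r)}
print(list(trainable))  # ['p3']
```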

What teaches is contrast. You need to see some paths succeed and others fail to understand which differences mattered. Without variance, failure is just noise. You're wandering in the dark with no sense of whether you're getting warmer or colder.