You Can't Learn From Zero Rewards
In RL, GRPO trains models by sampling a group of responses to the same prompt and comparing them against each other. Each response gets an advantage: how much better or worse it did than the group average.
If all responses get the same reward, the advantage is zero. Zero advantage means zero gradient. Zero gradient means no learning.
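To make that concrete, here is a minimal sketch of a group-relative advantage computation in the spirit of GRPO. The function name, the exact normalization, and the example rewards are illustrative, not the canonical implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Advantages for one group of responses to the same prompt (illustrative sketch).

    Each response's advantage is its reward minus the group mean, scaled by the
    group standard deviation.
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Mixed rewards: responses above the mean get positive advantages, the rest negative.
print(group_relative_advantages(torch.tensor([1.0, 0.0, 1.0, 0.0])))

# Identical rewards: every advantage is zero, so the policy gradient is zero too.
print(group_relative_advantages(torch.tensor([1.0, 1.0, 1.0, 1.0])))
```

Running the second case shows the failure mode directly: with no spread in rewards, every advantage collapses to zero and the update does nothing.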
The same principle applies beyond RL. Uniform success teaches you nothing: you can't tell luck from skill, and you don't know which choices mattered. There is no gradient to follow.
But uniform failure teaches you nothing either. You have no signal about which failures came closer to success, so the gradient is still zero. That's why stories from failed projects or experiments, on their own, offer very limited information.
In RL, DAPO solves this by filtering out prompts where every sampled response succeeded or every one failed, and training only on prompts whose responses show variance.
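A rough sketch of that filtering rule, assuming binary success/failure rewards; the helper name and the example batch are hypothetical:

```python
def keep_prompt(rewards: list[float]) -> bool:
    """Keep a prompt only if its sampled responses disagree.

    With binary rewards, an all-success or all-failure group has zero reward
    variance and therefore zero advantage, so it contributes no gradient.
    """
    return not all(r == rewards[0] for r in rewards)

batch = {
    "prompt_a": [1.0, 0.0, 1.0, 1.0],  # mixed outcomes -> kept for training
    "prompt_b": [1.0, 1.0, 1.0, 1.0],  # all succeeded  -> filtered out
    "prompt_c": [0.0, 0.0, 0.0, 0.0],  # all failed     -> filtered out
}

training_batch = {prompt: r for prompt, r in batch.items() if keep_prompt(r)}
print(training_batch)  # only prompt_a survives
```

Only the prompt with mixed outcomes makes it into the training batch; the other two would have produced zero advantages anyway.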
What teaches is contrast. You need to see some paths succeed and others fail to understand which differences mattered. Without that, you're just wandering in the dark.