We can't learn from zero rewards
· 1 min read
In RL, GRPO trains models by comparing responses to the same prompt. It computes an advantage: how much better or worse each response is compared to the average.
If all responses get the same reward, the advantage is zero. Zero advantage means zero gradient, which means no learning.
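This is easy to see in code. A minimal sketch of a group-relative advantage (function name and rewards are hypothetical, and real GRPO also normalizes by the group's standard deviation):

```python
def group_advantages(rewards):
    """Advantage of each response = its reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# Mixed outcomes: nonzero advantages, so there is a gradient to follow.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # → [0.5, -0.5, 0.5, -0.5]

# Uniform outcomes: every advantage is zero, so no learning signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # → [0.0, 0.0, 0.0, 0.0]
```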
I think the same principle applies beyond RL. If we only experience success, we learn nothing, because we can't tell luck from skill. We don't know what matters. No gradient to follow.
But if we only experience failure, we learn nothing either. We have no signal about which failures were closer to success. The gradient is still zero. Stories from failed projects or experiments offer very limited information.
In RL, DAPO solves this by filtering out prompts where everything succeeded or everything failed. Only train on the ones with variance.
We learn the most from contrast: seeing some paths succeed and others fail helps us understand which differences mattered.