
You can't learn from zero rewards


In RL, GRPO trains models by comparing responses to the same prompt. It computes an advantage for each response: how much better or worse it scored than the group's average reward.

If all responses get the same reward, the advantage is zero. Zero advantage means zero gradient. Zero gradient means no learning.
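A minimal sketch of that group-relative advantage, assuming a list of scalar rewards for responses to one prompt (normalized by the group's standard deviation, as GRPO does):

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its group's mean reward."""
    rewards = np.asarray(rewards, dtype=float)
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# All responses got the same reward: every advantage is zero -> no gradient.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0. 0. 0. 0.]

# Mixed outcomes give a non-zero signal to follow.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [ 1. -1.  1. -1.]
```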

I think the same principle applies beyond RL. If I only experience success, I learn nothing. I can't tell luck from skill. I don't know what matters. No gradient to follow.

But if I only experience failure, I learn nothing either. I have no signal about which failures were closer to success. The gradient is still zero. Stories from failed projects or experiments offer me very limited information.

In RL, DAPO solves this by filtering out prompts where everything succeeded or everything failed. Only train on the ones with variance.
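A hedged sketch of that filtering step, assuming each prompt comes with its group's per-response rewards; the idea is simply to keep a prompt only when its responses disagree:

```python
def has_learning_signal(rewards):
    """A prompt is useful only if its responses don't all share the same reward."""
    return len(set(rewards)) > 1

batch = {
    "prompt_a": [1.0, 1.0, 1.0, 1.0],  # all succeeded -> zero advantage, filtered out
    "prompt_b": [0.0, 0.0, 0.0, 0.0],  # all failed    -> zero advantage, filtered out
    "prompt_c": [1.0, 0.0, 1.0, 0.0],  # mixed         -> kept for training
}

train_batch = {p: r for p, r in batch.items() if has_learning_signal(r)}
print(list(train_batch))  # ['prompt_c']
```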

We learn the most from contrast: seeing some paths succeed and others fail is what reveals which differences mattered.