DeepSeek GRPO Explanation (Why do we need it? How does it work? What are the findings?)

Anni Huang

GRPO: Efficient RLHF via Group Relative Policy Optimization (first introduced in the DeepSeekMath paper)

Why GRPO?

  • Problem with PPO: Requires a separate value (critic) model, making it slow, memory-intensive, and hard to tune in large-scale RLHF.
  • GRPO’s Advantage: A compute-efficient variant of PPO that improves stability and speed.

How GRPO Works

  1. Group-Based Relative Rewards

    • Samples a group of responses per prompt → compares them relative to each other, not against absolute value estimates (see the sketch after this list).
    • Normalizes rewards within each group → reduces variance in the advantage estimates.
  2. Reference Model as Regularizer

    • Keeps a frozen reference policy and penalizes KL divergence from it → stabilizes updates.
  3. Memory-Optimized Training

    • Drops PPO’s separate value (critic) model → substantially less memory and compute per update.
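
The group-relative advantage from step 1 takes only a few lines. A minimal sketch, assuming PyTorch; the function name is illustrative rather than taken from the DeepSeekMath code:

    import torch

    def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # rewards: shape (G,), one scalar reward per sampled response to the same prompt.
        # The group mean acts as the baseline, so no learned value model is needed.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Example: a group of 4 responses scored by a reward model.
    rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
    print(group_relative_advantages(rewards))  # above-average responses get positive advantages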

Loss Function Comparison

PPO (Proximal Policy Optimization) Loss:

L_PPO = E[ min( r_t A_t, clip(r_t, 1-ε, 1+ε) A_t ) ]
where:

  • r_t = π_new(a|s) / π_old(a|s) (probability ratio)
  • A_t = Advantage estimate (reward minus a learned value/critic baseline)
  • ε = clipping range (e.g., 0.1 or 0.2)
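
For comparison later, here is a minimal sketch of this clipped surrogate, assuming PyTorch and per-action log-probabilities as inputs (names are illustrative):

    import torch

    def ppo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                 advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
        # advantages (A_t) come from a learned value (critic) baseline in standard PPO.
        ratio = torch.exp(logp_new - logp_old)              # r_t
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(r_t, 1-eps, 1+eps)
        # Minimize the negative of the clipped surrogate to maximize the objective.
        return -torch.min(ratio * advantages, clipped * advantages).mean()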

GRPO (Group Relative Policy Optimization) Loss:

L_GRPO = E[ min( r_t A_group, clip(r_t, 1-ε, 1+ε) A_group ) ] - β * KL(π_new || π_ref)
where:

  • A_group = (R_i - mean(R_group)) / std(R_group), each response’s reward normalized within its group
  • The group mean replaces PPO’s learned per-sample (critic) baseline
  • π_ref = frozen reference policy; the KL penalty (weight β) keeps π_new from drifting away from it
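
Putting the pieces together, a minimal sketch of a GRPO-style loss, assuming PyTorch, sequence-level log-probabilities per response, and an illustrative KL weight β; none of the names come from an official implementation:

    import torch

    def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor, logp_ref: torch.Tensor,
                  group_rewards: torch.Tensor, eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
        # All tensors have shape (G,): one entry per response in the group (per-token terms omitted for brevity).
        adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)    # A_group
        ratio = torch.exp(logp_new - logp_old)                                         # r_t
        surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
        # Unbiased estimate of KL(pi_new || pi_ref): keeps the policy close to the frozen reference.
        kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
        return -(surrogate - beta * kl).mean()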

Key Benefits

Faster Training: each update is cheaper than PPO’s, since there is no critic to run or optimize.
Lower Memory: no separate value model to store and update.
Stable Learning: group normalization plus the KL penalty resist reward hacking (e.g., used in DeepSeek-Coder).

Why It Matters: GRPO enables scalable RLHF without sacrificing performance—ideal for aligning large models.


Key Differences in Plain Text

  1. PPO estimates the advantage A_t with a learned value model (critic), while GRPO computes A_group by normalizing rewards within a group of responses to the same prompt.
  2. GRPO removes PPO’s critic entirely (the group mean serves as the baseline), simplifying training and cutting memory; it keeps the clipped ratio and adds a KL penalty toward the reference policy.
  3. GRPO’s within-group normalization reduces reward variance compared with PPO’s per-sample advantage estimates.