DeepSeek GRPO Explanation (Why do we need it? How does it work? What are the findings?)

Anni Huang

GRPO: Efficient RLHF via Group Relative Policy Optimization (first introduced in the DeepSeekMath paper)

Why GRPO?

  • Problem with PPO: Requires a separate value (critic) model, making it slow, memory-intensive, and hard to tune in large-scale RLHF.
  • GRPO’s Advantage: A compute-efficient variant of PPO that improves stability and speed.

How GRPO Works

  1. Group-Based Relative Rewards

    • Samples a group of responses per prompt → compares them relative to each other, not against absolute value estimates (see the sketch after this list).
    • Normalizes rewards within each group → reduces variance in the advantage estimates.
  2. Reference Model as Regularizer

    • Keeps a frozen reference policy and penalizes KL divergence from it → stabilizes updates.
  3. Memory-Optimized Training

    • Drops PPO’s separate value (critic) model → substantially less memory and compute per update.
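
The group-relative advantage from step 1 takes only a few lines. A minimal sketch, assuming PyTorch; the function name is illustrative rather than taken from the DeepSeekMath code:

    import torch

    def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
        # rewards: shape (G,), one scalar reward per sampled response to the same prompt.
        # The group mean acts as the baseline, so no learned value model is needed.
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Example: a group of 4 responses scored by a reward model.
    rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
    print(group_relative_advantages(rewards))  # above-average responses get positive advantages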

Loss Function Comparison

PPO (Proximal Policy Optimization) Loss:

L_PPO = E[ min( r_t A_t, clip(r_t, 1-ε, 1+ε) A_t ) ]
where:

  • r_t = π_new(a|s) / π_old(a|s) (probability ratio)
  • A_t = Advantage estimate (reward minus a learned value/critic baseline)
  • ε = clipping range (e.g., 0.1 or 0.2)
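
For comparison later, here is a minimal sketch of this clipped surrogate, assuming PyTorch and per-action log-probabilities as inputs (names are illustrative):

    import torch

    def ppo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                 advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
        # advantages (A_t) come from a learned value (critic) baseline in standard PPO.
        ratio = torch.exp(logp_new - logp_old)              # r_t
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # clip(r_t, 1-eps, 1+eps)
        # Minimize the negative of the clipped surrogate to maximize the objective.
        return -torch.min(ratio * advantages, clipped * advantages).mean()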

GRPO (Group Relative Policy Optimization) Loss:

L_GRPO = E[ min( r_t A_group, clip(r_t, 1-ε, 1+ε) A_group ) ] - β * KL(π_new || π_ref)
where:

  • A_group = (R_i - mean(R_group)) / std(R_group), each response’s reward normalized within its group
  • The group mean replaces PPO’s learned per-sample (critic) baseline
  • π_ref = frozen reference policy; the KL penalty (weight β) keeps π_new from drifting away from it
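
Putting the pieces together, a minimal sketch of a GRPO-style loss, assuming PyTorch, sequence-level log-probabilities per response, and an illustrative KL weight β; none of the names come from an official implementation:

    import torch

    def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor, logp_ref: torch.Tensor,
                  group_rewards: torch.Tensor, eps: float = 0.2, beta: float = 0.04) -> torch.Tensor:
        # All tensors have shape (G,): one entry per response in the group (per-token terms omitted for brevity).
        adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)    # A_group
        ratio = torch.exp(logp_new - logp_old)                                         # r_t
        surrogate = torch.min(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
        # Unbiased estimate of KL(pi_new || pi_ref): keeps the policy close to the frozen reference.
        kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
        return -(surrogate - beta * kl).mean()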

Key Benefits

Faster Training: each update is cheaper than PPO’s, since there is no critic to run or optimize.
Lower Memory: no separate value model to store and update.
Stable Learning: group normalization plus the KL penalty resist reward hacking (e.g., used in DeepSeek-Coder).

Why It Matters: GRPO enables scalable RLHF without sacrificing performance—ideal for aligning large models.


Key Differences in Plain Text

  1. PPO estimates the advantage A_t with a learned value model (critic), while GRPO computes A_group by normalizing rewards within a group of responses to the same prompt.
  2. GRPO removes PPO’s critic entirely (the group mean serves as the baseline), simplifying training and cutting memory; it keeps the clipped ratio and adds a KL penalty toward the reference policy.
  3. GRPO’s within-group normalization reduces reward variance compared with PPO’s per-sample advantage estimates.