Beyond Pre-training: The Power of RLHF in LLM Alignment

Anni Huang

Pre-training uses massive datasets and computational resources—often thousands of GPUs running for weeks or months—making it a domain dominated by top AI companies.

Post-training is much lighter in cost and time (often days instead of months) and focuses on aligning the model for safety, helpfulness, and personalization, and on specializing it (e.g., into reasoning models such as OpenAI's o-series).

A common post-training sequence is: SFT → RLHF (DPO / PPO / GRPO)
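To make the sequence concrete, here is a minimal sketch of the SFT → DPO stages using Hugging Face's TRL library. The base model name, the local data files (sft_demos.jsonl, preference_pairs.jsonl), and the hyperparameters are placeholders, and exact argument names (e.g., processing_class vs. tokenizer) differ across TRL versions.

```python
# Minimal SFT -> DPO sketch with Hugging Face TRL (argument names vary by TRL version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B"  # any small base model works for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on demonstrations (one "text" column per example).
sft_data = load_dataset("json", data_files="sft_demos.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft_out"),
    train_dataset=sft_data,
)
sft_trainer.train()

# Stage 2: DPO on preference pairs ("prompt", "chosen", "rejected" columns).
pref_data = load_dataset("json", data_files="preference_pairs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="dpo_out", beta=0.1),
    train_dataset=pref_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
```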


RLHF Algorithms: Key Types, Models & Trade-offs

| Algorithm | Representative Models | Key Idea | Pros | Cons |
| --- | --- | --- | --- | --- |
| DPO (Direct Preference Optimization) | DeepSeek 7B/67B Base, Qwen2, Llama 3 Herd, Mixtral 8x7B | Optimizes directly from human preference pairs without a reward model | Simpler pipeline, less compute than PPO | Risk of overfitting to preference data (e.g., coding performance drop after training cutoff on LiveCodeBench) |
| PPO (Proximal Policy Optimization) | GPT series, Llama 3 Chat, Claude | Uses a reward model to guide optimization | Proven stability, widely used in production | Higher compute and memory cost |
| GRPO (Group Relative Policy Optimization) | DeepSeekMath 7B, DeepSeek-R1-Zero, DeepSeek-R1, SeedCoder | Efficiency-focused variant of PPO | ~50% less memory and compute vs. PPO | Less mature than PPO in industry adoption |

🔹 DPO (Direct Preference Optimization)

  • Examples: DeepSeek 7B/67B Base, Qwen2, Llama 3 Herd, Mixtral 8x7B
  • Works directly from human preference pairs (good vs. bad answers), skipping the reward model; see the sketch after this list.
  • Pros: Simpler pipeline, less compute required than PPO.
  • Cons: Can overfit to preference data—e.g., LiveCodeBench found DeepSeek’s coding performance dropped sharply after its training cutoff.
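As a rough sketch of the idea (not any particular model's training code), the DPO loss only needs per-sequence log-probabilities of the chosen and rejected answers under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed per-sequence log-probs (tensors of shape [batch])."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the chosen answer more strongly than the reference does.
    logits = beta * (policy_logratio - ref_logratio)
    return -F.logsigmoid(logits).mean()
```

Because there is no reward model and no on-policy sampling loop, DPO training is a single supervised-style pass over the preference pairs, which is where the simplicity and compute savings come from.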

🔹 PPO (Proximal Policy Optimization)

  • Examples: GPT series, Llama 3 Chat, Claude (it's speculated that Kimi K2 also uses PPO).
  • Uses a separate reward model to guide optimization; see the sketch after this list.
  • Pros: Proven stability, widely adopted in production systems.
  • Cons: More compute- and memory-intensive than DPO.
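The core of PPO is the clipped surrogate objective. Below is a minimal sketch of just that piece, omitting the reward model, the value (critic) network, and the KL penalty that a full RLHF-PPO loop also needs:

```python
import torch

def ppo_clip_loss(logps, old_logps, advantages, clip_eps=0.2):
    """Clipped surrogate loss on per-token (or per-sequence) log-probs."""
    ratio = torch.exp(logps - old_logps)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (clipped) side so each update stays close to the old policy.
    return -torch.min(unclipped, clipped).mean()
```

In RLHF, the advantages are derived from reward-model scores (plus a KL penalty to the SFT model) and a learned value network, which is what drives PPO's extra memory and compute cost.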

🔹 GRPO (Group Relative Policy Optimization)

  • Examples: DeepSeekMath 7B, DeepSeek-R1-Zero, DeepSeek-R1, SeedCoder
  • An efficiency-focused variant of PPO that replaces the learned value (critic) model with rewards normalized within a group of sampled answers; see the sketch after this list.
  • Pros: Cuts memory and compute needs by ~50% compared to PPO.
  • Cons: Less mature than PPO in industry adoption.
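A minimal sketch of the group-relative advantage that replaces PPO's value model (the policy update itself still uses a PPO-style clipped objective):

```python
import torch

def grpo_advantages(group_rewards):
    """group_rewards: tensor of shape [G], rewards for G sampled answers to one prompt."""
    # Normalize within the group: no value network is needed to estimate a baseline.
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + 1e-8)

# Example: four sampled answers, two of which were judged correct.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```

Dropping the critic network is the main source of the reported memory and compute savings.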

Key takeaway

  • Pre-training builds the foundation.
  • Post-training shapes the personality, safety, and performance—where RLHF algorithms like DPO, PPO, and GRPO come into play.

Written by

Anni Huang

I’m Anni Huang, an AI researcher-in-training currently at ByteDance, specializing in LLM training operations with a coding focus. I bridge the gap between engineering execution and model performance, ensuring the quality, reliability, and timely delivery of large-scale training projects.