Beyond Pre-training: The Power of RLHF in LLM Alignment

Anni Huang

Pre-training uses massive datasets and computational resources—often thousands of GPUs running for weeks or months—making it a domain dominated by top AI companies.

Post-training is much lighter in cost and time (often days instead of months) and focuses on aligning the model for safety, helpfulness, and personalization, and on specializing it (e.g., into reasoning models such as OpenAI's o-series).

A common post-training sequence is: SFT → RLHF (DPO / PPO / GRPO)
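To make the sequence concrete, here is a minimal sketch of the SFT → DPO stages using Hugging Face's TRL library. The base model name, the local data files (sft_demos.jsonl, preference_pairs.jsonl), and the hyperparameters are placeholders, and exact argument names (e.g., processing_class vs. tokenizer) differ across TRL versions.

```python
# Minimal SFT -> DPO sketch with Hugging Face TRL (argument names vary by TRL version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B"  # any small base model works for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Stage 1: supervised fine-tuning on demonstrations (one "text" column per example).
sft_data = load_dataset("json", data_files="sft_demos.jsonl", split="train")
sft_trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="sft_out"),
    train_dataset=sft_data,
)
sft_trainer.train()

# Stage 2: DPO on preference pairs ("prompt", "chosen", "rejected" columns).
pref_data = load_dataset("json", data_files="preference_pairs.jsonl", split="train")
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    args=DPOConfig(output_dir="dpo_out", beta=0.1),
    train_dataset=pref_data,
    processing_class=tokenizer,
)
dpo_trainer.train()
```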


RLHF Algorithms: Key Types, Models & Trade-offs

| Algorithm | Representative Models | Key Idea | Pros | Cons |
| --- | --- | --- | --- | --- |
| DPO (Direct Preference Optimization) | DeepSeek 7B/67B Base, Qwen2, Llama 3 Herd, Mixtral 8x7B | Optimizes directly from human preference pairs without a reward model | Simpler pipeline, less compute than PPO | Risk of overfitting to preference data (e.g., coding performance drop after training cutoff on LiveCodeBench) |
| PPO (Proximal Policy Optimization) | GPT series, Llama 3 Chat, Claude | Uses a reward model to guide optimization | Proven stability, widely used in production | Higher compute and memory cost |
| GRPO (Group Relative Policy Optimization) | DeepSeekMath 7B, DeepSeek-R1-Zero, DeepSeek-R1, SeedCoder | Efficiency-focused variant of PPO | ~50% less memory and compute vs. PPO | Less mature than PPO in industry adoption |

🔹 DPO (Direct Preference Optimization)

  • Examples: DeepSeek 7B/67B Base, Qwen2, Llama 3 Herd, Mixtral 8x7B
  • Works directly from human preference pairs (good vs. bad answers), skipping the reward model; see the sketch after this list.
  • Pros: Simpler pipeline, less compute required than PPO.
  • Cons: Can overfit to preference data—e.g., LiveCodeBench found DeepSeek’s coding performance dropped sharply after its training cutoff.
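As a rough sketch of the idea (not any particular model's training code), the DPO loss only needs per-sequence log-probabilities of the chosen and rejected answers under the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss from summed per-sequence log-probs (tensors of shape [batch])."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # Push the policy to prefer the chosen answer more strongly than the reference does.
    logits = beta * (policy_logratio - ref_logratio)
    return -F.logsigmoid(logits).mean()
```

Because there is no reward model and no on-policy sampling loop, DPO training is a single supervised-style pass over the preference pairs, which is where the simplicity and compute savings come from.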

🔹 PPO (Proximal Policy Optimization)

  • Examples: GPT series, Llama 3 Chat, Claude (it's speculated that Kimi K2 also uses PPO).
  • Uses a separate reward model to guide optimization; see the sketch after this list.
  • Pros: Proven stability, widely adopted in production systems.
  • Cons: More compute- and memory-intensive than DPO.
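The core of PPO is the clipped surrogate objective. Below is a minimal sketch of just that piece, omitting the reward model, the value (critic) network, and the KL penalty that a full RLHF-PPO loop also needs:

```python
import torch

def ppo_clip_loss(logps, old_logps, advantages, clip_eps=0.2):
    """Clipped surrogate loss on per-token (or per-sequence) log-probs."""
    ratio = torch.exp(logps - old_logps)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (clipped) side so each update stays close to the old policy.
    return -torch.min(unclipped, clipped).mean()
```

In RLHF, the advantages are derived from reward-model scores (plus a KL penalty to the SFT model) and a learned value network, which is what drives PPO's extra memory and compute cost.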

🔹 GRPO (Group Relative Policy Optimization)

  • Examples: DeepSeekMath 7B, DeepSeek-R1-Zero, DeepSeek-R1, SeedCoder
  • An efficiency-focused variant of PPO that replaces the learned value (critic) model with rewards normalized within a group of sampled answers; see the sketch after this list.
  • Pros: Cuts memory and compute needs by ~50% compared to PPO.
  • Cons: Less mature than PPO in industry adoption.
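A minimal sketch of the group-relative advantage that replaces PPO's value model (the policy update itself still uses a PPO-style clipped objective):

```python
import torch

def grpo_advantages(group_rewards):
    """group_rewards: tensor of shape [G], rewards for G sampled answers to one prompt."""
    # Normalize within the group: no value network is needed to estimate a baseline.
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + 1e-8)

# Example: four sampled answers, two of which were judged correct.
adv = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```

Dropping the critic network is the main source of the reported memory and compute savings.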

Key takeaway

  • Pre-training builds the foundation.
  • Post-training shapes the personality, safety, and performance—where RLHF algorithms like DPO, PPO, and GRPO come into play.

Written by

Anni Huang

I’m Anni Huang, an AI researcher-in-training currently at ByteDance, specializing in LLM training operations with a coding focus. I bridge the gap between engineering execution and model performance, ensuring the quality, reliability, and timely delivery of large-scale training projects.