RLHF-PPO: Reinforcement Learning from Human Feedback
Christiano et al., "Deep Reinforcement Learning from Human Preferences," NeurIPS, 2017. Ouyang et al., "Training language models to follow instructions with human feedback," NeurIPS, 2022 (InstructGPT). Stiennon et al., "Learning to summarize from human feedback," NeurIPS, 2020.
Key Idea
RLHF-PPO is a three-stage pipeline: (1) supervised fine-tuning (SFT) on demonstration data; (2) training a reward model (RM) on human preference comparisons; (3) optimizing the SFT model with PPO, using the learned RM as the reward signal and a KL penalty that keeps the policy close to the SFT model. This pipeline was used to train InstructGPT and ChatGPT, as well as early versions of Claude.
Mathematical Formulation
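Stage 1 (SFT):
L_SFT(θ) = -E_{(x, y)~D_demo} [ Σ_t log π_θ(y_t | x, y_{<t}) ]
The resulting model serves as the reference policy π_ref in Stage 3.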
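Stage 2 (Reward Model, Bradley-Terry):
P(y_w ≻ y_l | x) = σ( r_φ(x, y_w) - r_φ(x, y_l) )
L_RM(φ) = -E_{(x, y_w, y_l)~D_pref} [ log σ( r_φ(x, y_w) - r_φ(x, y_l) ) ]
Here y_w is the human-preferred response, y_l the rejected one, and σ the logistic function.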
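To make the Stage 2 objective concrete, here is a minimal sketch of a Bradley-Terry pairwise loss in plain Python. It assumes the standard form, the negative log-sigmoid of the score gap between the preferred and rejected response. The function names and the use of scalar scores in place of a real reward model are illustrative assumptions.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_probability(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry probability that the chosen response beats the rejected one."""
    return math.exp(log_sigmoid(chosen_score - rejected_score))


def bradley_terry_loss(chosen_scores, rejected_scores) -> float:
    """Mean negative log-likelihood over a batch of preference pairs.

    The scores stand in for r_phi(x, y_w) and r_phi(x, y_l); in a real
    pipeline they would come from the reward model (assumption).
    """
    if len(chosen_scores) != len(rejected_scores) or not chosen_scores:
        raise ValueError("need two non-empty score lists of equal length")
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_scores, rejected_scores)]
    return sum(losses) / len(losses)
```

The loss falls as the reward model scores the preferred response higher than the rejected one, and it depends only on the score difference, so the absolute scale of r_φ is not identified by this objective.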
Stage 3 (PPO with KL penalty):
R_total(x, y) = r_φ(x, y) - β · log( π_θ(y|x) / π_ref(y|x) )
Objective: max_θ E_{x~D, y~π_θ(·|x)} [ R_total(x, y) ]
         = max_θ E_{x~D} [ E_{y~π_θ(·|x)} [ r_φ(x, y) ] - β · D_KL(π_θ(·|x) || π_ref(·|x)) ]
Optimized using PPO with:
- Policy = the language model π_θ
- Critic = a value model V_ψ(x, y_{<t}) (often initialized from RM)
- Actions = tokens, states = (prompt, generated tokens so far)
- Reward = RM score assigned at the final token of the sequence; many implementations apply the KL penalty per token
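The token-level setup above can be sketched as follows. The first function spreads a per-token KL penalty over the sequence and adds the RM score at the last token. The second computes advantages from the critic's values using generalized advantage estimation (GAE), which is commonly paired with PPO but is an assumption here, not something this page specifies. All names and default values are illustrative.

```python
def shaped_token_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards for one sampled response.

    Each token gets -beta * (log pi_theta - log pi_ref); the scalar RM score
    is added to the final token only.
    """
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards


def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one finished sequence.

    values[t] is the critic's estimate V_psi before token t is emitted;
    the value after the final token is taken to be 0 (assumption).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Summing the shaped rewards over a sequence recovers the RM score minus β times the sequence-level log-ratio, which is the per-sample form of the KL-penalized reward.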
Properties
- On-policy (PPO), model-free
- Requires 4 models simultaneously: policy, reference, reward model, value model
- Three-stage pipeline (SFT → RM → RL)
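As a rough illustration of the on-policy PPO update listed above, the sketch below computes the clipped surrogate loss for a batch of tokens. It assumes the standard PPO-clip form; the function name, the plain-list inputs, and the default clip value are illustrative choices rather than details from this page.

```python
import math


def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate, averaged over tokens (lower is better).

    logp_old holds log-probs recorded when the rollouts were generated;
    logp_new holds log-probs under the policy currently being updated.
    """
    if not (len(logp_new) == len(logp_old) == len(advantages)) or not advantages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

The clip removes the incentive to move the ratio beyond 1 ± ε in the direction that improves the objective, which is what lets the same batch of rollouts be reused for a few update epochs.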
Key Hyperparameters
| Parameter | Typical Value | Notes |
|---|---|---|
| PPO ε | 0.1–0.2 | Clipping |
| β (KL penalty) | 0.01–0.2 | Adaptive in some impls |
| PPO epochs | 1–4 | Per batch of generations |
| RM size | Same as policy or smaller | |
| Batch size | 256–1024 prompts | |
| Temperature | 0.7–1.0 | For rollouts |
| Learning rate | 1e-6 to 5e-6 | |
| Reward normalization | Yes | Running mean/std |
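The table notes that β is adaptive in some implementations. One common approach is a proportional controller that nudges β toward a target KL value. The sketch below assumes that scheme; the class name, target KL, horizon, and the ±0.2 error clip are illustrative assumptions, not values taken from this page.

```python
class AdaptiveKLController:
    """Proportional controller for the KL coefficient beta (hypothetical helper)."""

    def __init__(self, init_beta=0.05, target_kl=6.0, horizon=10_000):
        self.beta = init_beta
        self.target_kl = target_kl  # desired KL between policy and reference
        self.horizon = horizon      # number of samples over which beta adapts

    def update(self, observed_kl, n_samples):
        """Raise beta when observed KL exceeds the target, lower it otherwise."""
        # Relative error, clipped so one noisy batch cannot swing beta too far.
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta
```

A typical call would be `controller.update(observed_kl=batch_kl, n_samples=batch_size)` once per PPO batch, where `batch_kl` and `batch_size` are whatever the training loop measures.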
Complexity
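- Memory: 4 full models. For a 7B model, the fp16 weights alone take ~56 GB (4 × 7B × 2 bytes); gradients, optimizer states for the trained policy and critic, and activations add more, so multi-GPU training is required
- Time: generation (~70%) + reward scoring (~5%) + PPO update (~25%); these are rough proportions that vary with sequence length and infrastructure
- Considerably more engineering effort than DPO or GRPO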
- Scales poorly — each component must scale with model size
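The memory figure above is simple arithmetic, sketched here so it can be redone for other model sizes. The weight estimate follows the 4-model, fp16 setup described in this section. The optimizer-state estimate assumes mixed-precision Adam at 12 bytes per parameter (fp32 master weights plus two fp32 moment buffers) for the two trained models, which is an assumption rather than something stated here; sharding, offloading, or parameter-efficient fine-tuning would change these numbers.

```python
GB = 1e9  # decimal gigabytes


def weights_gb(n_params, n_models=4, bytes_per_param=2):
    """Memory for model weights only (fp16 = 2 bytes per parameter)."""
    return n_models * n_params * bytes_per_param / GB


def adam_state_gb(n_params, n_trained=2, bytes_per_param=12):
    """Extra optimizer memory for the trained models (policy and critic).

    Assumes 4 bytes each for fp32 master weights, Adam first moment,
    and Adam second moment.
    """
    return n_trained * n_params * bytes_per_param / GB


if __name__ == "__main__":
    n = 7e9  # a 7B-parameter model
    print(f"weights: {weights_gb(n):.0f} GB")
    print(f"Adam state (assumed): {adam_state_gb(n):.0f} GB")
```

Under these assumptions the optimizer state for the two trained models is larger than the weights of all four models combined, and activations and generation caches come on top of both.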
Primary Use Cases
- InstructGPT / ChatGPT (OpenAI) — the original application
- Claude (Anthropic) — with Constitutional AI modifications
- Llama 2 (Meta) — RLHF with rejection sampling
- General-purpose LLM alignment for helpfulness and harmlessness
Known Limitations
- Extreme engineering complexity — 4 models, distributed training, generation loop
- Reward model bottleneck — can be hacked, gamed, or miscalibrated
- KL penalty tuning is delicate
- Proxy reward problem — RM can't capture all quality aspects
- Very expensive — 4× model memory overhead
- Mode collapse — policy converges to narrow high-reward outputs
- Error amplification — RM errors amplified by PPO optimization
- Being displaced by GRPO (reasoning) and DPO (alignment)
Major Variants
| Variant | Reference | Key Change |
|---|---|---|
| Constitutional AI | Bai et al., Anthropic 2022 | AI-generated feedback |
| ReST/RAFT | Gulcehre et al., 2023 (ReST); Dong et al., TMLR 2023 (RAFT) | Best-of-N filtering + SFT |
| RLHF + PRM | — | Step-level rewards for reasoning |
| RM ensembles | — | Robustness to reward hacking |
| Iterative RLHF | — | Continuous preference collection |
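The "Best-of-N filtering + SFT" idea in the table can be sketched in a few lines: sample several responses per prompt, keep the one the reward model scores highest, and use the kept pairs as SFT data. This is a simplified illustration, not the exact procedure of either paper; `sample_fn` and `reward_fn` are hypothetical callables standing in for the policy and the reward model.

```python
from typing import Callable, List, Tuple


def best_of_n_dataset(
    prompts: List[str],
    sample_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    n: int = 4,
) -> List[Tuple[str, str]]:
    """Build (prompt, best response) pairs for a later SFT pass."""
    if n < 1:
        raise ValueError("n must be at least 1")
    dataset = []
    for prompt in prompts:
        # Draw n candidate responses from the current policy.
        candidates = [sample_fn(prompt) for _ in range(n)]
        # Keep the candidate with the highest reward-model score.
        best = max(candidates, key=lambda y: reward_fn(prompt, y))
        dataset.append((prompt, best))
    return dataset
```

Because the selected pairs are trained on with an ordinary supervised loss, this route avoids the critic and the PPO update, at the cost of n generations per prompt.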
Relationship to Other Algorithms
- PPO is the RL algorithm used in Stage 3
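- DPO was designed to simplify this pipeline: it replaces Stages 2 and 3 with a single supervised loss on preference pairs, so no explicit reward model or RL loop is needed
- GRPO simplifies Stage 3 by removing the critic and using group-relative rewards as the baseline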
- As of 2026: RLHF-PPO is the "legacy" approach; GRPO and DPO increasingly preferred
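To show how the two alternatives differ from the pipeline above, here is a minimal sketch of the core computation in each. The DPO function assumes the standard pairwise loss on policy-versus-reference log-ratios; the GRPO function assumes advantages obtained by standardizing rewards within a group of responses to the same prompt. Function names, the default β, and the use of the population standard deviation are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]
```

DPO needs only log-probabilities from the policy and the frozen reference, so no reward model or rollout loop appears. GRPO still samples and scores responses, but the group mean takes the place of the learned critic.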
Industry Deployment
- OpenAI: InstructGPT, GPT-4
- Anthropic: Claude (with Constitutional AI)
- Meta: Llama 2 (combined with rejection sampling)
- Google DeepMind: Gemini (partial RLHF)
- Still used at frontier labs for non-reasoning alignment
- Open-source: TRL, OpenRLHF, DeepSpeed-Chat