PPO — Proximal Policy Optimization¶
Schulman et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347, 2017.
Key Idea¶
PPO constrains the policy update to a trust region by clipping the probability ratio between the new and old policy, preventing destructively large updates. It achieves performance comparable to TRPO while being far simpler to implement, requiring only first-order optimization. This makes it the de facto default for both classic RL and (until recently) LLM post-training.
Mathematical Formulation¶
Notation:
- π_θ — current policy parameterized by θ
- π_θ_old — policy before the update
- r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) — probability ratio
- Â_t — advantage estimate (typically GAE-λ)
- ε — clipping parameter
Clipped surrogate objective:
Full loss with value function and entropy bonus:
L(θ) = E_t [ L^CLIP(θ) - c₁ · L^VF(θ) + c₂ · S[π_θ](s_t) ]
where:
L^VF(θ) = (V_θ(s_t) - V_t^target)²
S[π](s) = -Σ_a π(a|s) log π(a|s)
Value function target (GAE):
Â_t^GAE(γ,λ) = Σ_{l=0}^{T-t} (γλ)^l · δ_{t+l}
δ_t = r_t + γ · V(s_{t+1}) - V(s_t)
V_t^target = Â_t^GAE + V(s_t)
Properties¶
- On-policy, model-free
- Actor-critic architecture (shared or separate backbones)
- Supports both continuous and discrete action spaces
Key Hyperparameters¶
| Parameter | Typical Value | Notes |
|---|---|---|
ε (clip) |
0.1–0.2 | 0.2 most common for classic RL |
γ |
0.99 | Discount factor |
λ (GAE) |
0.95 | Bias-variance tradeoff |
c₁ (VF coef) |
0.5 | Value loss weight |
c₂ (entropy coef) |
0.01 | Entropy bonus weight |
| Minibatch size | 64–4096 | Larger for LLM training |
| Epochs per rollout | 3–10 | Passes over collected data |
| Learning rate | 3e-4 (Adam) | Often linearly decayed |
Complexity¶
- Time per update: O(B × K × C_forward) — B=batch, K=epochs, C=forward/backward cost
- Memory: O(T × N × d_obs) for rollout buffer + model parameters
- Sample efficiency: Poor (on-policy — data discarded after each update)
Primary Use Cases¶
- Continuous control: MuJoCo locomotion (Humanoid, Ant, HalfCheetah)
- Game playing: OpenAI Five (Dota 2), hide-and-seek emergent behavior
- Robotics: Sim-to-real transfer (OpenAI Rubik's cube)
- LLM alignment: RLHF pipeline (InstructGPT, ChatGPT, Llama 2)
- LLM reasoning: Used in early DeepSeek-R1 experiments before GRPO
Known Limitations¶
- Sample inefficient — on-policy data used for only a few epochs then discarded
- Sensitive to hyperparameters — especially ε, learning rate, and number of epochs
- Implementation details matter — "37 Implementation Details" paper identified 13+ tricks PPO relies on
- Reward hacking — in LLM RLHF, PPO can overoptimize the reward model
- Critic overhead — for LLMs, requires a second model of comparable size
- Advantage estimation degrades in very long episodes
Major Variants¶
| Variant | Reference | Key Change |
|---|---|---|
| TRPO | Schulman et al., ICML 2015 | KL constraint (PPO predecessor) |
| PPO-Penalty | — | Adaptive KL penalty instead of clipping |
| APPO | Espeholt et al., ICML 2018 | Asynchronous distributed training |
| PPG | Cobbe et al., ICML 2021 | Separate phases for policy and value |
| Dual-clip PPO | — | Clips from both sides for negative advantages |
Relationship to Other Algorithms¶
- Direct ancestor of RLHF-PPO — same algorithm, different reward source
- GRPO was developed specifically to replace PPO for LLM training by removing the critic
- DPO bypasses PPO entirely by reformulating the RLHF objective
- MAPPO is the multi-agent extension
- Shares GAE advantage estimation with essentially all modern policy gradient methods
Industry Deployment¶
- OpenAI: Dota 2 (OpenAI Five), ChatGPT (RLHF-PPO), robotics
- DeepMind: Some game-playing agents
- Meta: Llama 2 RLHF training
- Robotics: Numerous companies for sim-to-real transfer
- Frameworks: Default in SB3, CleanRL, RLlib