PPO -- Proximal Policy Optimization

Intuition

PPO prevents destructively large policy updates by clipping the probability ratio between the new and old policy. This gives most of the stability benefits of TRPO's trust region constraint, but with a first-order optimizer and minimal implementation complexity. PPO is the default algorithm in rlox and the recommended starting point for most tasks.

Key Equations

The clipped surrogate objective:

\[ L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right] \]

where \(r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t | s_t)}\) is the importance sampling ratio.
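The clipped objective can be sketched in a few lines of NumPy. This is an illustrative helper, not part of the rlox API; it works from log-probabilities, since policies typically expose those rather than raw probabilities:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate objective L^CLIP, averaged over a batch."""
    # r_t = pi_new(a|s) / pi_old(a|s), computed in log space for stability
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # elementwise min, then mean: the objective to be maximized
    return np.minimum(unclipped, clipped).mean()
```

Note that when the new and old policies coincide, the ratio is 1 everywhere and the objective reduces to the mean advantage; clipping only engages once the ratio drifts outside \([1-\epsilon, 1+\epsilon]\).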

The full loss combines policy, value, and entropy terms:

\[ L(\theta) = -L^{CLIP}(\theta) + c_1 \, L^{VF}(\theta) - c_2 \, H[\pi_\theta] \]

Advantages are computed using Generalized Advantage Estimation (GAE):

\[ \hat{A}_t = \sum_{l=0}^{T-t} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
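The GAE recursion is most naturally computed by a single backward pass over the rollout, using \(\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}\). A minimal NumPy sketch (a hypothetical standalone helper; rlox computes this in its Rust data plane):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards:    r_0 .. r_{T-1}
    values:     V(s_0) .. V(s_{T-1})
    last_value: V(s_T), the bootstrap value for the final state
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # accumulate the (gamma * lambda)-discounted sum of residuals
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv
```

With \(\lambda = 1\) and \(\gamma = 1\) this reduces to Monte Carlo returns minus the value baseline; with \(\lambda = 0\) it reduces to the one-step TD residual.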

Pseudocode

algorithm PPO:
    initialize actor-critic network pi_theta, V_theta
    for iteration = 1, 2, ... do
        collect n_steps * n_envs transitions using pi_theta_old
        compute GAE advantages A_t and returns G_t = A_t + V(s_t) using Rust data plane
        normalize advantages (if enabled)

        for epoch = 1 to n_epochs do
            for minibatch in shuffle(rollout, batch_size) do
                r_t = pi_theta(a|s) / pi_old(a|s)
                L_clip = min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)
                L_vf = (V_theta(s) - G_t)^2
                L_ent = -H[pi_theta]
                loss = -L_clip + vf_coef * L_vf + ent_coef * L_ent
                update theta with Adam, clip gradients to max_grad_norm
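The inner-loop update above can be sketched in PyTorch. This is an illustrative minibatch step, not rlox's internal implementation; `policy` is assumed to be a module returning an action distribution and a value estimate, and value-loss clipping (enabled by default via `clip_vloss`) is omitted for brevity:

```python
import torch

def ppo_update(policy, optimizer, obs, actions, logp_old, adv, returns,
               clip_eps=0.2, vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5):
    """One PPO minibatch update; returns the scalar loss."""
    dist, value = policy(obs)
    logp = dist.log_prob(actions)
    ratio = torch.exp(logp - logp_old)

    # clipped surrogate objective (to maximize, hence negated in the loss)
    l_clip = torch.min(ratio * adv,
                       torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
    # squared-error value loss against the returns G_t
    l_vf = ((value - returns) ** 2).mean()
    # entropy bonus encourages exploration
    entropy = dist.entropy().mean()

    loss = -l_clip + vf_coef * l_vf - ent_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```

Gradient clipping to `max_grad_norm` happens between `backward()` and `step()`, matching the last line of the pseudocode.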

Quick Start

from rlox import Trainer

trainer = Trainer("ppo", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)
print(f"Mean reward: {metrics['mean_reward']:.1f}")

For continuous control (MuJoCo):

trainer = Trainer("ppo", env="HalfCheetah-v4", seed=42, config={
    "learning_rate": 3e-4,
    "n_steps": 2048,
    "n_epochs": 10,
    "batch_size": 64,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "normalize_obs": True,
    "normalize_rewards": True,
})
metrics = trainer.train(total_timesteps=1_000_000)

Hyperparameters

All defaults from PPOConfig:

Parameter Default Description
n_envs 8 Number of parallel environments
n_steps 128 Rollout length per environment per update
n_epochs 4 SGD passes over each rollout
batch_size 256 Minibatch size for SGD
learning_rate 2.5e-4 Adam learning rate
clip_eps 0.2 PPO clipping range for probability ratio
vf_coef 0.5 Value loss coefficient
ent_coef 0.01 Entropy bonus coefficient
max_grad_norm 0.5 Maximum gradient norm for clipping
gamma 0.99 Discount factor
gae_lambda 0.95 GAE lambda
normalize_advantages True Normalize advantages per minibatch
clip_vloss True Clip value function loss
anneal_lr True Linearly anneal learning rate
normalize_rewards False Running reward normalization
normalize_obs False Running observation normalization

When to Use

  • Use PPO when: you want a reliable, general-purpose algorithm that works across discrete and continuous action spaces with minimal tuning.
  • Do not use PPO when: sample efficiency is critical (prefer SAC or TD3 for continuous control) or you need hard trust-region guarantees (prefer TRPO).

References

  • Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.
  • Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438.