
TRPO -- Trust Region Policy Optimization

Intuition

TRPO directly constrains the KL divergence between successive policies, guaranteeing monotonic improvement (in theory). It solves a constrained optimization problem using conjugate gradients and a backtracking line search, avoiding the need to choose a clipping parameter. TRPO is more principled than PPO but significantly more expensive per update due to the second-order optimization.

Key Equations

The constrained optimization problem:

\[ \max_\theta \; \mathbb{E}_t \left[ \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t | s_t)} \hat{A}_t \right] \]
\[ \text{subject to} \quad \mathbb{E}_t \left[ D_\text{KL}(\pi_{\theta_\text{old}}(\cdot | s_t) \| \pi_\theta(\cdot | s_t)) \right] \leq \delta \]
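Linearizing the objective and approximating the KL constraint to second order in \(\Delta\theta = \theta - \theta_\text{old}\) reduces this to the local problem

\[ \max_{\Delta\theta} \; \mathbf{g}^T \Delta\theta \quad \text{subject to} \quad \frac{1}{2} \Delta\theta^T \mathbf{F} \Delta\theta \leq \delta, \]

where \(\mathbf{g}\) is the gradient of the objective and \(\mathbf{F}\) is the Hessian of the KL term (the Fisher information matrix), both evaluated at \(\theta_\text{old}\).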

This is solved approximately via the natural gradient:

\[ \Delta\theta = \sqrt{\frac{2\delta}{\mathbf{g}^T \mathbf{F}^{-1} \mathbf{g}}} \, \mathbf{F}^{-1} \mathbf{g} \]

where \(\mathbf{g}\) is the policy gradient and \(\mathbf{F}\) is the Fisher information matrix. The product \(\mathbf{F}^{-1} \mathbf{g}\) is computed via conjugate gradients without forming \(\mathbf{F}\) explicitly.
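The conjugate-gradient solve can be sketched as a generic CG that needs only a matrix-vector product callback; in TRPO that callback is a damped Fisher-vector product computed by automatic differentiation rather than the explicit toy matrix used here for illustration:

```python
import numpy as np

def conjugate_gradient(fvp, g, cg_iters=10, tol=1e-10):
    """Solve F x = g using only matrix-vector products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x (x starts at zero)
    p = r.copy()            # search direction
    rs_old = r @ r
    for _ in range(cg_iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check on a small SPD "Fisher" matrix. In TRPO, fvp would return
# the damped product (F + damping * I) v via double backprop, without
# ever materializing F.
F = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
s = conjugate_gradient(lambda v: F @ v, g)
```

For a 2x2 system CG converges in at most two iterations; in practice `cg_iters` around 10 gives a good enough approximation of \(\mathbf{F}^{-1}\mathbf{g}\) for networks with millions of parameters.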

Pseudocode

algorithm TRPO:
    initialize actor pi_theta, critic V_phi

    for iteration = 1, 2, ... do
        collect n_steps * n_envs transitions using pi_theta_old
        compute GAE advantages A_t

        # Natural gradient via conjugate gradients
        g = gradient of surrogate objective w.r.t. theta
        s = CG(Fvp, g, cg_iters)   # solve F s = g using only damped
                                   # Fisher-vector products
                                   # Fvp(v) = (F + damping * I) v
        step_size = sqrt(2 * max_kl / (s^T Fvp(s)))

        # Backtracking line search
        accepted = false
        for j = 0, 1, ..., line_search_steps - 1 do
            theta_new = theta_old + (0.5^j) * step_size * s
            if surrogate improves AND KL(theta_old, theta_new) <= max_kl:
                theta <- theta_new
                accepted = true
                break
        if not accepted: keep theta_old (skip the policy update)

        # Value function update (standard SGD with learning rate vf_lr)
        for epoch = 1 to vf_epochs do
            update phi to minimize (V_phi(s_t) - G_t)^2,
            where G_t = A_t + V_phi_old(s_t) are the bootstrapped returns
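The step-size computation and backtracking loop above can be illustrated on a toy quadratic surrogate with a quadratic KL, i.e. exactly the local approximation TRPO solves; the matrix and gradient here are made-up values, not anything rlox computes:

```python
import numpy as np

# Made-up stand-ins for illustration.
F = np.array([[4.0, 1.0], [1.0, 3.0]])   # Fisher matrix (SPD)
g = np.array([1.0, 2.0])                 # policy gradient
max_kl = 0.01
line_search_steps = 10

def surrogate(d):
    return g @ d - 0.5 * d @ F @ d       # surrogate improvement for step d

def kl(d):
    return 0.5 * d @ F @ d               # quadratic KL approximation

s = np.linalg.solve(F, g)                # stands in for the CG solve of F s = g
full_step = np.sqrt(2 * max_kl / (s @ F @ s)) * s

accepted = False
for j in range(line_search_steps):
    d = (0.5 ** j) * full_step           # halve the step until it is accepted
    if surrogate(d) > 0 and kl(d) <= max_kl:
        accepted = True
        break
# If no step passes both checks, theta is left unchanged this iteration.
```

Note that the full step is sized so the quadratic KL equals `max_kl` exactly; the line search only shrinks it when the true surrogate or true KL (not their local approximations) violate the checks.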

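The advantage-estimation step in the pseudocode can be implemented with the standard GAE recursion; this is a generic sketch, not necessarily how rlox structures its buffers:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.97):
    """Generalized Advantage Estimation over one rollout of length T."""
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]          # zero out bootstrap at episode end
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        adv[t] = gae
    return adv

# Tiny example rollout ending in a terminal state.
adv = gae_advantages(
    rewards=np.array([1.0, 1.0, 1.0]),
    values=np.array([0.5, 0.5, 0.5]),
    last_value=0.5,
    dones=np.array([0.0, 0.0, 1.0]),
)
```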
Quick Start

from rlox import Trainer

trainer = Trainer("trpo", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)

For continuous control:

trainer = Trainer("trpo", env="HalfCheetah-v4", seed=42, config={
    "max_kl": 0.01,
    "n_steps": 2048,
    "gamma": 0.99,
    "gae_lambda": 0.97,
    "cg_iters": 10,
    "vf_lr": 1e-3,
    "vf_epochs": 5,
})
metrics = trainer.train(total_timesteps=1_000_000)

Hyperparameters

All defaults from TRPOConfig:

Parameter          Default  Description
max_kl             0.01     Maximum KL divergence per update
damping            0.1      Damping for Fisher-vector products
cg_iters           10       Conjugate gradient iterations
line_search_steps  10       Backtracking line search steps
n_envs             8        Number of parallel environments
n_steps            2048     Rollout length per environment per update
gamma              0.99     Discount factor
gae_lambda         0.97     GAE lambda
vf_lr              1e-3     Value function learning rate
vf_epochs          5        Value function SGD epochs per update
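These defaults could be collected in a dataclass along the following lines; this is a hypothetical sketch mirroring the table above, and the actual TRPOConfig class in rlox may differ in fields or structure:

```python
from dataclasses import dataclass

@dataclass
class TRPOConfig:
    # Hypothetical sketch; fields mirror the defaults documented above.
    max_kl: float = 0.01
    damping: float = 0.1
    cg_iters: int = 10
    line_search_steps: int = 10
    n_envs: int = 8
    n_steps: int = 2048
    gamma: float = 0.99
    gae_lambda: float = 0.97
    vf_lr: float = 1e-3
    vf_epochs: int = 5
```

Any subset of these can then be overridden via the `config` dict shown in the Quick Start section.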

When to Use

  • Use TRPO when: you need formal trust-region guarantees on policy updates, or when PPO's clipping is insufficient for your problem's sensitivity.
  • Do not use TRPO when: wall-clock time matters (PPO is faster), or you need off-policy sample efficiency (use SAC).

References

  • Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML 2015.