TRPO -- Trust Region Policy Optimization¶
Intuition¶
TRPO directly constrains the KL divergence between successive policies, guaranteeing monotonic improvement (in theory). It solves a constrained optimization problem using conjugate gradients and a backtracking line search, avoiding the need to choose a clipping parameter. TRPO is more principled than PPO but significantly more expensive per update due to the second-order optimization.
Key Equations¶
The constrained optimization problem:
\[
\max_\theta \; \mathbb{E}_t \left[ \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t | s_t)} \hat{A}_t \right]
\]
\[
\text{subject to} \quad \mathbb{E}_t \left[ D_\text{KL}(\pi_{\theta_\text{old}}(\cdot | s_t) \| \pi_\theta(\cdot | s_t)) \right] \leq \delta
\]
This is solved approximately via the natural gradient:
\[
\Delta\theta = \sqrt{\frac{2\delta}{\mathbf{g}^T \mathbf{F}^{-1} \mathbf{g}}} \, \mathbf{F}^{-1} \mathbf{g}
\]
where \(\mathbf{g}\) is the policy gradient and \(\mathbf{F}\) is the Fisher information matrix. The product \(\mathbf{F}^{-1} \mathbf{g}\) is computed via conjugate gradients without forming \(\mathbf{F}\) explicitly.
Pseudocode¶
algorithm TRPO:
initialize actor pi_theta, critic V_phi
for iteration = 1, 2, ... do
collect n_steps * n_envs transitions using pi_theta_old
compute GAE advantages A_t
# Natural gradient via conjugate gradients
g = gradient of surrogate objective w.r.t. theta
s = CG(F, g, cg_iters) # solve Fs = g approximately
step_size = sqrt(2 * max_kl / (s^T F s))
# Backtracking line search
for j = 0, 1, ..., line_search_steps do
theta_new = theta_old + (0.5^j) * step_size * s
if surrogate improves AND KL(theta_old, theta_new) <= max_kl:
theta <- theta_new
break
# Value function update (standard SGD)
for epoch = 1 to vf_epochs do
update phi to minimize (V_phi(s) - G_t)^2
Quick Start¶
from rlox import Trainer
trainer = Trainer("trpo", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)
For continuous control:
trainer = Trainer("trpo", env="HalfCheetah-v4", seed=42, config={
"max_kl": 0.01,
"n_steps": 2048,
"gamma": 0.99,
"gae_lambda": 0.97,
"cg_iters": 10,
"vf_lr": 1e-3,
"vf_epochs": 5,
})
metrics = trainer.train(total_timesteps=1_000_000)
Hyperparameters¶
All defaults from TRPOConfig:
| Parameter | Default | Description |
|---|---|---|
max_kl |
0.01 |
Maximum KL divergence per update |
damping |
0.1 |
Damping for Fisher vector product |
cg_iters |
10 |
Conjugate gradient iterations |
line_search_steps |
10 |
Backtracking line search steps |
n_envs |
8 |
Number of parallel environments |
n_steps |
2048 |
Rollout length per environment per update |
gamma |
0.99 |
Discount factor |
gae_lambda |
0.97 |
GAE lambda |
vf_lr |
1e-3 |
Value function learning rate |
vf_epochs |
5 |
Value function SGD epochs per update |
When to Use¶
- Use TRPO when: you need formal trust-region guarantees on policy updates, or when PPO's clipping is insufficient for your problem's sensitivity.
- Do not use TRPO when: wall-clock time matters (PPO is faster), or you need off-policy sample efficiency (use SAC).
References¶
- Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust Region Policy Optimization. ICML 2015.