Skip to content

VPG -- Vanilla Policy Gradient

Intuition

Vanilla Policy Gradient (REINFORCE) is the simplest policy gradient algorithm. It collects complete episodes, computes the total return for each timestep, and nudges the policy parameters in the direction that makes high-return actions more likely. VPG is rarely used in practice due to high variance, but it is the conceptual foundation for all policy gradient methods in rlox (A2C, PPO, TRPO).

Key Equations

The policy gradient theorem gives the gradient of the expected return:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \, G_t \right] \]

where \(G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k\) is the discounted return-to-go.

With a baseline \(b(s_t)\) (typically a learned value function \(V_\phi(s_t)\)), the variance is reduced without introducing bias:

\[ \nabla_\theta J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \, \hat{A}_t \right] \]

where \(\hat{A}_t = G_t - V_\phi(s_t)\) is the advantage estimate.

Pseudocode

algorithm VPG:
    initialize policy network pi_theta
    initialize value network V_phi (optional baseline)

    for iteration = 1, 2, ... do
        collect trajectories {(s_t, a_t, r_t)} by running pi_theta
        compute returns G_t = sum_{k=t}^{T} gamma^{k-t} * r_k
        compute advantages A_t = G_t - V_phi(s_t)

        # Policy update
        L_policy = -mean(log pi_theta(a_t | s_t) * A_t)
        theta <- theta - alpha * grad(L_policy)

        # Baseline update
        L_value = mean((V_phi(s_t) - G_t)^2)
        phi <- phi - alpha_v * grad(L_value)

Quick Start

VPG is not implemented as a standalone algorithm in rlox. Instead, use PPO with n_epochs=1 and clip_eps=1.0 (effectively disabling clipping) for an equivalent single-pass policy gradient:

from rlox import Trainer

trainer = Trainer("ppo", env="CartPole-v1", seed=42, config={
    "n_epochs": 1,
    "clip_eps": 1.0,       # no clipping = vanilla policy gradient
    "n_steps": 2048,
    "learning_rate": 3e-4,
    "gamma": 0.99,
    "gae_lambda": 1.0,     # Monte Carlo returns (no GAE)
})
metrics = trainer.train(total_timesteps=100_000)

Hyperparameters

Since VPG is emulated via PPO, the relevant parameters from PPOConfig:

Parameter Value for VPG Description
n_epochs 1 Single pass over rollout data
clip_eps 1.0 Disables ratio clipping
n_steps 2048 Rollout length (longer is better for VPG)
learning_rate 3e-4 Adam learning rate
gamma 0.99 Discount factor
gae_lambda 1.0 Set to 1.0 for Monte Carlo returns
vf_coef 0.5 Value function loss weight
ent_coef 0.0 Entropy bonus (0 for pure VPG)

When to Use

  • Use VPG when: you are learning policy gradients for the first time, or need a minimal baseline for comparison.
  • Do not use VPG when: you need stable, sample-efficient training. Use PPO or A2C instead.

References

  • Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256.
  • Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. NeurIPS.