A2C -- Advantage Actor-Critic

Intuition

A2C is the synchronous variant of A3C. It uses a shared actor-critic network trained with a single gradient step per rollout (no clipping, no replay). Multiple parallel environments provide diverse experience that reduces correlation between samples. A2C is simpler and faster per update than PPO, but less stable for large policy changes.

Key Equations

The policy gradient with advantage:

\[ \nabla_\theta J(\theta) = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \, \hat{A}_t \right] \]

The advantage is estimated with GAE, truncated to the rollout length; with \(\lambda = 1\) it reduces to plain n-step returns:

\[ \hat{A}_t = \sum_{l=0}^{n-1} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
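The backward recursion \(\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}\) computes this sum in one pass. A minimal pure-Python sketch (list-based for clarity; the actual implementation operates on batched arrays across environments):

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=1.0):
    """GAE over an n-step rollout; lam=1.0 yields full n-step returns.

    rewards, values, dones: per-step lists; last_value: V(s_{t+n}) bootstrap.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = last_value if t == len(rewards) - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]  # zero out bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    return advantages
```

With `gamma=1.0`, `lam=1.0`, three reward-1 steps, zero values, and a terminal at the last step, this returns the undiscounted returns-to-go `[3, 2, 1]`.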

The combined loss:

\[ L(\theta) = -L^{\pi}(\theta) + c_1 \, L^{VF}(\theta) - c_2 \, H[\pi_\theta] \]

where \(L^{\pi} = \mathbb{E}_t\!\left[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]\), \(L^{VF} = \frac{1}{2}\,\mathbb{E}_t\!\left[(V_\theta(s_t) - G_t)^2\right]\), and \(H[\pi_\theta]\) is the policy entropy.
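The combined loss can be sketched with plain scalars (a didactic version with hypothetical argument names; the real loss works on tensors and drops the \(\frac{1}{2}\) factor into `vf_coef`):

```python
def a2c_loss(log_probs, advantages, values, returns, entropy,
             vf_coef=0.5, ent_coef=0.01):
    """Combined A2C loss for one batch of transitions."""
    n = len(log_probs)
    # Policy gradient term: maximize log pi * A, so minimize its negative.
    policy_loss = -sum(lp * a for lp, a in zip(log_probs, advantages)) / n
    # Value loss: mean squared error against bootstrapped returns G_t.
    value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    # Entropy bonus is subtracted: higher entropy lowers the loss.
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```

Note the signs: a positive advantage with a low log-probability produces a large policy loss, pushing probability mass toward that action.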

Pseudocode

algorithm A2C:
    initialize shared actor-critic network pi_theta, V_theta

    for iteration = 1, 2, ... do
        collect n_steps transitions from n_envs parallel environments
        compute GAE advantages A_t (with gae_lambda=1.0 for n-step returns)
        compute returns G_t = A_t + V_theta(s_t)

        # Single gradient step (no minibatches, no epochs)
        L_policy = -mean(log pi_theta(a_t | s_t) * A_t)
        L_value  = mean((V_theta(s_t) - G_t)^2)
        L_entropy = -H[pi_theta]

        loss = L_policy + vf_coef * L_value + ent_coef * L_entropy
        update theta with RMSprop, clip gradients to max_grad_norm
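The final step clips gradients by their global L2 norm. A minimal sketch of that clipping on a flat list of gradient values (illustrative only; real implementations clip per-parameter tensors in place):

```python
import math

def clip_grad_norm(grads, max_grad_norm=0.5):
    """Rescale grads so their global L2 norm does not exceed max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_grad_norm:
        scale = max_grad_norm / total_norm  # uniform rescale preserves direction
        grads = [g * scale for g in grads]
    return grads
```

Clipping the whole gradient vector (rather than each element) keeps the update direction unchanged while bounding its magnitude, which matters because A2C takes a single unclipped policy step per rollout.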

Quick Start

from rlox import Trainer

trainer = Trainer("a2c", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)

For Atari-style environments:

trainer = Trainer("a2c", env="PongNoFrameskip-v4", seed=42, config={
    "learning_rate": 7e-4,
    "n_steps": 5,
    "n_envs": 16,
    "gamma": 0.99,
    "gae_lambda": 1.0,
    "ent_coef": 0.01,
})
metrics = trainer.train(total_timesteps=10_000_000)

Hyperparameters

All defaults from A2CConfig:

| Parameter | Default | Description |
| --- | --- | --- |
| learning_rate | 7e-4 | RMSprop learning rate |
| n_steps | 5 | Rollout length per update |
| gamma | 0.99 | Discount factor |
| gae_lambda | 1.0 | GAE lambda (1.0 = full n-step returns) |
| vf_coef | 0.5 | Value function loss coefficient |
| ent_coef | 0.01 | Entropy bonus coefficient |
| max_grad_norm | 0.5 | Gradient clipping threshold |
| normalize_advantages | False | Normalize advantages per batch |
| n_envs | 8 | Number of parallel environments |
| hidden | 64 | Hidden layer width |
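These defaults can be overridden through the `config=` dict shown in Quick Start. As a reference, the full set expressed as a plain dict (key names follow the table; whether `A2CConfig` accepts every key in exactly this form is an assumption):

```python
# Defaults from the table above, as a config dict for Trainer(..., config=...)
a2c_defaults = {
    "learning_rate": 7e-4,          # RMSprop learning rate
    "n_steps": 5,                   # rollout length per update
    "gamma": 0.99,                  # discount factor
    "gae_lambda": 1.0,              # 1.0 = full n-step returns
    "vf_coef": 0.5,                 # value function loss coefficient
    "ent_coef": 0.01,               # entropy bonus coefficient
    "max_grad_norm": 0.5,           # gradient clipping threshold
    "normalize_advantages": False,  # normalize advantages per batch
    "n_envs": 8,                    # parallel environments
    "hidden": 64,                   # hidden layer width
}
```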

When to Use

  • Use A2C when: you want a fast, simple on-policy baseline that is easy to debug and scales well with parallel environments.
  • Do not use A2C when: you need stable training with large batch sizes (use PPO) or sample efficiency (use SAC).

References

  • Mnih, V., Badia, A. P., Mirza, M., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. ICML (A3C, the async predecessor).
  • Stable Baselines3 A2C implementation (synchronous variant).