IMPALA -- Distributed Actor-Learner Architecture

Intuition

IMPALA (Importance Weighted Actor-Learner Architecture) decouples acting from learning. Multiple actor threads collect experience in parallel while a centralized learner consumes batches from a queue. Because each actor acts with an older copy of the policy (the behavior policy \(\mu\)) than the learner's current policy \(\pi\), IMPALA applies V-trace importance-sampling corrections to account for this policy lag. The result is high throughput without the staleness problems of naive asynchronous methods.
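The decoupling can be sketched with a bounded queue: actors push rollouts tagged with the (possibly stale) policy version they acted under, while a single learner drains the queue and advances the weights. This is a minimal illustration only; all names (`experience_q`, `latest_weights`, the version counter standing in for real parameters) are hypothetical, not rlox internals.

```python
import queue
import threading

experience_q = queue.Queue(maxsize=16)   # bounded, like queue_size in the config
latest_weights = {"version": 0}          # stands in for the learner's parameters
consumed = []                            # rollouts the learner has processed

def actor(actor_id, n_rollouts):
    for _ in range(n_rollouts):
        mu_version = latest_weights["version"]          # copy possibly stale weights
        rollout = {"actor": actor_id, "mu_version": mu_version}
        experience_q.put(rollout)                       # blocks if the learner lags

def learner(n_batches):
    for _ in range(n_batches):
        rollout = experience_q.get()                    # blocks until data arrives
        latest_weights["version"] += 1                  # a "gradient step" advances pi
        consumed.append(rollout)

actors = [threading.Thread(target=actor, args=(i, 4)) for i in range(2)]
learn = threading.Thread(target=learner, args=(8,))
for t in actors + [learn]:
    t.start()
for t in actors + [learn]:
    t.join()
```

After the run, each consumed rollout carries a `mu_version` at or below the learner's version at consumption time: that gap is exactly the policy lag V-trace must correct for.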

Key Equations

V-trace target for off-policy correction:

\[ v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V \]

where the temporal difference is:

\[ \delta_t V = \rho_t (r_t + \gamma V(x_{t+1}) - V(x_t)) \]

The truncated importance weights:

\[ \rho_t = \min\left(\bar{\rho}, \frac{\pi(a_t | x_t)}{\mu(a_t | x_t)}\right), \quad c_t = \min\left(\bar{c}, \frac{\pi(a_t | x_t)}{\mu(a_t | x_t)}\right) \]

Policy gradient using V-trace advantages:

\[ \nabla_\theta J = \mathbb{E} \left[ \rho_t \nabla_\theta \log \pi_\theta(a_t | x_t) (r_t + \gamma v_{t+1} - V(x_t)) \right] \]
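The three equations above can be combined into a short numerical computation. The sketch below (a plain-NumPy illustration, not the rlox implementation) clips the importance ratios to get \(\rho_t\) and \(c_t\), forms the TD terms \(\delta_t V\), and accumulates the targets \(v_s\) with the backward recursion \(v_s - V(x_s) = \delta_s V + \gamma c_s (v_{s+1} - V(x_{s+1}))\), which is equivalent to the summed form.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap, log_rhos,
                   gamma=0.99, rho_clip=1.0, c_clip=1.0):
    """Compute V-trace targets v_s for one trajectory of length T.

    rewards, values, log_rhos: arrays of shape (T,); log_rhos holds
    log(pi(a|x) / mu(a|x)). bootstrap is V(x_T) for the final state.
    """
    ratios = np.exp(np.asarray(log_rhos))
    rhos = np.minimum(rho_clip, ratios)          # truncated rho_t
    cs = np.minimum(c_clip, ratios)              # truncated c_t
    next_values = np.append(values[1:], bootstrap)
    deltas = rhos * (rewards + gamma * next_values - values)  # delta_t V
    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    acc = 0.0
    vs_minus_v = np.zeros_like(np.asarray(values, dtype=float))
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v

# On-policy (log_rhos = 0) the targets reduce to plain n-step returns:
v = vtrace_targets(rewards=np.array([1.0, 1.0]),
                   values=np.array([0.0, 0.0]),
                   bootstrap=0.0,
                   log_rhos=np.array([0.0, 0.0]),
                   gamma=0.5)
# v == [1.5, 1.0]: v_0 = r_0 + gamma * r_1, v_1 = r_1
```

The on-policy check is a useful sanity test: with \(\pi = \mu\) all ratios are 1, the truncations are inactive, and V-trace degenerates to ordinary n-step bootstrapped returns.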

Pseudocode

algorithm IMPALA:
    initialize learner network pi_theta, V_theta
    initialize experience queue Q (max size = queue_size)
    launch n_actors actor threads

    # Each actor thread:
    actor(i):
        copy weights from learner: mu <- theta
        for step = 1, 2, ... do
            collect n_steps transitions using mu
            enqueue (states, actions, rewards, mu_probs) to Q
            periodically: mu <- theta

    # Learner:
    for batch from Q do
        compute importance ratio w_t = pi_theta(a|s) / mu(a|s)
        clip: rho_t = min(rho_clip, w_t)
        clip: c_t   = min(c_clip, w_t)
        compute V-trace targets v_s
        compute advantages A_t = rho_t * (r_t + gamma * v_{t+1} - V(x_t))  # no gradient through A_t

        L_policy  = -mean(log pi_theta(a|s) * A_t)
        L_value   = mean((V_theta(s) - v_s)^2)
        L_entropy = -H[pi_theta]

        loss = L_policy + vf_coef * L_value + ent_coef * L_entropy
        update theta with RMSprop, clip gradients to max_grad_norm
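The learner's loss terms can be written out concretely. The sketch below is a NumPy illustration of the arithmetic only (the array names and the helper are hypothetical, and real training code would use an autodiff framework and treat the advantage as a constant, as the pseudocode notes).

```python
import numpy as np

def impala_loss(log_pi, rho, rewards, values, v_next, v_s, probs,
                gamma=0.99, vf_coef=0.5, ent_coef=0.01):
    """Combined IMPALA loss for a batch of T transitions.

    log_pi: learner log-probs of the taken actions, shape (T,)
    rho:    clipped importance weights rho_t, shape (T,)
    v_next: V-trace targets shifted by one step (v_{t+1}), shape (T,)
    v_s:    V-trace targets v_s, shape (T,)
    probs:  full action distributions, shape (T, n_actions)
    """
    adv = rho * (rewards + gamma * v_next - values)   # V-trace advantage, held constant
    l_policy = -np.mean(log_pi * adv)
    l_value = np.mean((values - v_s) ** 2)
    l_entropy = np.sum(probs * np.log(probs), axis=-1).mean()  # = -H[pi]
    return l_policy + vf_coef * l_value + ent_coef * l_entropy

# With zero advantage and targets matching values, only the entropy term remains:
loss = impala_loss(log_pi=np.log([0.5]), rho=np.array([1.0]),
                   rewards=np.array([0.0]), values=np.array([0.0]),
                   v_next=np.array([0.0]), v_s=np.array([0.0]),
                   probs=np.array([[0.5, 0.5]]))
# loss == -ent_coef * log(2), the negated entropy of a uniform 2-action policy
```

Note the signs: `l_entropy` is the negative entropy, so adding it with a positive `ent_coef` rewards higher-entropy (more exploratory) policies.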

Quick Start

from rlox import Trainer

trainer = Trainer("impala", env="CartPole-v1", seed=42, config={
    "n_actors": 4,
    "n_steps": 20,
})
metrics = trainer.train(total_timesteps=500_000)

For large-scale training:

trainer = Trainer("impala", env="PongNoFrameskip-v4", seed=42, config={
    "n_actors": 16,
    "n_steps": 20,
    "n_envs_per_actor": 2,
    "learning_rate": 4e-4,
    "queue_size": 32,
    "rho_clip": 1.0,
    "c_clip": 1.0,
})
metrics = trainer.train(total_timesteps=50_000_000)

Hyperparameters

All defaults from IMPALAConfig:

Parameter         Default  Description
learning_rate     4e-4     RMSprop learning rate
n_actors          4        Number of actor threads
n_steps           20       Rollout length per actor per batch
gamma             0.99     Discount factor
vf_coef           0.5      Value loss coefficient
ent_coef          0.01     Entropy bonus coefficient
max_grad_norm     40.0     Maximum gradient norm for clipping
rho_clip          1.0      V-trace truncation for importance weights (\(\bar{\rho}\))
c_clip            1.0      V-trace truncation for trace coefficients (\(\bar{c}\))
queue_size        16       Maximum experience queue size
hidden            256      Hidden layer width
n_envs_per_actor  1        Environments per actor thread

When to Use

  • Use IMPALA when: you need high-throughput distributed training, especially for Atari-scale problems or when you have many CPU cores available.
  • Do not use IMPALA when: you have a single machine with limited cores (use PPO or A2C), or need sample efficiency over throughput (use SAC).

References

  • Espeholt, L., Soyer, H., Munos, R., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML 2018.