IMPALA -- Distributed Actor-Learner Architecture¶
Intuition¶
IMPALA (Importance Weighted Actor-Learner Architecture) decouples acting from learning. Multiple actor threads collect experience in parallel while a centralized learner consumes batches from a queue. Because actors run an older policy than the learner, IMPALA uses V-trace importance sampling corrections to account for the policy lag. This enables high throughput without the staleness problems of naive asynchronous methods.
Key Equations¶
V-trace target for off-policy correction:
\[
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s} \left( \prod_{i=s}^{t-1} c_i \right) \delta_t V
\]
where the temporal difference is:
\[
\delta_t V = \rho_t (r_t + \gamma V(x_{t+1}) - V(x_t))
\]
The truncated importance weights:
\[
\rho_t = \min\left(\bar{\rho}, \frac{\pi(a_t | x_t)}{\mu(a_t | x_t)}\right), \quad c_t = \min\left(\bar{c}, \frac{\pi(a_t | x_t)}{\mu(a_t | x_t)}\right)
\]
Policy gradient using V-trace advantages:
\[
\nabla_\theta J = \mathbb{E} \left[ \rho_t \nabla_\theta \log \pi_\theta(a_t | x_t) (r_t + \gamma v_{t+1} - V(x_t)) \right]
\]
Pseudocode¶
algorithm IMPALA:
initialize learner network pi_theta, V_theta
initialize experience queue Q (max size = queue_size)
launch n_actors actor threads
# Each actor thread:
actor(i):
copy weights from learner: mu <- theta
for step = 1, 2, ... do
collect n_steps transitions using mu
enqueue (states, actions, rewards, mu_probs) to Q
periodically: mu <- theta
# Learner:
for batch from Q do
compute importance weights rho_t = pi_theta(a|s) / mu(a|s)
clip: rho_t = min(rho_clip, rho_t)
clip: c_t = min(c_clip, rho_t)
compute V-trace targets v_s
compute advantages A_t = rho_t * (r_t + gamma * v_{t+1} - V(x_t))
L_policy = -mean(log pi_theta(a|s) * A_t)
L_value = mean((V_theta(s) - v_s)^2)
L_entropy = -H[pi_theta]
loss = L_policy + vf_coef * L_value + ent_coef * L_entropy
update theta with RMSprop, clip gradients to max_grad_norm
Quick Start¶
from rlox import Trainer
trainer = Trainer("impala", env="CartPole-v1", seed=42, config={
"n_actors": 4,
"n_steps": 20,
})
metrics = trainer.train(total_timesteps=500_000)
For large-scale training:
trainer = Trainer("impala", env="PongNoFrameskip-v4", seed=42, config={
"n_actors": 16,
"n_steps": 20,
"n_envs_per_actor": 2,
"learning_rate": 4e-4,
"queue_size": 32,
"rho_clip": 1.0,
"c_clip": 1.0,
})
metrics = trainer.train(total_timesteps=50_000_000)
Hyperparameters¶
All defaults from IMPALAConfig:
| Parameter | Default | Description |
|---|---|---|
learning_rate |
4e-4 |
RMSprop learning rate |
n_actors |
4 |
Number of actor threads |
n_steps |
20 |
Rollout length per actor per batch |
gamma |
0.99 |
Discount factor |
vf_coef |
0.5 |
Value loss coefficient |
ent_coef |
0.01 |
Entropy bonus coefficient |
max_grad_norm |
40.0 |
Maximum gradient norm for clipping |
rho_clip |
1.0 |
V-trace truncation for importance weights (\(\bar{\rho}\)) |
c_clip |
1.0 |
V-trace truncation for trace coefficients (\(\bar{c}\)) |
queue_size |
16 |
Maximum experience queue size |
hidden |
256 |
Hidden layer width |
n_envs_per_actor |
1 |
Environments per actor thread |
When to Use¶
- Use IMPALA when: you need high-throughput distributed training, especially for Atari-scale problems or when you have many CPU cores available.
- Do not use IMPALA when: you have a single machine with limited cores (use PPO or A2C), or need sample efficiency over throughput (use SAC).
References¶
- Espeholt, L., Soyer, H., Munos, R., et al. (2018). IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures. ICML 2018.