AWR -- Advantage-Weighted Regression

Intuition

AWR sidesteps the instabilities of importance sampling by directly weighting policy log-probabilities with exponentiated advantages. Instead of computing probability ratios between old and new policies (as PPO does), AWR fits the actor via weighted maximum likelihood: actions that turned out to be much better than the baseline receive exponentially higher weight. This makes AWR particularly simple to implement and suitable for both online and offline settings.

Key Equations

The actor loss uses advantage-weighted regression:

\[ L_{\text{actor}}(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \exp\!\left(\frac{A(s,a)}{\beta}\right) \log \pi_\theta(a|s) \right] \]

where \(\beta\) is a temperature parameter controlling the sharpness of the weighting and \(A(s,a) = r + \gamma V(s') - V(s)\) is the TD advantage.
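To make the role of \(\beta\) concrete, the following framework-free sketch computes the TD advantage and the unclamped weight \(\exp(A/\beta)\) for a few advantages. The function names `td_advantage` and `awr_weight` are illustrative and not part of rlox; the `done` flag mirrors the `(1 - done)` mask used in the pseudocode.

```python
import math


def td_advantage(r, v_s, v_next, gamma=0.99, done=False):
    """One-step TD advantage A(s,a) = r + gamma * V(s') - V(s)."""
    # No bootstrapping from V(s') when the episode terminated at this step.
    bootstrap = 0.0 if done else gamma * v_next
    return r + bootstrap - v_s


def awr_weight(advantage, beta):
    """Unclamped AWR weight exp(A / beta)."""
    return math.exp(advantage / beta)


# Weights for advantages -1, 0, +1 at three temperatures.
for beta in (0.5, 1.0, 2.0):
    weights = [round(awr_weight(a, beta), 3) for a in (-1.0, 0.0, 1.0)]
    print(beta, weights)
# 0.5 [0.135, 1.0, 7.389]
# 1.0 [0.368, 1.0, 2.718]
# 2.0 [0.607, 1.0, 1.649]
```

At \(\beta = 0.5\) the best action outweighs the worst by a factor of about 55, while at \(\beta = 2\) the factor is under 3, so a lower temperature concentrates the regression on the highest-advantage actions.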

The critic is trained with standard TD regression:

\[ L_{\text{critic}}(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( V_\phi(s) - \left( r + \gamma V_{\bar{\phi}}(s') \right) \right)^2 \right] \]
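As a rough illustration of the critic objective, the sketch below builds the TD targets and the mean squared error on plain Python lists. It assumes `next_values` were produced by the frozen parameters \(\bar{\phi}\) and are treated as constants; the helper names are hypothetical, and the `dones` mask follows the pseudocode rather than the equation above.

```python
def td_targets(rewards, next_values, dones, gamma=0.99):
    """Targets r + gamma * (1 - done) * V(s'), one per transition."""
    return [
        r + gamma * (1.0 - d) * v_next
        for r, v_next, d in zip(rewards, next_values, dones)
    ]


def critic_loss(values, targets):
    """Mean squared error between V(s) and the fixed TD targets."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)


# Two transitions; the second one is terminal, so its target is just r.
targets = td_targets(
    rewards=[1.0, 0.0], next_values=[2.0, 5.0], dones=[0.0, 1.0], gamma=0.5
)
loss = critic_loss(values=[1.0, 1.0], targets=targets)
print(targets, loss)  # [2.0, 0.0] 1.0
```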

Exponentiated advantages are clamped at \(w_{\max}\) to keep the weights bounded and prevent overflow:

\[ w(s,a) = \min\!\left( \exp\!\left(\frac{A(s,a)}{\beta}\right),\; w_{\max} \right) \]
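One way to compute the clamped weights and the resulting actor loss is sketched below. It clamps the exponent at \(\log w_{\max}\) instead of clamping after the exponential; the two are mathematically identical, but this form never evaluates a huge exponential (in pure Python, `math.exp` raises `OverflowError` for large arguments). The function names are illustrative, and the default `w_max=20.0` is taken from the `max_advantage` hyperparameter.

```python
import math


def awr_weights(advantages, beta=1.0, w_max=20.0):
    """Clamped weights min(exp(A / beta), w_max), computed without overflow."""
    log_cap = math.log(w_max)
    return [math.exp(min(a / beta, log_cap)) for a in advantages]


def actor_loss(log_probs, advantages, beta=1.0, w_max=20.0):
    """Negative weighted log-likelihood: -mean(w * log pi(a|s))."""
    weights = awr_weights(advantages, beta, w_max)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# A huge advantage is capped at w_max instead of overflowing.
print(awr_weights([0.0, 1000.0]))

# With zero advantages every weight is 1, so the loss is the plain
# negative log-likelihood of the batch.
print(actor_loss(log_probs=[-1.0, -2.0], advantages=[0.0, 0.0]))  # 1.5
```

In an autodiff framework the weights would be detached from the graph, so gradients flow only through the log-probabilities.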

Pseudocode

algorithm AWR:
    initialize actor pi_theta, critic V_phi
    initialize replay buffer D with capacity buffer_size
    observe initial state s

    for step = 1 to total_timesteps do
        if step < learning_starts then
            a = random action
        else
            a = sample from pi_theta(.|s)

        execute a, observe r, s', done
        store (s, a, r, done, s') in D
        s = s' (reset the environment if done)

        if step >= learning_starts and |D| >= batch_size then
            sample minibatch {(s, a, r, done, s')} from D

            # Critic update (targets are held fixed: no gradient flows through them)
            repeat n_critic_updates times
                targets = r + gamma * (1 - done) * V_phi(s')
                L_critic = MSE(V_phi(s), targets)
                update phi

            # Actor update (AWR); A and w are treated as constants
            A = targets - V_phi(s)
            w = clamp(exp(A / beta), max=max_advantage)
            L_actor = -mean(w * log pi_theta(a|s))
            update theta
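The critic and actor updates above can be traced end to end on a toy problem. The sketch below is not the rlox implementation: it assumes a single-state, two-action bandit with a fixed offline batch, a scalar critic, a softmax actor, and hand-written gradients, so that the whole loop is deterministic and fits in a few lines.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def awr_bandit(beta=1.0, w_max=20.0, lr=0.5, iterations=500):
    # Fixed offline batch of (action, reward). Every transition is terminal,
    # so the TD target reduces to r (done = 1 in the pseudocode).
    batch = [(0, 1.0), (1, 0.0)]
    v = 0.0              # critic: one scalar V because there is one state
    logits = [0.0, 0.0]  # actor: softmax policy over the two actions

    for _ in range(iterations):
        # Critic update: gradient step on mean((V - r)^2) / 2.
        v += lr * sum(r - v for _, r in batch) / len(batch)

        # Actor update: weights are constants; ascend mean(w * log pi(a)).
        probs = softmax(logits)
        grad = [0.0, 0.0]
        for a, r in batch:
            w = math.exp(min((r - v) / beta, math.log(w_max)))
            for i in range(2):
                # d log pi(a) / d logit_i = 1[a == i] - pi_i
                indicator = 1.0 if i == a else 0.0
                grad[i] += w * (indicator - probs[i]) / len(batch)
        logits = [x + lr * g for x, g in zip(logits, grad)]

    return v, softmax(logits)


v, probs = awr_bandit()
print(round(v, 3), [round(p, 3) for p in probs])  # 0.5 [0.731, 0.269]
```

The policy does not become greedy. With a uniform data distribution it settles where the action probabilities are proportional to \(\exp(A/\beta)\), here \(1/(1+e^{-1}) \approx 0.731\) for the rewarded action; lowering `beta` pushes that probability toward 1.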

Quick Start

from rlox import Trainer

trainer = Trainer("awr", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=50_000)
print(f"Mean reward: {metrics['mean_reward']:.1f}")

For continuous control:

from rlox import Trainer

trainer = Trainer("awr", env="Pendulum-v1", seed=42, config={
    "beta": 0.5,
    "learning_rate": 3e-4,
    "batch_size": 256,
    "buffer_size": 100_000,
    "gamma": 0.99,
})
metrics = trainer.train(total_timesteps=100_000)

Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| beta | 1.0 | Temperature for advantage weighting (lower = sharper) |
| learning_rate | 3e-4 | Adam learning rate for actor and critic |
| gamma | 0.99 | Discount factor |
| batch_size | 256 | Minibatch size for SGD |
| buffer_size | 100_000 | Replay buffer capacity |
| hidden | 256 | Hidden layer width |
| learning_starts | 1000 | Random exploration steps before training |
| n_critic_updates | 1 | Number of critic updates per training step |
| max_advantage | 20.0 | Clamp for exponentiated advantage (prevents overflow) |
| seed | 42 | Random seed |
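When experimenting with overrides, it can help to keep the defaults in one place and reject misspelled keys early. The snippet below is a hypothetical helper, not part of rlox: `AWR_DEFAULTS` simply copies the values listed above, and `make_awr_config` is an illustrative name. Whether `Trainer` accepts every one of these keys through `config=` is an assumption; the Quick Start only shows `beta`, `learning_rate`, `batch_size`, `buffer_size`, and `gamma`.

```python
# Defaults copied from the hyperparameter table; illustrative only.
AWR_DEFAULTS = {
    "beta": 1.0,
    "learning_rate": 3e-4,
    "gamma": 0.99,
    "batch_size": 256,
    "buffer_size": 100_000,
    "hidden": 256,
    "learning_starts": 1000,
    "n_critic_updates": 1,
    "max_advantage": 20.0,
    "seed": 42,
}


def make_awr_config(**overrides):
    """Return the defaults with overrides applied, rejecting unknown keys."""
    unknown = set(overrides) - set(AWR_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown AWR hyperparameters: {sorted(unknown)}")
    return {**AWR_DEFAULTS, **overrides}


config = make_awr_config(beta=0.5, batch_size=128)
print(config["beta"], config["batch_size"], config["gamma"])  # 0.5 128 0.99
```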

When to Use

  • Use AWR when: you want a simple off-policy algorithm that avoids importance sampling, or you have offline data and need a straightforward baseline.
  • Prefer AWR over PPO when: you want to combine online and offline data, or importance sampling ratios are unstable.
  • Do not use AWR when: you need maximum sample efficiency on continuous control (prefer SAC or TD3), or the task requires careful entropy tuning.

References

  • Peng, X. B., Kumar, A., Zhang, G., & Levine, S. (2019). Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. arXiv:1910.00177.