SAC -- Soft Actor-Critic

Intuition

Soft Actor-Critic (SAC) augments the standard RL objective with an entropy bonus, so the policy is encouraged to act as randomly as possible while still maximizing return. Policies trained under this maximum-entropy framework explore effectively and tend to be less sensitive to hyperparameter choices. SAC trains twin Q-networks and uses the smaller of the two estimates to mitigate overestimation, and it can tune the entropy coefficient \(\alpha\) automatically.

Key Equations

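The maximum-entropy objective, where \(\alpha\) is the entropy coefficient and \(H\) is the entropy of the policy's action distribution: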

\[ J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, H(\pi(\cdot | s_t)) \right] \]
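To make the objective concrete, here is a minimal sketch that evaluates \(r + \alpha H\) along one short trajectory. It assumes a one-dimensional Gaussian policy without action squashing, and the helper names `gaussian_entropy` and `soft_return` as well as the reward and log-std values are illustrative, not part of rlox.

```python
import math


def gaussian_entropy(log_std):
    """Differential entropy of a diagonal Gaussian, given per-dimension log-stds."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e) + ls for ls in log_std)


def soft_return(rewards, entropies, alpha):
    """Sum of r_t + alpha * H_t over one trajectory (a single-sample estimate of J)."""
    return sum(r + alpha * h for r, h in zip(rewards, entropies))


rewards = [1.0, 0.5, 2.0]
# Illustrative policy that becomes narrower (lower entropy) over time.
entropies = [gaussian_entropy([ls]) for ls in (0.0, -0.5, -1.0)]

print(soft_return(rewards, entropies, alpha=0.0))  # 3.5, the plain return
print(soft_return(rewards, entropies, alpha=0.2))  # larger: entropy is rewarded
```

With \(\alpha = 0\) the objective reduces to the ordinary return; a larger \(\alpha\) shifts the trade-off toward keeping the policy stochastic.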

Soft Bellman backup for twin Q-functions:

\[ Q_\text{target}(s_t, a_t) = r_t + \gamma (1 - d_t) \left( \min_{i=1,2} Q_{\phi_i'}(s_{t+1}, \tilde{a}_{t+1}) - \alpha \log \pi_\theta(\tilde{a}_{t+1} | s_{t+1}) \right) \]

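where \(\tilde{a}_{t+1} \sim \pi_\theta(\cdot | s_{t+1})\), \(d_t\) is 1 if the episode terminated at step \(t\) and 0 otherwise, and \(\phi_i'\) are the target-network parameters.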
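As a worked instance of the backup, the sketch below computes the target for a single transition using plain floats. The function name `soft_q_target` and the numeric inputs are illustrative assumptions; a real implementation would apply the same arithmetic to batched tensors.

```python
def soft_q_target(r, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """Soft Bellman target for one transition.

    q1_next, q2_next: target-critic values at (s', a') with a' ~ pi(.|s')
    logp_next:        log pi(a'|s') for that same sampled action
    """
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return r + gamma * (1.0 - done) * soft_value


# Non-terminal step: bootstrap from the smaller critic plus the entropy term.
y = soft_q_target(r=1.0, done=0.0, q1_next=10.0, q2_next=12.0, logp_next=-1.5)
print(round(y, 3))  # 11.197

# Terminal step: the bootstrap term is masked out, leaving only the reward.
print(soft_q_target(r=1.0, done=1.0, q1_next=10.0, q2_next=12.0, logp_next=-1.5))  # 1.0
```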

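Policy loss (minimizing it maximizes Q and entropy together), with \(\tilde{a}_t \sim \pi_\theta(\cdot | s_t)\) drawn via the reparameterization trick so that gradients flow through the sampled action: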

\[ L_\pi(\theta) = \mathbb{E}_{s_t} \left[ \alpha \log \pi_\theta(\tilde{a}_t | s_t) - \min_{i=1,2} Q_{\phi_i}(s_t, \tilde{a}_t) \right] \]
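The sketch below estimates this loss by Monte Carlo for a one-dimensional action. It assumes a tanh-squashed Gaussian policy, which is a common choice for SAC but is not stated on this page, and it uses a state-independent mean and log-std plus two toy critics purely for illustration. All names here are hypothetical.

```python
import math
import random


def sample_squashed_gaussian(mean, log_std, rng):
    """Return a = tanh(mean + std * eps) and log pi(a) for a 1-D action."""
    eps = rng.gauss(0.0, 1.0)
    u = mean + math.exp(log_std) * eps  # reparameterized pre-squash sample
    a = math.tanh(u)
    # Gaussian log-density of u ...
    logp_u = -0.5 * eps * eps - log_std - 0.5 * math.log(2.0 * math.pi)
    # ... minus log|da/du| = log(1 - tanh(u)^2); the 1e-6 guards against log(0).
    logp_a = logp_u - math.log(1.0 - a * a + 1e-6)
    return a, logp_a


def policy_loss(states, mean, log_std, q1, q2, alpha, rng):
    """Monte Carlo estimate of E_s[alpha * log pi(a|s) - min(Q1, Q2)(s, a)]."""
    total = 0.0
    for s in states:
        a, logp = sample_squashed_gaussian(mean, log_std, rng)
        total += alpha * logp - min(q1(s, a), q2(s, a))
    return total / len(states)


def q1(s, a):
    """Toy critic that prefers actions near 0.5 (ignores the state)."""
    return -(a - 0.5) ** 2


def q2(s, a):
    """Second toy critic, slightly more optimistic than q1."""
    return -(a - 0.5) ** 2 + 0.1


rng = random.Random(0)
states = [0.0] * 256
good = policy_loss(states, math.atanh(0.5), -5.0, q1, q2, alpha=0.0, rng=rng)
bad = policy_loss(states, -2.0, -5.0, q1, q2, alpha=0.0, rng=rng)
print(good < bad)  # True: actions near the critics' optimum give a lower loss
```

In a real implementation the mean and log-std come from the actor network evaluated at each state, and an autodiff framework differentiates the loss through `u`.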

Automatic entropy tuning:

\[ L(\alpha) = -\alpha \, \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi_\theta} \left[ \log \pi_\theta(a_t | s_t) + \bar{H} \right] \]

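where \(\bar{H}\) is the target entropy, \(-\dim(\mathcal{A})\) by default. Minimizing \(L(\alpha)\) raises \(\alpha\) when the policy's entropy is below \(\bar{H}\) and lowers it when the entropy is above.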
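A minimal sketch of one tuning step follows. It takes a gradient step on \(\log \alpha\) rather than \(\alpha\) so that the coefficient stays positive; that parameterization, the function name `alpha_step`, and the sample log-probabilities are assumptions for illustration and may differ from what rlox does internally.

```python
import math


def alpha_step(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on L(alpha) = -alpha * mean(log pi + target_entropy),
    taken with respect to log_alpha."""
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_term  # dL/d(log_alpha), using d(alpha)/d(log_alpha) = alpha
    return log_alpha - lr * grad


target_entropy = -1.0  # -dim(A) for a 1-D action space
log_alpha = math.log(0.2)

too_peaked = [2.0, 1.5, 2.5]   # entropy estimate -2.0, below the target
too_random = [0.0, -0.5, 0.5]  # entropy estimate 0.0, above the target

print(alpha_step(log_alpha, too_peaked, target_entropy) > log_alpha)  # True
print(alpha_step(log_alpha, too_random, target_entropy) < log_alpha)  # True
```

The log-probabilities are treated as constants here: only \(\alpha\) is updated by this loss, not the policy.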

Pseudocode

algorithm SAC:
    initialize actor pi_theta, twin critics Q_phi1, Q_phi2
    initialize target networks Q_phi1', Q_phi2'
    initialize replay buffer D, entropy coefficient alpha

    for step = 1, 2, ... do
        if step < learning_starts:
            a ~ Uniform(action_space)
        else:
            a ~ pi_theta(.|s)

        execute a in the environment, observe r, s', done
        store (s, a, r, s', done) in D
        s <- s' (reset the environment if done)

        if step >= learning_starts:
            sample minibatch (s, a, r, s', done) from D
            a' ~ pi_theta(.|s')

            # Critic update
            y = r + gamma * (1-done) * (min(Q_phi1'(s',a'), Q_phi2'(s',a')) - alpha * log pi(a'|s'))
            update phi_i to minimize (Q_phi_i(s,a) - y)^2 for i = 1, 2, treating y as a constant

            # Actor update
            a_new ~ pi_theta(.|s)
            update theta to minimize alpha * log pi(a_new|s) - min(Q_phi1(s,a_new), Q_phi2(s,a_new))

            # Entropy tuning (if auto_entropy)
            update alpha to minimize -alpha * (log pi(a_new|s) + target_entropy)

            # Target network update
            phi_i' <- tau * phi_i + (1-tau) * phi_i' for i = 1, 2
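The target network update in the last step is a Polyak (exponential moving) average. The sketch below applies it to flat lists of floats standing in for network parameters; `polyak_update` is a hypothetical helper, not an rlox function.

```python
def polyak_update(target, online, tau=0.005):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


online = [1.0, -2.0]  # current critic parameters (held fixed for this demo)
target = [0.0, 0.0]   # target critic parameters

target = polyak_update(target, online)
print(target)  # each entry has moved 0.5% of the way toward the online value

# With a fixed online network, the target converges to it geometrically.
for _ in range(2000):
    target = polyak_update(target, online)
print([round(w, 3) for w in target])  # [1.0, -2.0]
```

A small `tau` keeps the bootstrap target slowly moving, which is what stabilizes the critic regression.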

Quick Start

from rlox import Trainer

trainer = Trainer("sac", env="Pendulum-v1", seed=42)
metrics = trainer.train(total_timesteps=50_000)

For MuJoCo locomotion:

trainer = Trainer("sac", env="HalfCheetah-v4", seed=42, config={
    "learning_rate": 3e-4,
    "buffer_size": 1_000_000,
    "batch_size": 256,
    "tau": 0.005,
    "gamma": 0.99,
    "learning_starts": 5000,
})
metrics = trainer.train(total_timesteps=1_000_000)

Hyperparameters

All defaults from SACConfig:

Parameter	Default	Description
`learning_rate`	`3e-4`	Learning rate for all optimizers
`buffer_size`	`1_000_000`	Replay buffer capacity
`batch_size`	`256`	Minibatch size
`tau`	`0.005`	Polyak averaging coefficient for target networks
`gamma`	`0.99`	Discount factor
`target_entropy`	`None`	Target entropy; `None` selects \(-\dim(\mathcal{A})\) automatically
`auto_entropy`	`True`	Automatically tune entropy coefficient
`learning_starts`	`1000`	Steps of uniformly random actions collected before updates begin
`hidden`	`256`	Hidden layer width
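To illustrate how a `None` target entropy might be resolved, the sketch below merges overrides into the defaults listed in the table and fills in \(-\dim(\mathcal{A})\). `SAC_DEFAULTS` and `resolve_config` are hypothetical stand-ins written for this example; they are not the rlox `SACConfig` implementation, and the strict key check is an assumption.

```python
SAC_DEFAULTS = {
    "learning_rate": 3e-4,
    "buffer_size": 1_000_000,
    "batch_size": 256,
    "tau": 0.005,
    "gamma": 0.99,
    "target_entropy": None,
    "auto_entropy": True,
    "learning_starts": 1000,
    "hidden": 256,
}


def resolve_config(overrides, action_dim):
    """Merge overrides into the defaults and resolve target_entropy=None."""
    unknown = set(overrides) - set(SAC_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown SAC parameters: {sorted(unknown)}")
    config = {**SAC_DEFAULTS, **overrides}
    if config["target_entropy"] is None:
        config["target_entropy"] = -float(action_dim)
    return config


# HalfCheetah has a 6-dimensional action space.
cfg = resolve_config({"learning_starts": 5000}, action_dim=6)
print(cfg["target_entropy"], cfg["learning_starts"])  # -6.0 5000
```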

When to Use

Use SAC when: your action space is continuous and you want sample-efficient off-policy training or robust exploration via entropy regularization.
Do not use SAC when: your action space is discrete (use DQN), or you need an on-policy method, for example to analyze policy-gradient behavior directly (use PPO).

References

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML 2018.
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., ... & Levine, S. (2018). Soft Actor-Critic Algorithms and Applications. arXiv:1812.05905.