TD3 -- Twin Delayed DDPG
Intuition
TD3 extends DDPG with three key techniques to address overestimation bias and training instability: (1) twin Q-networks where the minimum is used for targets, (2) delayed policy updates so the critic stabilizes before the actor adapts, and (3) target policy smoothing that adds clipped noise to target actions. The result is a robust deterministic policy gradient algorithm for continuous control.
Key Equations
Twin Q-network target (take the minimum to combat overestimation):
\[
y = r + \gamma (1 - d) \min_{i=1,2} Q_{\phi_i'}(s', \tilde{a}')
\]
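As a minimal sketch of the target above, the plain-Python function below computes \(y\) for a single transition. The name `td3_target` and its scalar arguments are illustrative assumptions, not part of rlox; in a real implementation the two Q-values would come from the target critics evaluated at the smoothed action.

```python
def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double-Q target: y = r + gamma * (1 - d) * min(Q1', Q2').

    Illustrative helper (not an rlox function). `q1_next` and `q2_next`
    stand in for the two target-critic values at (s', a~').
    """
    # Using the smaller of the two estimates counteracts overestimation.
    min_q = min(q1_next, q2_next)
    # (1 - done) removes the bootstrap term on terminal transitions.
    return reward + gamma * (1.0 - float(done)) * min_q


y_mid = td3_target(reward=1.0, done=False, q1_next=10.0, q2_next=8.0)
y_end = td3_target(reward=1.0, done=True, q1_next=10.0, q2_next=8.0)
```

With these inputs the non-terminal target bootstraps from the smaller estimate (8.0, not 10.0), while the terminal target reduces to the reward alone.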
Target policy smoothing (regularize the target):
\[
\tilde{a}' = \pi_{\theta'}(s') + \text{clip}(\epsilon, -c, c), \quad \epsilon \sim \mathcal{N}(0, \sigma^2)
\]
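The smoothing step can be sketched in plain Python as follows. The helper name, the list-of-floats action representation, and the `low`/`high` action bounds are assumptions made for illustration; the final clip to the action bounds follows the pseudocode on this page and does not appear in the equation itself.

```python
import random


def smoothed_target_action(target_action, sigma=0.2, noise_clip=0.5,
                           low=-1.0, high=1.0, rng=random):
    """Return pi'(s') + clip(eps, -c, c), with eps ~ N(0, sigma^2) per dimension.

    Illustrative helper (not an rlox function). `target_action` is a list
    of floats standing in for the target actor's output.
    """
    smoothed = []
    for a in target_action:
        eps = rng.gauss(0.0, sigma)
        eps = max(-noise_clip, min(noise_clip, eps))  # clip the noise to [-c, c]
        a_tilde = max(low, min(high, a + eps))        # keep the action in bounds
        smoothed.append(a_tilde)
    return smoothed


rng = random.Random(0)  # seeded generator so the run is repeatable
samples = [smoothed_target_action([0.0, 0.9], rng=rng) for _ in range(1000)]
```

Because the noise is clipped to `noise_clip`, the first dimension never leaves [-0.5, 0.5], and the second never exceeds the assumed upper bound of 1.0 even though the unperturbed action is close to it.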
Critic loss:
\[
L(\phi_i) = \mathbb{E} \left[ (Q_{\phi_i}(s, a) - y)^2 \right]
\]
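A small sketch of this loss over a minibatch, using plain lists in place of tensors. The function name and the example numbers are assumptions for illustration; the point it shows is that both critics regress onto the same target \(y\).

```python
def critic_losses(q1_values, q2_values, targets):
    """Mean squared error of each critic against the shared targets y.

    Illustrative helper (not an rlox function); inputs are equal-length
    lists of floats, one entry per sampled transition.
    """
    n = len(targets)
    loss1 = sum((q - y) ** 2 for q, y in zip(q1_values, targets)) / n
    loss2 = sum((q - y) ** 2 for q, y in zip(q2_values, targets)) / n
    return loss1, loss2


loss1, loss2 = critic_losses(
    q1_values=[1.0, 2.0],
    q2_values=[0.0, 4.0],
    targets=[1.0, 3.0],
)
```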
Deterministic policy gradient (delayed: applied once every `policy_delay` critic updates; note that \(d\) in the target above is the done flag, not the delay):
\[
\nabla_\theta J(\theta) = \mathbb{E}_s \left[ \nabla_a Q_{\phi_1}(s, a) \big|_{a=\pi_\theta(s)} \nabla_\theta \pi_\theta(s) \right]
\]
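To make the chain rule concrete, here is a toy one-dimensional sketch. Everything in it is an assumption chosen so the gradient can be written by hand: a linear policy \(\pi_\theta(s) = \theta s\) and a fixed critic \(Q(s, a) = -(a - 2s)^2\), whose best action is \(a = 2s\). Gradient ascent on \(\theta\) should therefore approach 2.

```python
def dq_da(s, a):
    """Gradient of the toy critic Q(s, a) = -(a - 2s)^2 with respect to a."""
    return -2.0 * (a - 2.0 * s)


def policy_gradient(theta, states):
    """Average of dQ/da at a = pi_theta(s), times d pi_theta(s) / d theta."""
    total = 0.0
    for s in states:
        a = theta * s       # a = pi_theta(s)
        dpi_dtheta = s      # derivative of theta * s with respect to theta
        total += dq_da(s, a) * dpi_dtheta
    return total / len(states)


theta = 0.0
states = [-1.0, 0.5, 1.0, 2.0]
for _ in range(100):
    # Gradient ascent: move theta in the direction that increases Q.
    theta += 0.1 * policy_gradient(theta, states)
```

In TD3 the critic is learned rather than fixed, and the gradient is obtained by automatic differentiation through \(Q_{\phi_1}\), but the product structure is the same.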
Pseudocode
    algorithm TD3:
        initialize actor pi_theta, twin critics Q_phi1, Q_phi2
        initialize target networks pi_theta', Q_phi1', Q_phi2'
        initialize replay buffer D

        for step = 1, 2, ... do
            if step < learning_starts:
                a ~ Uniform(action_space)
            else:
                a = pi_theta(s) + N(0, exploration_noise)
                clip a to action bounds
            execute a, observe r, s', done
            store (s, a, r, s', done) in D

            if step >= learning_starts:
                sample minibatch from D

                # Target with smoothing
                a' = pi_theta'(s') + clip(N(0, target_noise), -noise_clip, noise_clip)
                clip a' to action bounds
                y = r + gamma * (1-done) * min(Q_phi1'(s',a'), Q_phi2'(s',a'))

                # Critic update
                update phi1, phi2 to minimize (Q_phi_i(s,a) - y)^2

                # Delayed actor update
                if step % policy_delay == 0:
                    update theta to maximize Q_phi1(s, pi_theta(s))
                    soft update: theta' <- tau*theta + (1-tau)*theta'
                    soft update: phi_i' <- tau*phi_i + (1-tau)*phi_i'
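The soft (Polyak) update in the last two lines of the pseudocode can be sketched as below. Parameters are represented as flat lists of floats purely for illustration; the function name is an assumption and not an rlox API.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.

    Illustrative helper (not an rlox function). Returns a new list and
    leaves both inputs unmodified.
    """
    return [
        tau * w + (1.0 - tau) * w_target
        for w, w_target in zip(online_params, target_params)
    ]


target = [0.0, 0.0]
online = [1.0, -2.0]
target = soft_update(target, online)  # one small step toward the online weights
```

With `tau = 0.005` each update moves the target only 0.5% of the way toward the online network, which is what keeps the bootstrapped targets slowly changing.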
Quick Start
    from rlox import Trainer

    trainer = Trainer("td3", env="Pendulum-v1", seed=42)
    metrics = trainer.train(total_timesteps=50_000)
For MuJoCo locomotion:
    from rlox import Trainer

    trainer = Trainer("td3", env="HalfCheetah-v4", seed=42, config={
        "learning_rate": 3e-4,
        "buffer_size": 1_000_000,
        "learning_starts": 10_000,  # overrides the default of 1000
        "batch_size": 256,
        "policy_delay": 2,
        "target_noise": 0.2,
        "noise_clip": 0.5,
        "exploration_noise": 0.1,
    })
    metrics = trainer.train(total_timesteps=1_000_000)
Hyperparameters
All defaults from TD3Config:
| Parameter | Default | Description |
|---|---|---|
| `learning_rate` | `3e-4` | Adam learning rate for actor and critic |
| `buffer_size` | `1_000_000` | Replay buffer capacity |
| `batch_size` | `256` | Minibatch size |
| `tau` | `0.005` | Polyak averaging coefficient |
| `gamma` | `0.99` | Discount factor |
| `learning_starts` | `1000` | Random exploration steps before training |
| `policy_delay` | `2` | Actor update frequency relative to critic |
| `target_noise` | `0.2` | Noise std added to target actions |
| `noise_clip` | `0.5` | Clipping range for target noise |
| `exploration_noise` | `0.1` | Std of Gaussian exploration noise |
| `hidden` | `256` | Hidden layer width |
| `n_envs` | `1` | Number of parallel environments |
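To see how `learning_starts` and `policy_delay` interact, the sketch below counts gradient updates for a run. It assumes the schedule shown in the pseudocode (one critic update per step once `step >= learning_starts`, an actor update whenever `step % policy_delay == 0`) and a single environment; rlox's actual trainer may schedule updates differently, so treat the numbers as a rough guide only.

```python
def update_counts(total_timesteps, learning_starts=1000, policy_delay=2):
    """Count critic and actor updates for steps 1..total_timesteps.

    Illustrative helper (not an rlox function), following the pseudocode's
    schedule with n_envs = 1.
    """
    if total_timesteps < learning_starts:
        return 0, 0
    # Every step from learning_starts to total_timesteps updates the critics.
    critic = total_timesteps - learning_starts + 1
    # Of those steps, only multiples of policy_delay update the actor.
    actor = (total_timesteps // policy_delay
             - (learning_starts - 1) // policy_delay)
    return critic, actor


# Defaults from the table with the 50_000-step Pendulum run from Quick Start.
critic_updates, actor_updates = update_counts(50_000)
```

Under these assumptions the Pendulum run performs 49,001 critic updates and 24,501 actor updates, so the actor sees roughly half as many gradient steps as the critics.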
When to Use
- Use TD3 when: you have continuous actions, want deterministic policies, and need sample-efficient off-policy training.
- Do not use TD3 when: you want stochastic exploration (use SAC), your actions are discrete (use DQN), or you prefer on-policy simplicity (use PPO).
References
- Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML 2018.
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., et al. (2015). Continuous control with deep reinforcement learning. arXiv:1509.02971 (DDPG).