TD3 — Twin Delayed Deep Deterministic Policy Gradient¶
Fujimoto et al., "Addressing Function Approximation Error in Actor-Critic Methods," ICML, 2018.
Key Idea¶
TD3 identifies three failure modes in DDPG (overestimation bias, high variance targets, policy-value coupling) and addresses them with three techniques: (1) twin Q-networks with minimum used for targets (clipped double-Q), (2) delayed policy updates (update policy less frequently than critics), and (3) target policy smoothing (add noise to target actions as a regularizer).
Mathematical Formulation¶
Critic loss (twin Q-networks):
L(φ_i) = E_{(s,a,r,s')~D} [ (Q_{φ_i}(s,a) - y)² ] for i = 1, 2
y = r + γ · min_{i=1,2} Q_{φ_i'}(s', π_{θ'}(s') + ε)
ε ~ clip(N(0, σ), -c, c) (target policy smoothing)
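The target computation above (clipped double-Q plus target policy smoothing) can be sketched as follows. This is a minimal NumPy sketch with scalar actions; `td3_target`, the critic callables, and `act_limit` are illustrative names, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, q1_target, q2_target, actor_target,
               gamma=0.99, sigma=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute the TD3 target y = r + γ · min_i Q_{φ_i'}(s', π_{θ'}(s') + ε)."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    eps = np.clip(rng.normal(0.0, sigma, size=s_next.shape[0]),
                  -noise_clip, noise_clip)
    a_next = np.clip(actor_target(s_next) + eps, -act_limit, act_limit)
    # Clipped double-Q: take the minimum of the two target critics.
    q_min = np.minimum(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min
```

For vector-valued actions the noise would be sampled per action dimension; the minimum over the two critics is what counteracts overestimation bias.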
Actor loss (deterministic policy gradient, applied every d steps):
∇_θ J(θ) = E_{s~D} [ ∇_a Q_{φ_1}(s, a) |_{a=π_θ(s)} · ∇_θ π_θ(s) ]
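The delayed-update schedule and the accompanying Polyak target updates can be sketched as follows (a minimal NumPy sketch; `polyak` and `should_update_actor` are illustrative helper names, not from the paper):

```python
import numpy as np

def polyak(target, source, tau=0.005):
    """Soft target update θ' ← τ·θ + (1 − τ)·θ', element-wise, in place."""
    for k in target:
        target[k] = tau * source[k] + (1.0 - tau) * target[k]

def should_update_actor(step, d=2):
    """Delayed policy updates: the critics train every step, but the actor
    and all target networks are refreshed only every d steps."""
    return step % d == 0
```

Delaying the actor lets the critics' value estimates stabilize before each policy update, which reduces the policy-value coupling identified in the paper.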
Properties¶
- Off-policy, model-free
- Actor-critic with deterministic policy
- Continuous action spaces
Key Hyperparameters¶
| Parameter | Typical Value | Notes |
|---|---|---|
| γ | 0.99 | Discount |
| τ | 0.005 | Target net Polyak averaging |
| Policy delay d | 2 | Actor update frequency |
| Target noise σ | 0.2 | Smoothing noise std |
| Noise clip c | 0.5 | Smoothing noise bound |
| Exploration noise | N(0, 0.1) | During data collection |
| Replay buffer | 1e6 | |
| Batch size | 256 | |
| Learning rate | 3e-4 | |
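The typical values in the table above can be grouped into a single configuration object. A minimal sketch; `TD3Config` and its field names are illustrative, not a library API:

```python
from dataclasses import dataclass

@dataclass
class TD3Config:
    """Typical TD3 hyperparameters (illustrative defaults from the table)."""
    gamma: float = 0.99         # discount γ
    tau: float = 0.005          # Polyak averaging coefficient τ
    policy_delay: int = 2       # actor update frequency d
    target_noise: float = 0.2   # smoothing noise std σ
    noise_clip: float = 0.5     # smoothing noise bound c
    expl_noise: float = 0.1     # std of exploration noise N(0, 0.1)
    buffer_size: int = 1_000_000
    batch_size: int = 256
    lr: float = 3e-4            # shared by actor and critics
```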
Complexity¶
- Comparable to SAC; slightly cheaper (no entropy computation, no log-prob)
- 6 networks total (actor, 2 critics, plus a target copy of each)
- Memory dominated by replay buffer
Primary Use Cases¶
- Continuous control benchmarks (MuJoCo)
- Robotics tasks where deterministic policies are preferred
- Environments where SAC's Gaussian stochasticity is undesirable
Known Limitations¶
- Deterministic policy — no inherent exploration (relies on additive Gaussian noise)
- Often slightly underperforms SAC on standard continuous-control benchmarks
- Sensitive to exploration noise scale
- Brittle in sparse reward environments
- Delay hyperparameter d requires per-domain tuning
Major Variants¶
| Variant | Reference | Key Change |
|---|---|---|
| DDPG | Lillicrap et al., ICLR 2016 | Predecessor (no twin critics/delay) |
| TD7 | Fujimoto et al., ICML 2023 | Learned representations + checkpoint critic |
Relationship to Other Algorithms¶
- SAC is the main competitor; generally preferred for automatic exploration
- Shares twin Q-network idea with SAC (SAC adopted it from TD3)
- DDPG is the direct predecessor
- TD3, like DDPG, builds on the Deterministic Policy Gradient theorem (Silver et al., 2014)
Industry Deployment¶
- Robotics research (simpler than SAC when entropy tuning is problematic)
- Available in SB3, CleanRL, RLlib
- Less common in production than SAC or PPO