
TD3 — Twin Delayed Deep Deterministic Policy Gradient

Fujimoto et al., "Addressing Function Approximation Error in Actor-Critic Methods," ICML, 2018.

Key Idea

TD3 identifies three failure modes in DDPG (overestimation bias, high variance targets, policy-value coupling) and addresses them with three techniques: (1) twin Q-networks with minimum used for targets (clipped double-Q), (2) delayed policy updates (update policy less frequently than critics), and (3) target policy smoothing (add noise to target actions as a regularizer).

Mathematical Formulation

Critic loss (twin Q-networks):

L(φ_i) = E_{(s,a,r,s')~D} [ (Q_{φ_i}(s,a) - y)² ]   for i = 1, 2

y = r + γ · min_{i=1,2} Q_{φ_i'}(s', π_{θ'}(s') + ε)
ε ~ clip(N(0, σ), -c, c)     (target policy smoothing)
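The target computation above can be sketched in a few lines. This is a minimal NumPy illustration, not a production implementation: `q1_target`, `q2_target`, and `pi_target` are placeholder callables standing in for the target networks, and `act_limit` assumes a symmetric action range.

```python
import numpy as np

rng = np.random.default_rng(0)

def td3_target(r, s_next, q1_target, q2_target, pi_target,
               gamma=0.99, sigma=0.2, c=0.5, act_limit=1.0):
    """Clipped double-Q target with target policy smoothing (sketch).

    q1_target / q2_target / pi_target are placeholder callables for the
    target critics and target actor.
    """
    # Target action from the target actor, perturbed by clipped Gaussian noise
    a_next = pi_target(s_next)
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(a_next)), -c, c)
    a_next = np.clip(a_next + eps, -act_limit, act_limit)
    # Clipped double-Q: take the minimum of the two target critics
    q_min = np.minimum(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * q_min
```

Taking the minimum over the two critics is what counteracts the overestimation bias that single-critic DDPG suffers from.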

Actor loss (deterministic policy gradient, delayed):

L(θ) = -E_{s~D} [ Q_{φ_1}(s, π_θ(s)) ]
Updated every d steps (typically d=2)
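The update schedule can be made concrete with a short sketch. The callables here (`update_critics`, `update_actor`, `polyak_update`) are placeholders for the actual gradient steps, assumed to be defined elsewhere:

```python
def train_step(step, update_critics, update_actor, polyak_update,
               policy_delay=2):
    """One TD3 training iteration (sketch).

    Critics are updated every step; the actor and the target networks
    are updated only every `policy_delay` steps.
    """
    update_critics()                 # both twin critics, every step
    if step % policy_delay == 0:
        update_actor()               # deterministic policy gradient through Q_phi1
        polyak_update()              # tau-averaged target network update
```

Updating the actor less frequently lets the critics settle before the policy chases their estimates, which is the point of the delay.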

Properties

  • Off-policy, model-free
  • Actor-critic with deterministic policy
  • Continuous action spaces

Key Hyperparameters

Parameter            Typical Value   Notes
γ                    0.99            Discount factor
τ                    0.005           Target net Polyak averaging
Policy delay d       2               Actor update frequency
Target noise σ       0.2             Smoothing noise std
Noise clip c         0.5             Smoothing noise bound
Exploration noise    N(0, 0.1)       Added during data collection
Replay buffer        1e6             Transition capacity
Batch size           256             Per gradient update
Learning rate        3e-4            Actor and critics
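The table above maps naturally onto a config plus the Polyak update it references. This is an illustrative sketch; the names are this example's own, not from any specific library:

```python
# Illustrative defaults mirroring the hyperparameter table
TD3_DEFAULTS = dict(
    gamma=0.99,           # discount factor
    tau=0.005,            # Polyak averaging coefficient
    policy_delay=2,       # actor update frequency
    target_noise=0.2,     # smoothing noise std
    noise_clip=0.5,       # smoothing noise bound
    expl_noise=0.1,       # exploration noise std during collection
    buffer_size=int(1e6), # replay buffer capacity
    batch_size=256,
    lr=3e-4,
)

def polyak(target_params, source_params, tau=0.005):
    """theta_target <- tau * theta + (1 - tau) * theta_target.

    Parameters are plain lists of floats here for illustration; in
    practice this runs over network weight tensors.
    """
    return [tau * s + (1 - tau) * t
            for s, t in zip(source_params, target_params)]
```

With τ = 0.005 the target networks track the online networks slowly, which keeps the bootstrap targets in the critic loss stable.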

Complexity

  • Comparable to SAC; slightly cheaper (no entropy computation, no log-prob)
  • 6 networks total (actor and 2 critics, each with a target copy)
  • Memory dominated by replay buffer

Primary Use Cases

  • Continuous control benchmarks (MuJoCo)
  • Robotics tasks where deterministic policies are preferred
  • Environments where SAC's Gaussian stochasticity is undesirable

Known Limitations

  1. Deterministic policy — no inherent exploration (relies on additive Gaussian noise)
  2. Typically slightly behind SAC on standard continuous-control benchmarks
  3. Sensitive to exploration noise scale
  4. Brittle in sparse reward environments
  5. Delay hyperparameter d requires per-domain tuning

Major Variants

Variant   Reference                      Key Change
DDPG      Lillicrap et al., ICLR 2016    Predecessor (no twin critics or delay)
TD7       Fujimoto et al., ICML 2023     Learned representations + checkpoint critic

Relationship to Other Algorithms

  • SAC is the main competitor; generally preferred for automatic exploration
  • Shares twin Q-network idea with SAC (SAC adopted it from TD3)
  • DDPG is the direct predecessor
  • TD3, like DDPG, builds on the Deterministic Policy Gradient theorem (Silver et al., 2014); SAC instead derives from the maximum-entropy RL framework

Industry Deployment

  • Robotics research (simpler than SAC when entropy tuning is problematic)
  • Available in SB3, CleanRL, RLlib
  • Less common in production than SAC or PPO