TD3+BC -- A Minimalist Approach to Offline RL
Intuition
TD3+BC takes a deliberately minimal approach to offline RL: it adds a behavioral cloning (BC) term to the TD3 actor loss. The BC regularizer penalizes the policy for deviating from dataset actions, which keeps it from exploiting overestimated Q-values on out-of-distribution actions. The Q-value term is scaled by alpha divided by the mean absolute Q-value of the minibatch, so the balance between maximizing Q-values and staying close to the data does not depend on the reward scale of the task. Despite its simplicity, the original paper reports results on the D4RL benchmarks that are competitive with much more complex offline RL algorithms.
Key Equations
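The actor loss combines Q-value maximization with BC regularization, with expectations taken over dataset transitions $(s, a)$:

$$
L(\pi) = -\lambda \, \mathbb{E}_{(s, a) \sim D}\big[ Q_1(s, \pi(s)) \big] + \mathbb{E}_{(s, a) \sim D}\big[ (\pi(s) - a)^2 \big]
$$

where the balancing coefficient is computed from a minibatch of $N$ dataset transitions $(s_i, a_i)$:

$$
\lambda = \frac{\alpha}{\frac{1}{N} \sum_{i=1}^{N} \lvert Q_1(s_i, a_i) \rvert}
$$

The critic update follows standard TD3 with clipped double Q-learning and target policy smoothing. Both critics regress toward the same target $y$:

$$
y = r + \gamma \, (1 - d) \, \min_{j \in \{1, 2\}} Q_j^{\text{target}}(s', \tilde{a}), \qquad
\tilde{a} = \operatorname{clip}\big( \pi^{\text{target}}(s') + \operatorname{clip}(\epsilon, -c, c), \, -a_{\max}, \, a_{\max} \big), \qquad
\epsilon \sim \mathcal{N}(0, \sigma^2)
$$

Here $d$ is the done flag, $\sigma$ is the target noise standard deviation, $c$ is the noise clipping bound, and $a_{\max}$ is the action bound.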
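To make the role of the balancing coefficient concrete, the following is a minimal NumPy sketch of the actor loss for one minibatch. It is illustrative only and not taken from the rlox source: the function name `td3_bc_actor_loss` and the random inputs are assumptions, the Q-values are passed in as plain arrays instead of coming from a critic network, and the BC term is assumed to be a mean over all action entries.

```python
import numpy as np


def td3_bc_actor_loss(q_pi, q_data, pi_actions, data_actions, alpha=2.5):
    """Return (loss, lam) for one minibatch (hypothetical helper, not rlox API).

    q_pi:         Q1(s, pi(s)) per sample, shape (batch,)
    q_data:       Q1(s, a_data) per sample, shape (batch,)
    pi_actions:   pi(s), shape (batch, act_dim)
    data_actions: dataset actions, shape (batch, act_dim)
    """
    # Balancing coefficient: alpha divided by the mean absolute Q-value.
    # In an autograd framework this would be treated as a constant.
    lam = alpha / np.mean(np.abs(q_data))
    # Behavioral cloning term: mean squared error over all action entries.
    bc = np.mean((pi_actions - data_actions) ** 2)
    loss = -lam * np.mean(q_pi) + bc
    return loss, lam


# Synthetic inputs, made up purely for illustration.
rng = np.random.default_rng(0)
q_data = rng.uniform(50.0, 150.0, size=256)
q_pi = q_data + rng.normal(0.0, 5.0, size=256)
pi_actions = rng.uniform(-1.0, 1.0, size=(256, 6))
data_actions = rng.uniform(-1.0, 1.0, size=(256, 6))

loss, lam = td3_bc_actor_loss(q_pi, q_data, pi_actions, data_actions)

# Rescaling every Q-value (for example, a task with 100x larger rewards)
# shrinks lambda by the same factor, so the loss value is unchanged.
loss_scaled, lam_scaled = td3_bc_actor_loss(
    100 * q_pi, 100 * q_data, pi_actions, data_actions
)
```

Because `lam` absorbs the Q-value scale, the same `alpha` can be reused across tasks whose rewards differ by orders of magnitude.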
Pseudocode

    algorithm TD3+BC:
        initialize actor pi, twin critics Q1, Q2, target networks
        load offline dataset D
        for update = 1 to n_updates do
            sample minibatch from D

            # Critic update (same as TD3)
            target_noise = clip(N(0, sigma), -c, c)
            a' = clip(pi_target(s') + target_noise, -a_max, a_max)
            y = r + gamma * (1 - done) * min(Q1_target(s', a'), Q2_target(s', a'))
            update Q1, Q2 with MSE(Q(s,a), y)

            # Actor update (every policy_delay steps)
            if update % policy_delay == 0:
                lambda = alpha / mean(|Q1(s, a_data)|)
                L = -lambda * mean(Q1(s, pi(s))) + MSE(pi(s), a_data)
                update pi

            # Soft target updates
            polyak_update(all targets, tau)

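The critic-side steps of the pseudocode can be sketched in NumPy as follows. This is a hedged illustration and not the rlox implementation: the helper names `smoothed_target_action`, `td3_target`, and `polyak_update` are hypothetical, target-network outputs are passed in as plain arrays, and parameters are represented as a dict of arrays.

```python
import numpy as np


def smoothed_target_action(pi_target_action, rng, sigma=0.2, c=0.5, a_max=1.0):
    """Target policy smoothing: add clipped Gaussian noise, then clip to bounds."""
    noise = np.clip(rng.normal(0.0, sigma, size=pi_target_action.shape), -c, c)
    return np.clip(pi_target_action + noise, -a_max, a_max)


def td3_target(r, done, q1_next, q2_next, gamma=0.99):
    """Clipped double Q-learning target: bootstrap from the smaller target critic."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)


def polyak_update(target, online, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    for name in target:
        target[name] = (1.0 - tau) * target[name] + tau * online[name]


rng = np.random.default_rng(0)

# Smoothed next actions stay inside [-a_max, a_max].
a_next = smoothed_target_action(np.full((4, 6), 0.9), rng)

# Made-up rewards, done flags, and target-critic outputs for four transitions.
r = np.array([1.0, 0.5, 0.0, 2.0])
done = np.array([0.0, 0.0, 1.0, 0.0])
q1_next = np.array([10.0, 8.0, 7.0, 3.0])
q2_next = np.array([9.0, 8.5, 6.0, 4.0])
y = td3_target(r, done, q1_next, q2_next)
# y is [9.91, 8.42, 0.0, 4.97]; the terminal transition does not bootstrap.

target = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
polyak_update(target, online)
# target["w"] is now 0.005 in every entry.

# With policy_delay = 2, the actor is updated on every second step.
actor_steps = [u for u in range(1, 11) if u % 2 == 0]
```

Taking the minimum of the two target critics biases the target downward, which counteracts overestimation; the small `tau` keeps the targets changing slowly.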
Quick Start
TD3+BC uses the offline algorithm interface:

    from rlox.offline import OfflineDatasetBuffer
    from rlox.algorithms.td3_bc import TD3BC

    # Load a D4RL dataset into an offline buffer
    dataset = OfflineDatasetBuffer.from_d4rl("halfcheetah-medium-v2")

    agent = TD3BC(
        dataset=dataset,
        obs_dim=17,  # HalfCheetah observation dimension
        act_dim=6,   # HalfCheetah action dimension
        alpha=2.5,
    )
    metrics = agent.train(n_updates=100_000)

Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `alpha` | 2.5 | Numerator of the Q-term weight `lambda` (lower = stronger BC regularization, more conservative) |
| `hidden` | 256 | Hidden layer width |
| `learning_rate` | 3e-4 | Learning rate |
| `tau` | 0.005 | Soft target update rate |
| `gamma` | 0.99 | Discount factor |
| `policy_delay` | 2 | Actor update frequency (every N critic updates) |
| `target_noise` | 0.2 | Target policy smoothing noise std |
| `noise_clip` | 0.5 | Clipping bound for the target noise (noise is clipped to [-0.5, 0.5] by default) |
| `act_high` | 1.0 | Action space upper bound |
| `batch_size` | 256 | Minibatch size |
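When experimenting with these settings, it can help to keep them in one validated object. The sketch below is a hypothetical container, not part of rlox: the class name `TD3BCConfig` and its validation rules are assumptions, and only the field names and defaults come from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TD3BCConfig:
    """Hypothetical holder for the table defaults; not an rlox class."""

    alpha: float = 2.5
    hidden: int = 256
    learning_rate: float = 3e-4
    tau: float = 0.005
    gamma: float = 0.99
    policy_delay: int = 2
    target_noise: float = 0.2
    noise_clip: float = 0.5
    act_high: float = 1.0
    batch_size: int = 256

    def __post_init__(self):
        # Basic sanity checks (assumed ranges, chosen for illustration).
        if self.alpha <= 0:
            raise ValueError("alpha must be positive")
        if not 0 < self.tau <= 1:
            raise ValueError("tau must be in (0, 1]")
        if not 0 <= self.gamma <= 1:
            raise ValueError("gamma must be in [0, 1]")
        if self.policy_delay < 1:
            raise ValueError("policy_delay must be at least 1")
        if self.target_noise < 0 or self.noise_clip < 0:
            raise ValueError("noise parameters must be non-negative")


default = TD3BCConfig()
# Derive a variant without mutating the defaults.
tuned = replace(default, alpha=1.0)
```

A frozen dataclass keeps each run's settings immutable, so a sweep over `alpha` cannot accidentally modify a shared configuration.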
When to Use
- Use TD3+BC when: you want a dead-simple offline RL baseline for continuous action spaces that is easy to implement and tune.
- Prefer TD3+BC over CQL/IQL when: simplicity and reproducibility matter more than squeezing out extra performance.
- Do not use TD3+BC when: the offline dataset is highly suboptimal (the BC term will anchor the policy too close to bad data), or you need discrete actions.
References
- Fujimoto, S. & Gu, S. S. (2021). A Minimalist Approach to Offline Reinforcement Learning. NeurIPS 2021. arXiv:2106.06860.