DQN -- Deep Q-Network¶
Intuition¶
DQN approximates the optimal action-value function \(Q^*(s, a)\) with a neural network and selects actions greedily. Experience replay and a target network stabilize training. rlox's DQN includes Rainbow extensions: Double DQN (reduced overestimation), Dueling architecture (separate value and advantage streams), N-step returns, and Prioritized Experience Replay (PER).
Key Equations¶
The Bellman optimality target, computed with the target network \(Q_{\phi'}\):

\[
y = r + \gamma \max_{a'} Q_{\phi'}(s', a')
\]
Double DQN decouples action selection (online network) from evaluation (target network) to reduce overestimation:

\[
y = r + \gamma \, Q_{\phi'}\!\left(s', \arg\max_{a'} Q_{\phi}(s', a')\right)
\]
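A minimal numpy sketch of the two-network target (an illustration of the equation above, not rlox's internal implementation; `double_dqn_target` is a hypothetical helper name):

```python
import numpy as np

def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    # Online network selects the greedy next action...
    a_star = np.argmax(q_online_next)
    # ...and the target network evaluates it. Terminal states get no bootstrap.
    return r + gamma * (1.0 - done) * q_target_next[a_star]

# Online net prefers action 1; target net evaluates that action (2.0):
y = double_dqn_target(1.0, 0.9, np.array([1.0, 3.0, 2.0]),
                      np.array([0.5, 2.0, 4.0]), done=0.0)
# y = 1.0 + 0.9 * 2.0 = 2.8
```

Note that a plain max over `q_target_next` would have picked 4.0 here; using the online network's argmax is exactly what damps overestimation.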
N-step returns extend the target horizon:

\[
y = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \, (1 - d_{t+n}) \max_{a'} Q_{\phi'}(s_{t+n}, a')
\]

where \(d_{t+n}\) is the terminal flag (1 if the episode ended within the window, 0 otherwise).
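The n-step sum can be sketched in plain Python (a standalone illustration of the formula; `n_step_target` is a hypothetical helper, not rlox's API):

```python
def n_step_target(rewards, dones, bootstrap_q, gamma=0.99):
    """Discounted n-step return, truncated at a terminal, plus bootstrap.

    rewards/dones are the n transitions starting at time t;
    bootstrap_q stands in for max_a Q_target(s_{t+n}, a).
    """
    g, discount, terminated = 0.0, 1.0, 0.0
    for r, d in zip(rewards, dones):
        g += discount * r
        discount *= gamma
        if d:  # episode ended inside the window: no bootstrap term
            terminated = 1.0
            break
    return g + discount * (1.0 - terminated) * bootstrap_q

# 3-step return with gamma=0.9 and bootstrap 5.0:
# 1 + 0.9 + 0.81 + 0.729 * 5 = 6.355
y = n_step_target([1.0, 1.0, 1.0], [False, False, False], 5.0, gamma=0.9)
```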
Dueling architecture decomposes Q into value and advantage streams, subtracting the mean advantage to keep the decomposition identifiable:

\[
Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')
\]
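The stream-combination step is a one-liner; a numpy sketch of the aggregation (just the equation above, not rlox's network code):

```python
import numpy as np

def dueling_q(value, advantages):
    # Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
    # Subtracting the mean pins down V and A, which are otherwise
    # only identified up to an additive constant.
    return value + advantages - advantages.mean()

q = dueling_q(2.0, np.array([1.0, -1.0, 0.0]))
# mean advantage is 0, so q = [3.0, 1.0, 2.0]
```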
Prioritized Experience Replay samples transition \(i\) with probability proportional to its TD error:

\[
P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}, \qquad p_i = |\delta_i| + \epsilon
\]

with importance-sampling weights \(w_i = (N \cdot P(i))^{-\beta}\), normalized by \(\max_i w_i\), to correct the bias introduced by non-uniform sampling.
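A numpy sketch of the proportional-priority variant (illustrative only; `per_probs_and_weights` is a hypothetical helper, and `eps` is the small constant that keeps zero-error transitions sampleable):

```python
import numpy as np

def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    # Priorities from absolute TD errors, sharpened by alpha.
    p = (np.abs(td_errors) + eps) ** alpha
    probs = p / p.sum()
    # Importance-sampling weights correct the non-uniform sampling bias;
    # dividing by the max keeps every weight <= 1 for stable loss scaling.
    n = len(td_errors)
    w = (n * probs) ** (-beta)
    return probs, w / w.max()

probs, w = per_probs_and_weights(np.array([0.5, 0.1, 2.0]))
# The transition with |delta| = 2.0 is sampled most often, and the
# rarely sampled |delta| = 0.1 transition gets the largest weight.
```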
Pseudocode¶
```
algorithm DQN:
    initialize Q-network Q_phi, target network Q_phi' <- Q_phi
    initialize replay buffer D (uniform or prioritized)
    for step = 1, 2, ... do
        with probability epsilon: a = random action
        else: a = argmax_a Q_phi(s, a)
        execute a, observe r, s', done
        store (s, a, r, s', done) in D
        if step >= learning_starts:
            sample minibatch from D (with PER weights w_i if enabled)
            if double_dqn:
                a* = argmax_a Q_phi(s', a)
                y = r + gamma^n * (1 - done) * Q_phi'(s', a*)
            else:
                y = r + gamma^n * (1 - done) * max_a Q_phi'(s', a)
            loss = mean(w_i * (Q_phi(s, a) - y)^2)
            update phi with Adam
            if prioritized: update priorities in D from the new TD errors
        every target_update_freq steps:
            phi' <- phi
        decay epsilon from initial_eps to final_eps
```
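The last line's schedule is a linear anneal over the first `exploration_fraction` of training. A sketch under that assumption (`linear_epsilon` is an illustrative helper whose defaults mirror the hyperparameter table below, not rlox's internal API):

```python
def linear_epsilon(step, total_steps, initial_eps=1.0, final_eps=0.05,
                   exploration_fraction=0.1):
    # Anneal linearly from initial_eps to final_eps over the first
    # exploration_fraction * total_steps steps, then hold at final_eps.
    decay_steps = exploration_fraction * total_steps
    frac = min(step / decay_steps, 1.0)
    return initial_eps + frac * (final_eps - initial_eps)

# With 100k total steps, epsilon reaches 0.05 by step 10k and stays there.
eps_mid = linear_epsilon(5_000, 100_000)   # halfway through the decay: 0.525
```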
Quick Start¶
```python
from rlox import Trainer

trainer = Trainer("dqn", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)
```
With Rainbow extensions:
```python
trainer = Trainer("dqn", env="LunarLander-v3", seed=42, config={
    "double_dqn": True,
    "dueling": True,
    "prioritized": True,
    "n_step": 3,
    "learning_rate": 6.3e-4,
    "buffer_size": 100_000,
    "batch_size": 128,
    "exploration_final_eps": 0.02,
})
metrics = trainer.train(total_timesteps=200_000)
```
Hyperparameters¶
All defaults from DQNConfig:
| Parameter | Default | Description |
|---|---|---|
| `learning_rate` | `1e-4` | Adam learning rate |
| `buffer_size` | `1_000_000` | Replay buffer capacity |
| `batch_size` | `64` | Minibatch size |
| `gamma` | `0.99` | Discount factor |
| `target_update_freq` | `1000` | Steps between hard target network updates |
| `exploration_fraction` | `0.1` | Fraction of training over which epsilon decays |
| `exploration_initial_eps` | `1.0` | Starting epsilon |
| `exploration_final_eps` | `0.05` | Final epsilon after decay |
| `learning_starts` | `1000` | Random exploration steps before training begins |
| `double_dqn` | `True` | Use Double DQN action selection |
| `dueling` | `False` | Use Dueling network architecture |
| `n_step` | `1` | N-step return horizon |
| `prioritized` | `False` | Use Prioritized Experience Replay |
| `alpha` | `0.6` | PER priority exponent |
| `beta_start` | `0.4` | PER initial importance-sampling exponent |
| `hidden` | `256` | Hidden layer width |
When to Use¶
- Use DQN when: your action space is discrete and you want sample-efficient off-policy training with replay.
- Do not use DQN when: your action space is continuous (use SAC or TD3) or you want a simpler on-policy method (use PPO).
References¶
- Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
- van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI.
- Wang, Z., Schaul, T., Hessel, M., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML.
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. ICLR.