DQN -- Deep Q-Network

Intuition

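DQN approximates the optimal action-value function \(Q^*(s, a)\) with a neural network and acts epsilon-greedily with respect to it: a random action with probability epsilon, the highest-valued action otherwise. Experience replay breaks the correlation between consecutive samples, and a periodically synced target network keeps the bootstrap target from shifting on every gradient step; together they stabilize training. rlox's DQN includes several Rainbow extensions: Double DQN (reduced overestimation), Dueling architecture (separate value and advantage streams), N-step returns, and Prioritized Experience Replay (PER).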

Key Equations

The Bellman optimality target, with discount factor \(\gamma\) and target-network parameters \(\phi'\):

\[ y_t = r_t + \gamma (1 - d_t) \max_{a'} Q_{\phi'}(s_{t+1}, a') \]

Double DQN decouples action selection from evaluation to reduce overestimation:

\[ y_t = r_t + \gamma (1 - d_t) \, Q_{\phi'}(s_{t+1}, \arg\max_{a'} Q_\phi(s_{t+1}, a')) \]
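The difference between the two targets is easiest to see on concrete numbers. The following is a minimal sketch in plain Python, not rlox's implementation; the function names and the Q-values are hypothetical and chosen only for illustration.

```python
def dqn_target(reward, done, q_target_next, gamma=0.99):
    """Standard target: the target network both selects and evaluates a'."""
    return reward + gamma * (1.0 - done) * max(q_target_next)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: the online network selects a', the target network evaluates it."""
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * (1.0 - done) * q_target_next[best_action]


# Hypothetical Q-values for s_{t+1} with three actions.
q_online_next = [1.0, 2.0, 0.5]   # online network prefers action 1
q_target_next = [1.5, 1.2, 3.0]   # target network's largest value is action 2

y_standard = dqn_target(1.0, 0.0, q_target_next)
y_double = double_dqn_target(1.0, 0.0, q_online_next, q_target_next)
y_terminal = double_dqn_target(1.0, 1.0, q_online_next, q_target_next)
```

Here the standard target bootstraps from the target network's largest value (3.0), while Double DQN bootstraps from the target network's value for the action the online network picked (1.2), so `y_double` is never larger than `y_standard`. On a terminal transition the bootstrap term vanishes and the target is just the reward.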

N-step returns extend the target horizon:

\[ y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n (1 - d_{t+n}) \max_{a'} Q_{\phi'}(s_{t+n}, a') \]

where \(d_{t+n}\) is the terminal flag (1 if the episode has ended by step \(t+n\), 0 otherwise).
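A small sketch of how such a target could be accumulated from a window of transitions. The function name and argument layout are assumptions for illustration, and stopping the sum early when the episode ends inside the window is a common convention that rlox may or may not follow.

```python
def n_step_target(rewards, dones, bootstrap_value, gamma=0.99):
    """Discounted n-step reward sum plus a bootstrap term.

    rewards[k] and dones[k] belong to the transition taken at step t+k, so
    dones[-1] plays the role of d_{t+n}. bootstrap_value stands in for
    max_a' Q_target(s_{t+n}, a').
    """
    total = 0.0
    discount = 1.0
    for reward, done in zip(rewards, dones):
        total += discount * reward
        discount *= gamma
        if done:
            # Assumption: the episode ended inside the window, so later
            # rewards are ignored and no bootstrap term is added.
            return total
    return total + discount * bootstrap_value


# Three-step window, no termination: 1 + 0.9 + 0.81 + 0.729 * 10
y_full = n_step_target([1.0, 1.0, 1.0], [0, 0, 0], bootstrap_value=10.0, gamma=0.9)

# Episode ends on the second transition: 1 + 0.9, no bootstrap
y_cut = n_step_target([1.0, 1.0, 1.0], [0, 1, 0], bootstrap_value=10.0, gamma=0.9)
```

With a window of length one and no termination, this reduces to the one-step Bellman target.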

Dueling architecture decomposes Q into value and advantage:

\[ Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \]
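The aggregation step can be checked by hand. This is a minimal sketch in plain Python with made-up numbers; `dueling_q_values` is a hypothetical helper, and in a real network `value` and `advantages` would be the outputs of the two streams.

```python
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantages.

    Subtracting the mean advantage makes the split identifiable: adding a
    constant to every advantage no longer changes the resulting Q-values.
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [value + adv - mean_advantage for adv in advantages]


q_values = dueling_q_values(2.0, [1.0, -1.0, 3.0])

# Shifting all advantages by a constant leaves the Q-values unchanged.
q_shifted = dueling_q_values(2.0, [101.0, 99.0, 103.0])
```

A side effect of mean subtraction is that the Q-values average to \(V(s)\), which here is 2.0.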

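Prioritized Experience Replay samples transition \(i\) with probability proportional to its priority \(p_i\), which is derived from the absolute TD error \(\delta_i\). The exponent \(\alpha\) controls how strongly priorities skew sampling (\(\alpha = 0\) recovers uniform sampling), and the small constant \(\epsilon\) keeps every transition sampleable: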

\[ p_i = |\delta_i| + \epsilon, \quad P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha} \]
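The sampling distribution and the matching importance-sampling correction can be sketched in a few lines. This is an illustrative, list-based version rather than rlox's buffer (efficient implementations typically use a sum tree); the helper names are hypothetical, and the weight formula \(w_i = (N \cdot P(i))^{-\beta}\), normalized by the largest weight, follows the PER paper.

```python
import random


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha with p_i = |delta_i| + eps."""
    scaled = [(abs(delta) + eps) ** alpha for delta in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), divided by max_i w_i so that weights are <= 1."""
    n = len(probs)
    raw = [(n * p) ** (-beta) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


td_errors = [0.1, 2.0, 0.5, 1.0]          # made-up TD errors
probs = per_probabilities(td_errors)
weights = importance_weights(probs)

rng = random.Random(0)                     # seeded for reproducibility
batch_indices = rng.choices(range(len(td_errors)), weights=probs, k=2)
```

The transition with the largest TD error is sampled most often and receives the smallest weight, which offsets the bias that non-uniform sampling introduces into the loss.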

Pseudocode

algorithm DQN:
    initialize Q-network Q_phi, target network Q_phi'
    initialize replay buffer D (uniform or prioritized)

    for step = 1, 2, ... do
        with probability epsilon: a = random action
        else: a = argmax_a Q_phi(s, a)

        execute a; observe r, s', done
        store (s, a, r, s', done) in D
        s <- s' (reset the environment if done)

        if step >= learning_starts:
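            sample minibatch from D (with PER weights w_i if enabled, else w_i = 1)
            # R is the n-step discounted reward sum and s' the state n steps
            # ahead; with n_step = 1 these are the reward r and the next state

            if double_dqn:
                a* = argmax_a Q_phi(s', a)
                y = R + gamma^n * (1-done) * Q_phi'(s', a*)
            else:
                y = R + gamma^n * (1-done) * max_a Q_phi'(s', a)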

            loss = mean(w_i * (Q_phi(s, a) - y)^2)
            update phi with Adam

            if prioritized: update priorities in D

        every target_update_freq steps:
            phi' <- phi

        decay epsilon from exploration_initial_eps to exploration_final_eps
        over the first exploration_fraction of training
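The action-selection step at the top of the loop can be written as a small standalone function. This is a sketch under the assumption that Q-values arrive as a plain list of floats; `select_action` is a hypothetical name, not part of rlox.

```python
import random


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))   # uniform random action
    # Greedy action: index of the largest Q-value (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(42)
q_values = [0.1, 0.9, 0.3]

greedy_action = select_action(q_values, epsilon=0.0, rng=rng)
explore_actions = [select_action(q_values, epsilon=1.0, rng=rng) for _ in range(5)]
```

With `epsilon=0.0` the function always returns the greedy action; with `epsilon=1.0` every action is drawn uniformly at random.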

Quick Start

from rlox import Trainer

trainer = Trainer("dqn", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)

With Rainbow extensions:

trainer = Trainer("dqn", env="LunarLander-v3", seed=42, config={
    "double_dqn": True,
    "dueling": True,
    "prioritized": True,
    "n_step": 3,
    "learning_rate": 6.3e-4,
    "buffer_size": 100_000,
    "batch_size": 128,
    "exploration_final_eps": 0.02,
})
metrics = trainer.train(total_timesteps=200_000)

Hyperparameters

All defaults from DQNConfig:

Parameter	Default	Description
`learning_rate`	`1e-4`	Adam learning rate
`buffer_size`	`1_000_000`	Replay buffer capacity
`batch_size`	`64`	Minibatch size
`gamma`	`0.99`	Discount factor
`target_update_freq`	`1000`	Steps between hard target network updates
`exploration_fraction`	`0.1`	Fraction of training for epsilon decay
`exploration_initial_eps`	`1.0`	Starting epsilon
`exploration_final_eps`	`0.05`	Final epsilon after decay
`learning_starts`	`1000`	Random exploration steps before training
`double_dqn`	`True`	Use Double DQN action selection
`dueling`	`False`	Use Dueling network architecture
`n_step`	`1`	N-step return horizon
`prioritized`	`False`	Use Prioritized Experience Replay
`alpha`	`0.6`	PER priority exponent
`beta_start`	`0.4`	PER initial importance-sampling exponent
`hidden`	`256`	Hidden layer width
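To show how the three exploration parameters interact, here is a sketch of an epsilon schedule using the defaults from the table. It assumes linear interpolation, which is a common choice but is not confirmed here as rlox's exact schedule; `linear_epsilon` is a hypothetical helper.

```python
def linear_epsilon(step, total_timesteps, initial_eps=1.0, final_eps=0.05,
                   exploration_fraction=0.1):
    """Linearly anneal epsilon over the first exploration_fraction of training.

    Assumption: linear decay, then a constant final_eps for the remaining steps.
    """
    decay_steps = max(1, round(exploration_fraction * total_timesteps))
    progress = min(1.0, step / decay_steps)
    return initial_eps + progress * (final_eps - initial_eps)


total = 100_000
eps_start = linear_epsilon(0, total)         # beginning of training
eps_mid = linear_epsilon(5_000, total)       # halfway through the decay window
eps_end = linear_epsilon(10_000, total)      # decay window finished
eps_late = linear_epsilon(50_000, total)     # stays at final_eps afterwards
```

Under these assumptions, a 100,000-step run spends its first 10,000 steps annealing from 1.0 to 0.05 and the remaining 90,000 steps at 0.05.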

When to Use

Use DQN when: your action space is discrete and you want sample-efficient off-policy training with replay.
Do not use DQN when: your action space is continuous (use SAC or TD3) or you want a simpler on-policy method (use PPO).

References

Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. AAAI.
Wang, Z., Schaul, T., Hessel, M., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML.
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. ICLR.