DreamerV3 -- World Model RL
Intuition
DreamerV3 learns a world model from experience, then trains its policy entirely on trajectories generated by that model ("imagination"). The world model is a Recurrent State-Space Model (RSSM) that separates the latent state into deterministic and stochastic components. Because the actor and critic learn from imagined rollouts rather than from additional environment steps, DreamerV3 is sample-efficient, and it handles high-dimensional observations such as images directly. Version 3 adds symlog predictions and combines KL balancing with free nats to improve stability across diverse domains.
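The symlog transform mentioned above compresses large-magnitude targets (rewards, values, vector observations) while staying close to the identity near zero. The following is a minimal sketch of the transform and its inverse as defined in the DreamerV3 paper; the function names are illustrative and are not taken from rlox.

```python
import math

def symlog(x: float) -> float:
    """Symmetric log: sign(x) * ln(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x: float) -> float:
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)

# Rewards of very different scales land in a comparable range,
# and symexp recovers the original value.
for reward in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    squashed = symlog(reward)
    print(f"{reward:>8.1f} -> {squashed:+.3f} -> {symexp(squashed):+.1f}")
```

A network trained to predict `symlog(target)` can be decoded with `symexp`, so a single set of hyperparameters can cope with rewards that differ by orders of magnitude between domains.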
Key Equations
RSSM dynamics (deterministic + stochastic state):
\[
h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \quad \text{(deterministic)}
\]
\[
z_t \sim q_\phi(z_t | h_t, o_t) \quad \text{(posterior, from observation)}
\]
\[
\hat{z}_t \sim p_\phi(\hat{z}_t | h_t) \quad \text{(prior, for imagination)}
\]
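The three RSSM equations can be sketched as a single step function. This is a toy illustration under stated assumptions, not the rlox implementation: a single tanh layer stands in for the recurrent cell, the stochastic state is one categorical instead of a grid of them, and all names (`rssm_step`, `params`, `obs_embed`) are hypothetical.

```python
import math
import random

def linear(weights, x):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_one_hot(probs, rng):
    index = rng.choices(range(len(probs)), weights=probs)[0]
    return [1.0 if i == index else 0.0 for i in range(len(probs))]

def rssm_step(params, h_prev, z_prev, a_prev, rng, obs_embed=None):
    # Deterministic path: h_t = f(h_{t-1}, z_{t-1}, a_{t-1}).
    # A single tanh layer stands in for the recurrent cell (assumption).
    h = [math.tanh(v) for v in linear(params["f"], h_prev + z_prev + a_prev)]
    if obs_embed is None:
        # Prior p(z_t | h_t): used during imagination, no observation needed.
        logits = linear(params["prior"], h)
    else:
        # Posterior q(z_t | h_t, o_t): used when a real observation is available.
        logits = linear(params["posterior"], h + obs_embed)
    return h, sample_one_hot(softmax(logits), rng)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
deter, classes, act, emb = 8, 4, 2, 3  # toy sizes, far smaller than the defaults
params = {
    "f": random_matrix(deter, deter + classes + act, rng),
    "prior": random_matrix(classes, deter, rng),
    "posterior": random_matrix(classes, deter + emb, rng),
}
h, z = [0.0] * deter, [0.0] * classes
# Observed step: the posterior sees an (already encoded) observation.
h, z = rssm_step(params, h, z, [1.0, 0.0], rng, obs_embed=[0.2, -0.1, 0.5])
# Imagined step: only the prior is used.
h, z_hat = rssm_step(params, h, z, [0.0, 1.0], rng)
```

The same function serves both modes: passing an observation embedding selects the posterior, omitting it selects the prior, which is what lets the model roll forward without the environment.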
World model loss (reconstruction + KL):
\[
L_\text{model} = -\ln p_\phi(o_t | h_t, z_t) - \ln p_\phi(r_t | h_t, z_t) + \beta \, D_\text{KL}[q_\phi(z_t | h_t, o_t) \| p_\phi(z_t | h_t)]
\]
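The KL term is split by KL balancing, where \(\alpha\) is set by kl_balance, \(\text{sg}\) denotes stop-gradient, and free_nats is the threshold below which the KL is not minimized further: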
\[
L_\text{KL} = \alpha \, D_\text{KL}[\text{sg}(q) \| p] + (1 - \alpha) \, D_\text{KL}[q \| \text{sg}(p)]
\]
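A small worked example of the balanced KL for one categorical latent. Since stop-gradient changes only which distribution receives gradients, the two terms have the same forward value; the sketch therefore returns the loss value together with the weight each side's gradient would carry. Applying the free-nats floor to each term separately follows the DreamerV3 paper and is an assumption about how free_nats is applied here; the function names are illustrative.

```python
import math

def categorical_kl(q, p):
    """KL[q || p] in nats between two categorical distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def balanced_kl(q, p, alpha=0.8, free_nats=1.0):
    """Return (loss value, gradient weight on prior, gradient weight on posterior)."""
    kl = categorical_kl(q, p)
    # Free nats: the loss is floored, so no gradient flows while kl <= free_nats.
    floored = max(kl, free_nats)
    # alpha * KL[sg(q) || p] trains the prior toward the posterior;
    # (1 - alpha) * KL[q || sg(p)] regularizes the posterior toward the prior.
    loss = alpha * floored + (1.0 - alpha) * floored
    active = kl > free_nats
    return loss, (alpha if active else 0.0), ((1.0 - alpha) if active else 0.0)

uniform_prior = [0.25, 0.25, 0.25, 0.25]
# Posterior close to the prior: KL is below the floor, both gradients are off.
near_prior = balanced_kl([0.4, 0.3, 0.2, 0.1], uniform_prior)
# Confident posterior: KL exceeds the floor, prior gets weight 0.8, posterior 0.2.
far_from_prior = balanced_kl([0.97, 0.01, 0.01, 0.01], uniform_prior)
```

With \(\alpha = 0.8\), most of the pressure goes into improving the prior, so the posterior is not dragged toward a prior that is still poorly trained.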
Imagination (actor-critic in latent space, with \(s_t = (h_t, z_t)\)):
\[
\lambda\text{-return:} \quad V_t^\lambda = r_t + \gamma \left[ (1-\lambda) V(s_{t+1}) + \lambda V_{t+1}^\lambda \right]
\]
\[
L_\text{actor} = -\mathbb{E}_{\text{imagine}} \left[ V_t^\lambda \right], \quad L_\text{critic} = \mathbb{E}_{\text{imagine}} \left[ (V(s_t) - \text{sg}(V_t^\lambda))^2 \right]
\]
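The recursion for \(V_t^\lambda\) is evaluated backward along an imagined trajectory. The sketch below assumes the recursion is bootstrapped with the critic's value at the final imagined state, which the equations above do not state explicitly; the function names are illustrative.

```python
def lambda_returns(rewards, values, gamma=0.997, lam=0.95):
    """rewards[t] = r_t for t = 0..H-1; values[t] = V(s_t) for t = 0..H."""
    horizon = len(rewards)
    assert len(values) == horizon + 1
    returns = [0.0] * horizon
    next_return = values[-1]  # assumption: V^lambda_H = V(s_H)
    for t in reversed(range(horizon)):
        # V^lambda_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * V^lambda_{t+1})
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns

def actor_critic_losses(rewards, values, gamma=0.997, lam=0.95):
    """Forward values of L_actor and L_critic averaged over the trajectory."""
    targets = lambda_returns(rewards, values, gamma, lam)
    actor_loss = -sum(targets) / len(targets)
    # Targets are treated as constants (sg) when differentiating the critic loss.
    critic_loss = sum((v - g) ** 2 for v, g in zip(values[:-1], targets)) / len(targets)
    return actor_loss, critic_loss

# lam = 1 reduces to a discounted sum of rewards plus the bootstrap value.
print(lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0))
# -> [1.75, 1.5, 1.0]
```

Setting `lam=0` gives the one-step target \(r_t + \gamma V(s_{t+1})\); intermediate values trade the bias of the critic against the variance of long imagined rollouts.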
Pseudocode
algorithm DreamerV3:
    initialize RSSM world model (encoder, dynamics, decoder, reward predictor)
    initialize actor pi_theta, critic V_psi
    initialize replay buffer D
    for step = 1, 2, ... do
        # Environment interaction
        encode o_t, infer z_t ~ q(z|h,o)
        a_t ~ pi_theta(h_t, z_t)
        store (o_t, a_t, r_t) in D

        # World model training
        sample sequences of length seq_len from D
        compute RSSM states (h_t, z_t) from sequences
        L_model = reconstruction_loss + reward_loss + KL_loss(kl_balance, free_nats)
        update world model

        # Imagination training
        imagine trajectories of length imagination_horizon from current states
        compute lambda-returns V_t^lambda along imagined trajectories
        update actor to maximize V_t^lambda
        update critic to predict V_t^lambda
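The "imagine trajectories" step can be sketched as a loop that never touches the environment. The callables below are hypothetical stand-ins: in DreamerV3 the dynamics would be the RSSM prior step and the reward function the learned reward predictor, and the reward indexing convention is an assumption.

```python
def imagine(start_state, policy, dynamics, reward_fn, horizon=15):
    """Roll a policy forward inside a model; no environment calls are made."""
    states, actions, rewards = [start_state], [], []
    for _ in range(horizon):
        action = policy(states[-1])
        next_state = dynamics(states[-1], action)  # stand-in for the RSSM prior step
        actions.append(action)
        rewards.append(reward_fn(next_state))      # stand-in for the reward predictor
        states.append(next_state)
    return states, actions, rewards

# Toy stand-ins: a 1-D state that the policy pulls toward zero.
states, actions, rewards = imagine(
    start_state=4.0,
    policy=lambda s: -0.5 * s,
    dynamics=lambda s, a: s + a,
    reward_fn=lambda s: -abs(s),
)
print(len(states), len(rewards))  # -> 16 15
```

The returned rewards, together with critic values at each imagined state, are the inputs needed to compute lambda-returns for the actor and critic updates.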
Quick Start
from rlox import Trainer

trainer = Trainer("dreamer", env="CartPole-v1", seed=42, config={
    "batch_size": 16,
    "seq_len": 50,
    "imagination_horizon": 15,
})
metrics = trainer.train(total_timesteps=100_000)
For visual control tasks:
from rlox import Trainer

trainer = Trainer("dreamer", env="DMC-Cheetah-Run", seed=42, config={
    "learning_rate": 1e-4,
    "buffer_size": 1_000_000,
    "batch_size": 16,
    "seq_len": 50,
    "gamma": 0.997,
    "lambda_": 0.95,
    "deter_dim": 512,
    "stoch_dim": 32,
    "stoch_classes": 32,
    "imagination_horizon": 15,
})
metrics = trainer.train(total_timesteps=1_000_000)
Hyperparameters
All defaults from DreamerV3Config:
| Parameter | Default | Description |
|---|---|---|
| `learning_rate` | `1e-4` | Learning rate for all optimizers |
| `buffer_size` | `1_000_000` | Replay buffer capacity |
| `batch_size` | `16` | Number of sequences per training batch |
| `seq_len` | `50` | Sequence length for training |
| `gamma` | `0.997` | Discount factor |
| `lambda_` | `0.95` | Lambda for lambda-returns |
| `deter_dim` | `512` | Deterministic state dimension in RSSM |
| `stoch_dim` | `32` | Number of categorical distributions |
| `stoch_classes` | `32` | Classes per categorical |
| `hidden` | `512` | Hidden layer width |
| `imagination_horizon` | `15` | Steps to imagine ahead for actor-critic |
| `kl_balance` | `0.8` | KL balancing coefficient |
| `free_nats` | `1.0` | Free nats for KL loss |
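A quick arithmetic check of what the default sizes imply. It assumes the stochastic state is flattened into one-hot features and concatenated with the deterministic state before being fed to the actor, critic, and prediction heads; that is the usual DreamerV3 layout, though this page does not confirm it for rlox.

```python
config = {
    "batch_size": 16,
    "seq_len": 50,
    "deter_dim": 512,
    "stoch_dim": 32,
    "stoch_classes": 32,
}

# 32 categoricals with 32 classes each, flattened into one-hot features.
stoch_features = config["stoch_dim"] * config["stoch_classes"]
# Assumed model state: concatenation of h_t and the flattened z_t.
state_features = config["deter_dim"] + stoch_features
# Environment steps consumed by one world-model training batch.
steps_per_batch = config["batch_size"] * config["seq_len"]

print(f"z_t: {stoch_features} features, (h_t, z_t): {state_features} features")
print(f"steps per batch: {steps_per_batch}")
# -> z_t: 1024 features, (h_t, z_t): 1536 features
# -> steps per batch: 800
```

Raising `stoch_dim` or `stoch_classes` grows the latent multiplicatively, so the two are best scaled together with `hidden`.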
When to Use
- Use DreamerV3 when: you need maximum sample efficiency, work with pixel observations, or have complex dynamics that benefit from a learned model.
- Do not use DreamerV3 when: you need fast wall-clock training time (model-free methods are faster per update), or your environment is simple enough that PPO solves it in minutes.
References
- Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104 (DreamerV3).
- Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. ICML (PlaNet, which introduced the RSSM).