DreamerV3 -- World Model RL

Intuition

DreamerV3 learns a world model from experience, then trains a policy entirely within the learned model ("imagination"). The world model is a Recurrent State-Space Model (RSSM) that splits the latent state into a deterministic component and a stochastic component. Because the policy trains on imagined trajectories rather than environment steps, DreamerV3 is highly sample-efficient and handles high-dimensional observations (images) directly. Version 3 adds symlog predictions and KL balancing with free nats for improved stability across diverse domains.
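The symlog transform mentioned above squashes prediction targets so that rewards and values of very different scales can share one network. A minimal sketch of the transform and its inverse (the function names follow the paper; this is not the rlox implementation):

```python
import numpy as np

def symlog(x):
    # sign(x) * ln(|x| + 1): identity-like near 0, logarithmic for large |x|
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    # exact inverse of symlog: sign(x) * (exp(|x|) - 1)
    return np.sign(x) * np.expm1(np.abs(x))

print(symlog(np.array([0.0, 100.0, -100.0])))
```

The network predicts in symlog space; predictions are mapped back with symexp before use.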

Key Equations

RSSM dynamics (deterministic + stochastic state):

\[ h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \quad \text{(deterministic)} \]
\[ z_t \sim q_\phi(z_t | h_t, o_t) \quad \text{(posterior, from observation)} \]
\[ \hat{z}_t \sim p_\phi(z_t | h_t) \quad \text{(prior, for imagination)} \]
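The RSSM step above can be sketched in a few lines of numpy. Everything here is illustrative: the toy dimensions, random linear maps, and tanh update stand in for the GRU and MLPs a real RSSM uses, and the stochastic state is sampled from independent categoricals as one-hot vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
DETER, STOCH, CLASSES, ACT = 8, 4, 3, 2  # toy dimensions, not the defaults

# toy parameters (a real RSSM uses a GRU and MLPs here)
W_h = rng.normal(size=(DETER, DETER + STOCH * CLASSES + ACT)) * 0.1
W_prior = rng.normal(size=(STOCH * CLASSES, DETER)) * 0.1

def rssm_step(h_prev, z_prev, a_prev):
    # deterministic path: h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
    x = np.concatenate([h_prev, z_prev.ravel(), a_prev])
    h = np.tanh(W_h @ x)
    # prior p(z_t | h_t): logits -> STOCH independent categoricals
    logits = (W_prior @ h).reshape(STOCH, CLASSES)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # sample each categorical as a one-hot row
    z = np.stack([np.eye(CLASSES)[rng.choice(CLASSES, p=p)] for p in probs])
    return h, z

h = np.zeros(DETER)
z = np.zeros((STOCH, CLASSES))
a = np.zeros(ACT)
h, z = rssm_step(h, z, a)
print(h.shape, z.shape)  # (8,) (4, 3)
```

During imagination only this prior path is used; when observations are available, the posterior q(z | h, o) replaces the prior sample.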

World model loss (reconstruction + KL):

\[ L_\text{model} = -\ln p_\phi(o_t | h_t, z_t) - \ln p_\phi(r_t | h_t, z_t) + \beta \, D_\text{KL}[q_\phi(z_t | h_t, o_t) \| p_\phi(z_t | h_t)] \]

with KL balancing:

\[ L_\text{KL} = \alpha \, D_\text{KL}[\text{sg}(q) \| p] + (1 - \alpha) \, D_\text{KL}[q \| \text{sg}(p)] \]
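A numeric sketch of the balanced KL with free nats, under the defaults kl_balance = 0.8 and free_nats = 1.0. Plain numpy has no stop-gradient, so sg() only appears as comments here; the two terms share the same value and differ only in which network they would train during backprop:

```python
import numpy as np

def kl_categorical(q, p):
    # KL divergence between two categorical distributions
    return float(np.sum(q * (np.log(q) - np.log(p))))

def balanced_kl(q, p, alpha=0.8, free_nats=1.0):
    # In a real implementation sg() is a stop-gradient; numerically both
    # terms are identical, so the split only matters for backprop.
    dyn = kl_categorical(q, p)  # KL[sg(q) || p]: trains the prior
    rep = kl_categorical(q, p)  # KL[q || sg(p)]: trains the posterior
    # free nats: KL below the threshold contributes no gradient signal
    return alpha * max(dyn, free_nats) + (1 - alpha) * max(rep, free_nats)

q = np.array([0.7, 0.2, 0.1])
p = np.array([1 / 3, 1 / 3, 1 / 3])
print(balanced_kl(q, p))
```

Note that when the KL is already below free_nats, both clipped terms equal the threshold, so the loss is constant and the latent is not over-regularized.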

Imagination (actor-critic in latent space):

\[ \lambda\text{-return:} \quad V_t^\lambda = r_t + \gamma \left[ (1-\lambda) V(s_{t+1}) + \lambda V_{t+1}^\lambda \right] \]
\[ L_\text{actor} = -\mathbb{E}_{\text{imagine}} \left[ V_t^\lambda \right], \quad L_\text{critic} = \mathbb{E}_{\text{imagine}} \left[ (V(s_t) - \text{sg}(V_t^\lambda))^2 \right] \]
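The lambda-return recursion above is computed backward along each imagined trajectory, bootstrapping from the critic at the final state. A self-contained numpy sketch (not the rlox implementation; the gamma and lam defaults mirror the hyperparameter table):

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.997, lam=0.95):
    # values[t] = V(s_t); requires one bootstrap value past the last reward
    H = len(rewards)
    returns = np.zeros(H)
    last = values[H]  # bootstrap: V^lambda at the horizon is V(s_H)
    for t in reversed(range(H)):
        # V_t^lambda = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * V_{t+1}^lambda)
        last = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * last)
        returns[t] = last
    return returns

# with gamma = lam = 1 and zero values this reduces to reward-to-go
print(lambda_returns(np.ones(5), np.zeros(6), gamma=1.0, lam=1.0))  # -> [5. 4. 3. 2. 1.]
```

The actor then ascends the mean of these returns, while the critic regresses toward them through the stop-gradient, exactly as in the loss pair above.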

Pseudocode

algorithm DreamerV3:
    initialize RSSM world model (encoder, dynamics, decoder, reward predictor)
    initialize actor pi_theta, critic V_psi
    initialize replay buffer D

    for step = 1, 2, ... do
        # Environment interaction
        h_t = f(h_{t-1}, z_{t-1}, a_{t-1})   # deterministic RSSM state
        encode o_t, infer posterior z_t ~ q(z | h_t, o_t)
        a_t ~ pi_theta(h_t, z_t); step environment
        store (o_t, a_t, r_t) in D

        # World model training
        sample sequences of length seq_len from D
        compute RSSM states (h_t, z_t) from sequences
        L_model = reconstruction_loss + reward_loss + KL_loss(kl_balance, free_nats)
        update world model

        # Imagination training
        imagine trajectories of length imagination_horizon from current states
        compute lambda-returns V_t^lambda along imagined trajectories
        update actor to maximize V_t^lambda
        update critic to predict V_t^lambda

Quick Start

from rlox import Trainer

trainer = Trainer("dreamer", env="CartPole-v1", seed=42, config={
    "batch_size": 16,
    "seq_len": 50,
    "imagination_horizon": 15,
})
metrics = trainer.train(total_timesteps=100_000)

For visual control tasks:

trainer = Trainer("dreamer", env="DMC-Cheetah-Run", seed=42, config={
    "learning_rate": 1e-4,
    "buffer_size": 1_000_000,
    "batch_size": 16,
    "seq_len": 50,
    "gamma": 0.997,
    "lambda_": 0.95,
    "deter_dim": 512,
    "stoch_dim": 32,
    "stoch_classes": 32,
    "imagination_horizon": 15,
})
metrics = trainer.train(total_timesteps=1_000_000)

Hyperparameters

All defaults from DreamerV3Config:

Parameter Default Description
learning_rate 1e-4 Learning rate for all optimizers
buffer_size 1_000_000 Replay buffer capacity
batch_size 16 Number of sequences per training batch
seq_len 50 Sequence length for training
gamma 0.997 Discount factor
lambda_ 0.95 Lambda for lambda-returns
deter_dim 512 Deterministic state dimension in RSSM
stoch_dim 32 Number of categorical distributions
stoch_classes 32 Classes per categorical
hidden 512 Hidden layer width
imagination_horizon 15 Steps to imagine ahead for actor-critic
kl_balance 0.8 KL balancing coefficient
free_nats 1.0 Free nats for KL loss
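The table above can be mirrored as a config object for programmatic overrides. This dataclass is a hypothetical sketch whose field names and defaults come from the table; the actual rlox DreamerV3Config may differ in structure:

```python
from dataclasses import dataclass

@dataclass
class DreamerV3Config:
    # field names and defaults mirror the hyperparameter table above
    learning_rate: float = 1e-4
    buffer_size: int = 1_000_000
    batch_size: int = 16
    seq_len: int = 50
    gamma: float = 0.997
    lambda_: float = 0.95
    deter_dim: int = 512
    stoch_dim: int = 32        # number of categorical distributions
    stoch_classes: int = 32    # classes per categorical
    hidden: int = 512
    imagination_horizon: int = 15
    kl_balance: float = 0.8
    free_nats: float = 1.0

# override only what differs from the defaults
cfg = DreamerV3Config(gamma=0.99, imagination_horizon=10)
print(cfg.gamma, cfg.seq_len)
```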

When to Use

  • Use DreamerV3 when: you need maximum sample efficiency, work with pixel observations, or have complex dynamics that benefit from a learned model.
  • Do not use DreamerV3 when: you need fast wall-clock training time (model-free methods are faster per update), or your environment is simple enough that PPO solves it in minutes.

References

  • Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv:2301.04104 (DreamerV3).
  • Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., & Davidson, J. (2019). Learning Latent Dynamics for Planning from Pixels. ICML (PlaNet, the RSSM predecessor of Dreamer).