Skip to content

Learning Path

Your guide to mastering reinforcement learning with rlox, from zero to production.

flowchart TD
    L1[Level 1: Getting Started]
    L2[Level 2: Core Concepts]
    L3A[Level 3a: On-Policy]
    L3B[Level 3b: Off-Policy]
    L3C[Level 3c: Model-Based & Multi-Agent]
    L4[Level 4: Advanced Topics]
    L5[Level 5: Production & Scale]

    L1 --> L2
    L2 --> L3A
    L2 --> L3B
    L2 --> L3C
    L3A --> L4
    L3B --> L4
    L3C --> L4
    L4 --> L5

    L3A -.- VPG[VPG] & A2C[A2C] & PPO[PPO] & TRPO[TRPO]
    L3B -.- DQN[DQN] & TD3[TD3] & SAC[SAC] & IMPALA[IMPALA]
    L3C -.- Dreamer[DreamerV3] & MAPPO[MAPPO] & QMIX[QMIX]

    style L1 fill:#e8f5e9,stroke:#388e3c
    style L2 fill:#e3f2fd,stroke:#1976d2
    style L3A fill:#fff3e0,stroke:#f57c00
    style L3B fill:#fff3e0,stroke:#f57c00
    style L3C fill:#fff3e0,stroke:#f57c00
    style L4 fill:#fce4ec,stroke:#c62828
    style L5 fill:#f3e5f5,stroke:#7b1fa2

Level 1: Getting Started (30 minutes)

Goal: Install rlox, train your first agent, and see results.

Install rlox

pip install rlox

Train your first agent

from rlox import Trainer

trainer = Trainer("ppo", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)
print(f"Final return: {metrics['mean_reward']:.1f}")

Understand the Trainer API

The Trainer is the single entry point for all algorithms:

# Create with algorithm name + environment
trainer = Trainer("sac", env="Pendulum-v1")

# Train for N timesteps
metrics = trainer.train(total_timesteps=50_000)

# Save / load checkpoints
trainer.save("my_model")
trainer = Trainer.from_checkpoint("my_model", algorithm="sac", env="Pendulum-v1")

# Predict actions
action = trainer.predict(obs, deterministic=True)

Further reading


Level 2: Core Concepts (2-3 hours)

Goal: Understand the building blocks of RL and the rlox architecture.

Policy gradient fundamentals

Read Policy Gradient Fundamentals to understand:

  • The REINFORCE algorithm and log-probability trick
  • Baselines and variance reduction
  • From VPG to modern policy gradients

The Polars architecture

rlox uses a Rust data plane + Python control plane:

Layer Language Responsibility
Data collection Rust (via PyO3) Rollout buffers, GAE, reward normalization
Training loop Python Gradient computation, optimizer steps
Configuration Python Dataclass configs with YAML/TOML serialization

The Rust data plane provides 3-50x speedups over pure Python for buffer operations, GAE computation, and environment stepping.

Observations, actions, rewards

Concept Discrete (CartPole) Continuous (MuJoCo)
Observation Box(4,) float32 Box(N,) float32
Action Discrete(2) Box(M,) float32
Reward +1 per step Task-specific
Algorithm PPO, DQN, A2C PPO, SAC, TD3

Further reading


Level 3: Algorithms (1-2 days)

Goal: Know which algorithm to use and why.

On-policy methods

Learn these in order -- each builds on the previous:

  1. VPG -- Vanilla Policy Gradient. The simplest policy gradient. High variance, but easy to understand.
  2. A2C -- Advantage Actor-Critic. Adds a learned baseline (value function) to reduce variance.
  3. PPO -- Proximal Policy Optimization. Clips the policy ratio for stable updates. The default choice for most tasks.
  4. TRPO -- Trust Region Policy Optimization. Constrains the KL divergence directly. More principled but slower than PPO.

Off-policy methods

These reuse past experience via replay buffers:

  1. DQN -- Deep Q-Network. Value-based, discrete actions only. Includes Double DQN, Dueling, PER, N-step extensions.
  2. TD3 -- Twin Delayed DDPG. Deterministic policy for continuous control with twin critics and delayed updates.
  3. SAC -- Soft Actor-Critic. Maximum entropy framework for continuous control. The default off-policy choice.

Distributed methods

  1. IMPALA -- Distributed actor-learner architecture with V-trace off-policy correction. For large-scale training.

Model-based and multi-agent

  • DreamerV3 -- Learns a world model (RSSM) and trains a policy entirely in imagination.
  • MAPPO -- Multi-Agent PPO with centralized training and decentralized execution (CTDE).

Algorithm selection flowchart

flowchart TD
    Start{What is your action space?}
    Start -->|Discrete| D{Sample efficiency matters?}
    Start -->|Continuous| C{Need max entropy?}
    Start -->|Multi-agent| MA[MAPPO / QMIX]
    Start -->|Pixel obs / world model| WM[DreamerV3]

    D -->|Yes| DQN[DQN]
    D -->|No| PPO[PPO]

    C -->|Yes| SAC[SAC]
    C -->|No| C2{Deterministic OK?}
    C2 -->|Yes| TD3[TD3]
    C2 -->|No| PPO2[PPO]

Further reading


Level 4: Advanced Topics (1 week)

Goal: Go beyond vanilla training with exploration, meta-learning, and offline RL.

Intrinsic motivation

Sparse-reward environments need curiosity-driven exploration:

  • RND (Random Network Distillation) -- prediction error as intrinsic reward
  • ICM (Intrinsic Curiosity Module) -- forward/inverse model curiosity
  • Go-Explore -- archive-based exploration for hard-exploration problems

Meta-learning

  • Reptile -- first-order meta-learning for fast task adaptation

Offline RL

Train policies from fixed datasets without environment interaction:

  • CQL -- Conservative Q-Learning with pessimistic value estimates
  • Cal-QL -- Calibrated CQL with automatic conservatism tuning
  • IQL -- Implicit Q-Learning without policy-dependent Bellman backups
  • Decision Transformer -- sequence modeling approach to offline RL

Reward shaping

  • PBRS (Potential-Based Reward Shaping) -- provably policy-invariant reward augmentation

Population-based training

  • PBT -- jointly optimize hyperparameters and weights across a population

Further reading


Level 5: Production and Scale

Goal: Deploy trained agents and scale training across machines.

Environment normalization

trainer = Trainer("ppo", env="HalfCheetah-v4", config={
    "normalize_obs": True,
    "normalize_rewards": True,
})

VecNormalize is critical for MuJoCo environments -- it maintains running statistics for observations and rewards.

Config-driven training

Define experiments in YAML or TOML:

algorithm: ppo
env_id: HalfCheetah-v4
total_timesteps: 1_000_000
seed: 42
hyperparameters:
  learning_rate: 3.0e-4
  n_steps: 2048
  n_epochs: 10
callbacks: [eval, checkpoint, progress]
logger: wandb
from rlox.config import TrainingConfig
cfg = TrainingConfig.from_yaml("experiment.yaml")

Distributed training

Scale across multiple GPUs and machines:

Diagnostics dashboard

Monitor training in real time:

from rlox.dashboard import launch_dashboard
launch_dashboard(log_dir="runs/")

See Dashboard API reference.

Plugin ecosystem

Extend rlox without modifying the core:

  • Register custom environments, buffers, and reward functions via ENV_REGISTRY, BUFFER_REGISTRY, REWARD_REGISTRY
  • Auto-discover third-party plugins with discover_plugins()
  • See Python User Guide -- Plugin Ecosystem

Visual RL

Train agents from pixel observations:

Cloud deploy

Deploy trained agents to production:

  • generate_dockerfile for containerized model serving
  • generate_k8s_job for Kubernetes training jobs
  • generate_sagemaker_config for AWS SageMaker
  • See Python User Guide -- Cloud Deploy

Model zoo

Share and reuse pretrained agents:

  • ModelZoo.register / ModelZoo.load for model sharing
  • ModelCard metadata for discoverability

Custom algorithms

Extend rlox with the protocol system:


Day Topic Pages
1 Level 1 + Level 2 This page, Getting Started, RL Intro
2 On-policy algorithms VPG, A2C, PPO
3 Off-policy algorithms DQN, SAC, TD3
4 Advanced algorithms TRPO, IMPALA, DreamerV3
5 Multi-agent + advanced MAPPO, intrinsic motivation, offline RL
6 Production & plugins Config-driven training, distributed, dashboard, plugin ecosystem
7 Deploy & visual RL Cloud deploy, visual RL wrappers, model zoo