Learning Path¶
Your guide to mastering reinforcement learning with rlox, from zero to production.
flowchart TD
L1[Level 1: Getting Started]
L2[Level 2: Core Concepts]
L3A[Level 3a: On-Policy]
L3B[Level 3b: Off-Policy]
L3C[Level 3c: Model-Based & Multi-Agent]
L4[Level 4: Advanced Topics]
L5[Level 5: Production & Scale]
L1 --> L2
L2 --> L3A
L2 --> L3B
L2 --> L3C
L3A --> L4
L3B --> L4
L3C --> L4
L4 --> L5
L3A -.- VPG[VPG] & A2C[A2C] & PPO[PPO] & TRPO[TRPO]
L3B -.- DQN[DQN] & TD3[TD3] & SAC[SAC] & IMPALA[IMPALA]
L3C -.- Dreamer[DreamerV3] & MAPPO[MAPPO] & QMIX[QMIX]
style L1 fill:#e8f5e9,stroke:#388e3c
style L2 fill:#e3f2fd,stroke:#1976d2
style L3A fill:#fff3e0,stroke:#f57c00
style L3B fill:#fff3e0,stroke:#f57c00
style L3C fill:#fff3e0,stroke:#f57c00
style L4 fill:#fce4ec,stroke:#c62828
style L5 fill:#f3e5f5,stroke:#7b1fa2
Level 1: Getting Started (30 minutes)¶
Goal: Install rlox, train your first agent, and see results.
Install rlox¶
Train your first agent¶
from rlox import Trainer
trainer = Trainer("ppo", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=100_000)
print(f"Final return: {metrics['mean_reward']:.1f}")
Understand the Trainer API¶
The Trainer is the single entry point for all algorithms:
# Create with algorithm name + environment
trainer = Trainer("sac", env="Pendulum-v1")
# Train for N timesteps
metrics = trainer.train(total_timesteps=50_000)
# Save / load checkpoints
trainer.save("my_model")
trainer = Trainer.from_checkpoint("my_model", algorithm="sac", env="Pendulum-v1")
# Predict an action for `obs`, an observation obtained from the environment
action = trainer.predict(obs, deterministic=True)
Further reading¶
- Getting Started guide -- full installation and first-run walkthrough
- Python User Guide -- API tour and common patterns
Level 2: Core Concepts (2-3 hours)¶
Goal: Understand the building blocks of RL and the rlox architecture.
Policy gradient fundamentals¶
Read Policy Gradient Fundamentals to understand:
- The REINFORCE algorithm and log-probability trick
- Baselines and variance reduction
- From VPG to modern policy gradients
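The ideas listed above can be tried out on a toy problem before reading the full derivation. The following is a minimal sketch, not rlox code: it runs REINFORCE with a running-mean baseline on a two-armed Gaussian bandit using a softmax policy. The function names, the bandit setup, and the hyperparameters are all assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, noise_std=0.5, seed=0):
    """REINFORCE with a running-mean baseline on a Gaussian bandit (toy setup)."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], noise_std)
        # Subtracting a baseline reduces variance without biasing the gradient.
        advantage = reward - baseline
        # Log-probability trick: d log pi(action) / d logit_k = 1[k == action] - pi_k
        for k in range(len(logits)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_log_pi
        # Running mean of observed rewards serves as the baseline.
        baseline += (reward - baseline) / t
    return softmax(logits)


final_probs = reinforce_bandit([0.0, 2.0])
print(final_probs)
```

Setting `baseline` to a constant zero in this sketch recovers plain REINFORCE, which makes the variance difference easy to observe across seeds.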
The Polars architecture¶
Like Polars, rlox pairs a Rust data plane with a Python control plane:
| Layer | Language | Responsibility |
|---|---|---|
| Data collection | Rust (via PyO3) | Rollout buffers, GAE, reward normalization |
| Training loop | Python | Gradient computation, optimizer steps |
| Configuration | Python | Dataclass configs with YAML/TOML serialization |
The Rust data plane provides 3-50x speedups over pure Python for buffer operations, GAE computation, and environment stepping.
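For reference, the GAE recursion that such a data plane computes can be written in a few lines of pure Python. This is a sketch of the standard algorithm under stated assumptions, not the rlox implementation: the function name and argument layout are hypothetical, and `dones[t]` is assumed to mean that the episode ended after step `t`.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout (reference sketch).

    rewards, values, dones: per-step lists of equal length.
    last_value: value estimate for the state after the final step.
    Returns (advantages, returns), where returns = advantages + values.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, also cut at boundaries.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

With `gamma=1` and `lam=1` the advantages reduce to the undiscounted reward-to-go minus the value estimate, which is a convenient sanity check.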
Observations, actions, rewards¶
| Concept | Discrete (CartPole) | Continuous (MuJoCo) |
|---|---|---|
| Observation | Box(4,) float32 | Box(N,) float32 |
| Action | Discrete(2) | Box(M,) float32 |
| Reward | +1 per step | Task-specific |
| Algorithm | PPO, DQN, A2C | PPO, SAC, TD3 |
Further reading¶
- RL Introduction -- MDP formalism, Bellman equations, policy vs value methods
- Math Reference -- notation and key derivations
Level 3: Algorithms (1-2 days)¶
Goal: Know which algorithm to use and why.
On-policy methods¶
Learn these in order -- each builds on the previous:
- VPG -- Vanilla Policy Gradient. The simplest policy gradient. High variance, but easy to understand.
- A2C -- Advantage Actor-Critic. Adds a learned baseline (value function) to reduce variance.
- PPO -- Proximal Policy Optimization. Clips the policy ratio for stable updates. The default choice for most tasks.
- TRPO -- Trust Region Policy Optimization. Constrains the KL divergence directly. More principled but slower than PPO.
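To make "clips the policy ratio" concrete, here is a minimal sketch of the PPO clipped surrogate loss in pure Python. It follows the standard formulation and is not taken from rlox; the function name and the list-based inputs are assumptions for illustration.

```python
import math


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss, averaged over samples (to be minimized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the two surrogate terms.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

With a positive advantage, pushing the ratio past `1 + clip_eps` gains nothing, so the update has no incentive to move the policy further. With a negative advantage the unclipped term still applies when the ratio is large, so harmful moves remain penalized.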
Off-policy methods¶
These reuse past experience via replay buffers:
- DQN -- Deep Q-Network. Value-based, discrete actions only. Includes Double DQN, Dueling, PER, N-step extensions.
- TD3 -- Twin Delayed DDPG. Deterministic policy for continuous control with twin critics and delayed updates.
- SAC -- Soft Actor-Critic. Maximum entropy framework for continuous control. The default off-policy choice.
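As an illustration of one of the DQN extensions named above, the sketch below computes Double DQN targets for a batch sampled from a replay buffer. It is a plain-Python rendering of the standard rule (the online network selects the action, the target network evaluates it), not rlox code; the function name and input layout are hypothetical.

```python
def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """Bootstrapped targets for a batch of transitions.

    next_q_online / next_q_target: per-transition lists of Q-values for the
    next state, one entry per discrete action.
    """
    targets = []
    for r, done, q_on, q_tg in zip(rewards, dones, next_q_online, next_q_target):
        # Action selection uses the online network ...
        best_action = max(range(len(q_on)), key=q_on.__getitem__)
        # ... but the value comes from the target network, which reduces
        # the overestimation bias of a plain max over one network.
        bootstrap = 0.0 if done else gamma * q_tg[best_action]
        targets.append(r + bootstrap)
    return targets
```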
Distributed methods¶
- IMPALA -- Distributed actor-learner architecture with V-trace off-policy correction. For large-scale training.
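The V-trace correction can be sketched as a backward recursion over a single trajectory. This follows the formulation in the IMPALA paper with lambda fixed to 1 and no episode boundaries inside the trajectory; both simplifications, along with the function name and arguments, are assumptions made for this example rather than a description of rlox internals.

```python
import math


def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets v_s for one trajectory without terminal steps."""
    n = len(rewards)
    targets = [0.0] * n
    acc = 0.0  # holds v_{t+1} - V(x_{t+1}) while iterating backwards
    for t in reversed(range(n)):
        # Importance ratio pi(a|x) / mu(a|x), truncated two different ways.
        ratio = math.exp(target_log_probs[t] - behaviour_log_probs[t])
        rho = min(rho_bar, ratio)
        c = min(c_bar, ratio)
        next_value = values[t + 1] if t + 1 < n else bootstrap_value
        delta = rho * (rewards[t] + gamma * next_value - values[t])
        acc = delta + gamma * c * acc
        targets[t] = values[t] + acc
    return targets
```

When the behaviour and target policies agree, every ratio is 1 and the targets reduce to ordinary n-step returns. When the target policy assigns almost no probability to the logged actions, the targets fall back to the current value estimates.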
Model-based and multi-agent¶
- DreamerV3 -- Learns a world model (RSSM) and trains a policy entirely in imagination.
- MAPPO -- Multi-Agent PPO with centralized training and decentralized execution (CTDE).
Algorithm selection flowchart¶
flowchart TD
Start{What is your action space?}
Start -->|Discrete| D{Sample efficiency matters?}
Start -->|Continuous| C{Need max entropy?}
Start -->|Multi-agent| MA[MAPPO / QMIX]
Start -->|Pixel obs / world model| WM[DreamerV3]
D -->|Yes| DQN[DQN]
D -->|No| PPO[PPO]
C -->|Yes| SAC[SAC]
C -->|No| C2{Deterministic OK?}
C2 -->|Yes| TD3[TD3]
C2 -->|No| PPO2[PPO]
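The same decision tree can be expressed as a small helper, which is handy in experiment scripts. This is a sketch that mirrors the flowchart; `recommend_algorithm` and its argument names are hypothetical and not part of rlox.

```python
def recommend_algorithm(setting, *, sample_efficient=False,
                        max_entropy=False, deterministic_ok=False):
    """Return the algorithm suggested by the selection flowchart.

    setting: one of "discrete", "continuous", "multi-agent", "world-model".
    """
    if setting == "multi-agent":
        return "MAPPO / QMIX"
    if setting == "world-model":
        return "DreamerV3"
    if setting == "discrete":
        return "DQN" if sample_efficient else "PPO"
    if setting == "continuous":
        if max_entropy:
            return "SAC"
        return "TD3" if deterministic_ok else "PPO"
    raise ValueError(f"unknown setting: {setting!r}")


print(recommend_algorithm("continuous", max_entropy=True))
```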
Further reading¶
- Algorithm taxonomy -- classification diagram and comparison table
- Research notes -- deep dives into each algorithm's paper
Level 4: Advanced Topics (1 week)¶
Goal: Go beyond vanilla training with exploration, meta-learning, and offline RL.
Intrinsic motivation¶
Sparse-reward environments need curiosity-driven exploration:
- RND (Random Network Distillation) -- prediction error as intrinsic reward
- ICM (Intrinsic Curiosity Module) -- forward/inverse model curiosity
- Go-Explore -- archive-based exploration for hard-exploration problems
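To show how prediction error can act as an intrinsic reward, here is a toy RND sketch in pure Python. A fixed random target maps observations to features, and a linear predictor is trained to match it, so observations seen often yield a small error while unfamiliar ones yield a larger one. The class `TinyRND`, its linear predictor, and all hyperparameters are assumptions for illustration; practical implementations use neural networks and normalize the bonus.

```python
import math
import random


class TinyRND:
    """Toy Random Network Distillation with a linear predictor."""

    def __init__(self, obs_dim, feat_dim=4, lr=0.05, seed=0):
        rng = random.Random(seed)
        # Fixed target "network": features = tanh(W_target @ obs). Never trained.
        self.w_target = [[rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
                         for _ in range(feat_dim)]
        # Trainable predictor: features = W_pred @ obs, initialized to zero.
        self.w_pred = [[0.0] * obs_dim for _ in range(feat_dim)]
        self.lr = lr

    def _target(self, obs):
        return [math.tanh(sum(w * x for w, x in zip(row, obs)))
                for row in self.w_target]

    def _predict(self, obs):
        return [sum(w * x for w, x in zip(row, obs)) for row in self.w_pred]

    def intrinsic_reward(self, obs):
        """Mean squared prediction error, used as the exploration bonus."""
        t, p = self._target(obs), self._predict(obs)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(t)

    def update(self, obs):
        """One SGD step on 0.5 * squared error for a single observation."""
        t, p = self._target(obs), self._predict(obs)
        for i, (pi, ti) in enumerate(zip(p, t)):
            err = pi - ti
            for j, x in enumerate(obs):
                self.w_pred[i][j] -= self.lr * err * x
```

Calling `update` repeatedly on one observation drives its bonus toward zero, while an observation in an untrained direction keeps a nonzero bonus.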
Meta-learning¶
- Reptile -- first-order meta-learning for fast task adaptation
Offline RL¶
Train policies from fixed datasets without environment interaction:
- CQL -- Conservative Q-Learning with pessimistic value estimates
- Cal-QL -- Calibrated CQL with automatic conservatism tuning
- IQL -- Implicit Q-Learning without policy-dependent Bellman backups
- Decision Transformer -- sequence modeling approach to offline RL
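As one concrete piece of the methods above, IQL fits its value function with an asymmetric (expectile) squared loss on the difference between Q and V. The sketch below shows that loss in pure Python; the function name and the list input are assumptions, and the default `tau=0.7` is only an illustrative choice.

```python
def expectile_loss(diffs, tau=0.7):
    """Mean expectile loss over diffs = Q(s, a) - V(s).

    Positive differences are weighted by tau and negative ones by 1 - tau,
    so tau > 0.5 pushes V toward an upper expectile of Q.
    """
    total = 0.0
    for u in diffs:
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total / len(diffs)
```

With `tau=0.5` this is half the ordinary mean squared error. Raising `tau` toward 1 approximates a maximum over the actions present in the dataset, without querying actions outside it.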
Reward shaping¶
- PBRS (Potential-Based Reward Shaping) -- provably policy-invariant reward augmentation
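The shaping term has the form F(s, s') = gamma * Phi(s') - Phi(s), which the following sketch applies to a one-dimensional chain. The helper name, the goal position, and the distance-based potential are hypothetical choices for illustration, not rlox APIs.

```python
def shaped_reward(reward, state, next_state, potential, gamma=0.99):
    """Environment reward plus the potential-based shaping term."""
    return reward + gamma * potential(next_state) - potential(state)


GOAL = 10


def potential(state):
    """Hypothetical potential: negative distance to the goal on a 1-D chain."""
    return -abs(GOAL - state)


# Moving toward the goal earns a bonus, moving away a penalty.
print(shaped_reward(0.0, 3, 4, potential, gamma=1.0))
print(shaped_reward(0.0, 3, 2, potential, gamma=1.0))
```

Along any trajectory the discounted shaping terms telescope to `gamma**T * potential(s_T) - potential(s_0)`, which depends only on the endpoints. That is the reason the ranking of policies is unchanged.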
Population-based training¶
- PBT -- jointly optimize hyperparameters and weights across a population
Further reading¶
- Custom Components tutorial -- extending rlox with your own modules
- Custom Rewards tutorial -- reward wrappers and custom loops
Level 5: Production and Scale¶
Goal: Deploy trained agents and scale training across machines.
Environment normalization¶
trainer = Trainer("ppo", env="HalfCheetah-v4", config={
"normalize_obs": True,
"normalize_rewards": True,
})
VecNormalize is critical for MuJoCo environments -- it maintains running statistics for observations and rewards.
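The running statistics behind this kind of normalization can be sketched with Welford's online algorithm. The scalar class below is a simplified illustration under assumed names (`RunningMeanStd`, `normalize`, the clip value); an actual normalizer tracks one mean and variance per observation dimension, and this is not the rlox implementation.

```python
import math


class RunningMeanStd:
    """Online mean and variance for a scalar stream (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Population variance; falls back to 1.0 before any data arrives.
        return self.m2 / self.count if self.count > 0 else 1.0

    def normalize(self, x, clip=10.0):
        """Standardize x with the current statistics and clip outliers."""
        z = (x - self.mean) / math.sqrt(self.var + self.eps)
        return max(-clip, min(clip, z))
```

The statistics have to be saved with the model and frozen at evaluation time. Otherwise the policy sees inputs on a different scale from the one it was trained on.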
Config-driven training¶
Define experiments in YAML or TOML:
algorithm: ppo
env_id: HalfCheetah-v4
total_timesteps: 1_000_000
seed: 42
hyperparameters:
  learning_rate: 3.0e-4
  n_steps: 2048
  n_epochs: 10
callbacks: [eval, checkpoint, progress]
logger: wandb
Distributed training¶
Scale across multiple GPUs and machines:
- IMPALA -- native multi-actor distributed architecture
- gRPC workers -- for multi-node setups
- See Distributed API reference
Diagnostics dashboard¶
Monitor training in real time:
Plugin ecosystem¶
Extend rlox without modifying the core:
- Register custom environments, buffers, and reward functions via ENV_REGISTRY, BUFFER_REGISTRY, REWARD_REGISTRY
- Auto-discover third-party plugins with discover_plugins()
- See Python User Guide -- Plugin Ecosystem
Visual RL¶
Train agents from pixel observations:
- FrameStack, ImagePreprocess, AtariWrapper for standard preprocessing
- DMControlWrapper for DeepMind Control Suite
- See Python User Guide -- Visual RL Wrappers
Cloud deploy¶
Deploy trained agents to production:
- generate_dockerfile for containerized model serving
- generate_k8s_job for Kubernetes training jobs
- generate_sagemaker_config for AWS SageMaker
- See Python User Guide -- Cloud Deploy
Model zoo¶
Share and reuse pretrained agents:
- ModelZoo.register / ModelZoo.load for model sharing
- ModelCard metadata for discoverability
Custom algorithms¶
Extend rlox with the protocol system:
- Implement the algorithm protocol (collect, update, get_policy)
- Register with the Trainer
- See Custom Components tutorial
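The shape of such a protocol can be sketched with `typing.Protocol`. Only the three method names (collect, update, get_policy) come from the list above; the signatures, return types, and the toy `ConstantPolicyAlgo` class are assumptions, and registration with the Trainer is not shown because it depends on the rlox API described in the tutorial.

```python
from typing import Any, Callable, Protocol, runtime_checkable


@runtime_checkable
class Algorithm(Protocol):
    """Hypothetical algorithm protocol; signatures are illustrative only."""

    def collect(self) -> Any: ...
    def update(self, batch: Any) -> dict: ...
    def get_policy(self) -> Callable[[Any], Any]: ...


class ConstantPolicyAlgo:
    """Toy implementation that always picks action 0."""

    def __init__(self):
        self.num_updates = 0

    def collect(self):
        # A real algorithm would step environments here.
        return [{"obs": 0.0, "action": 0, "reward": 1.0}]

    def update(self, batch):
        # A real algorithm would take gradient steps here.
        self.num_updates += 1
        mean_reward = sum(t["reward"] for t in batch) / len(batch)
        return {"mean_reward": mean_reward}

    def get_policy(self):
        return lambda obs: 0


algo = ConstantPolicyAlgo()
print(isinstance(algo, Algorithm))
```

Structural typing means the class does not need to inherit from `Algorithm`; providing the three methods is enough for the `isinstance` check to pass.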
Recommended reading order¶
| Day | Topic | Pages |
|---|---|---|
| 1 | Level 1 + Level 2 | This page, Getting Started, RL Intro |
| 2 | On-policy algorithms | VPG, A2C, PPO |
| 3 | Off-policy algorithms | DQN, SAC, TD3 |
| 4 | Advanced algorithms | TRPO, IMPALA, DreamerV3 |
| 5 | Multi-agent + advanced | MAPPO, intrinsic motivation, offline RL |
| 6 | Production & plugins | Config-driven training, distributed, dashboard, plugin ecosystem |
| 7 | Deploy & visual RL | Cloud deploy, visual RL wrappers, model zoo |