# Algorithm Reference

rlox implements a broad set of reinforcement learning algorithms spanning model-free on-policy, model-free off-policy, model-based, multi-agent, distributed, offline, and LLM post-training paradigms.
## Taxonomy

```mermaid
graph TD
A[RL Algorithms] --> B[Model-Free]
A --> C[Model-Based]
A --> G[Offline RL]
A --> H[LLM Post-Training]
B --> D[On-Policy]
B --> E[Off-Policy]
B --> F[Multi-Agent]
B --> I[Distributed]
D --> VPG[<a href='vpg/'>VPG</a>]
D --> A2C_node[<a href='a2c/'>A2C</a>]
D --> PPO_node[<a href='ppo/'>PPO</a>]
D --> TRPO_node[<a href='trpo/'>TRPO</a>]
E --> DQN_node[<a href='dqn/'>DQN</a>]
E --> TD3_node[<a href='td3/'>TD3</a>]
E --> SAC_node[<a href='sac/'>SAC</a>]
E --> MPO_node[<a href='mpo/'>MPO</a>]
F --> MAPPO_node[<a href='mappo/'>MAPPO</a>]
F --> QMIX_node[<a href='qmix/'>QMIX</a>]
I --> IMPALA_node[<a href='impala/'>IMPALA</a>]
C --> Dreamer_node[<a href='dreamer/'>DreamerV3</a>]
G --> CQL_node[<a href='cql/'>CQL</a>]
G --> CalQL_node[<a href='calql/'>Cal-QL</a>]
G --> IQL_node[<a href='iql/'>IQL</a>]
G --> TD3BC_node[<a href='td3bc/'>TD3+BC</a>]
G --> BC_node[<a href='bc/'>BC</a>]
G --> AWR_node[<a href='awr/'>AWR</a>]
G --> DT_node[<a href='dt/'>Decision Transformer</a>]
G --> Diff_node[<a href='diffusion/'>Diffusion Policy</a>]
H --> GRPO_node[<a href='grpo/'>GRPO</a>]
H --> DPO_node[<a href='dpo/'>DPO</a>]
style A fill:#e8eaf6,stroke:#3949ab
style B fill:#e3f2fd,stroke:#1976d2
style C fill:#fff3e0,stroke:#f57c00
style D fill:#e8f5e9,stroke:#388e3c
style E fill:#fce4ec,stroke:#c62828
style F fill:#f3e5f5,stroke:#7b1fa2
style G fill:#e0f7fa,stroke:#00838f
style H fill:#fff8e1,stroke:#f9a825
style I fill:#fbe9e7,stroke:#d84315
```
## Comparison Table

| Algorithm | Action Space | Policy Type | Data Efficiency | Stability | Complexity |
|---|---|---|---|---|---|
| VPG | Discrete / Continuous | Stochastic | Low | Low | Minimal |
| A2C | Discrete / Continuous | Stochastic | Low | Medium | Low |
| PPO | Discrete / Continuous | Stochastic | Low | High | Low |
| TRPO | Discrete / Continuous | Stochastic | Low | High | Medium |
| DQN | Discrete only | Value-based | Medium | Medium | Low |
| TD3 | Continuous only | Deterministic | High | High | Medium |
| SAC | Continuous | Stochastic | High | High | Medium |
| MPO | Continuous | Stochastic | High | High | High |
| IMPALA | Discrete / Continuous | Stochastic | Medium | Medium | High |
| DreamerV3 | Discrete / Continuous | Stochastic (world model) | Very high | Medium | High |
| MAPPO | Discrete / Continuous | Stochastic (CTDE) | Low | High | Medium |
| QMIX | Discrete only | Value decomposition | Medium | Medium | Medium |
| CQL | Continuous | Stochastic (offline) | N/A (offline) | High | Medium |
| Cal-QL | Continuous | Stochastic (offline) | N/A (offline) | High | Medium |
| IQL | Continuous | Deterministic (offline) | N/A (offline) | High | Low |
| TD3+BC | Continuous | Deterministic (offline) | N/A (offline) | High | Low |
| BC | Discrete / Continuous | Supervised | N/A (offline) | High | Minimal |
| AWR | Discrete / Continuous | Stochastic | Medium | Medium | Low |
| Decision Transformer | Discrete / Continuous | Sequence model | N/A (offline) | High | Medium |
| Diffusion Policy | Continuous | Diffusion | N/A (offline) | High | High |
| GRPO | Token sequences | Stochastic (LLM) | N/A | Medium | Medium |
| DPO | Token sequences | Stochastic (LLM) | N/A | High | Low |
## Choosing an algorithm

Start with PPO. It works across discrete and continuous action spaces, is stable, and requires minimal tuning; a minimal sketch of its clipped objective follows this list. Branch out from there:
- Continuous control under sample-efficiency constraints -- use SAC or TD3 (TD3's target computation is sketched below)
- Principled off-policy with KL constraints -- use MPO
- Discrete actions with replay -- use DQN (with Double + Dueling extensions)
- Multi-agent cooperative tasks -- use MAPPO or QMIX
- Pixel observations or complex dynamics -- use DreamerV3
- Large-scale distributed training -- use IMPALA
- Formal trust-region guarantees -- use TRPO
- Offline RL (fixed dataset, no interaction):
- Start with IQL or TD3+BC for simplicity (IQL's expectile value loss is sketched below)
- Use CQL or Cal-QL for stronger value conservatism
- Use BC when data is expert-quality
- Use Decision Transformer for large datasets with return conditioning
- Use Diffusion Policy for multimodal action distributions
- Use AWR for a simple advantage-weighted approach
- LLM post-training:
- Use DPO when you have pairwise preference data (loss sketched below)
- Use GRPO for reward-based optimization without a critic (group-relative advantage sketched below)
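
The sketches below illustrate the core update behind several of the recommendations above. They are minimal PyTorch illustrations, not rlox's actual API; every function and argument name in them is hypothetical.

PPO's key idea is the clipped surrogate: the probability ratio between the new and old policies is clipped so a single gradient step cannot move the policy too far.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  adv: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    # Clip the ratio to [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic minimum of clipped and unclipped surrogates; negated
    # because optimizers minimize.
    return -torch.min(ratio * adv, clipped * adv).mean()
```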
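
TD3 stabilizes off-policy actor-critic learning with clipped double-Q targets and target policy smoothing. A hedged sketch of the target computation, assuming hypothetical target networks `actor_t`, `q1_t`, and `q2_t`:

```python
import torch

@torch.no_grad()
def td3_target(reward, next_obs, done, actor_t, q1_t, q2_t,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    # Target policy smoothing: perturb the target action with clipped noise.
    a_next = actor_t(next_obs)
    noise = (noise_std * torch.randn_like(a_next)).clamp(-noise_clip, noise_clip)
    a_next = (a_next + noise).clamp(-act_limit, act_limit)
    # Clipped double-Q: bootstrap from the smaller of the two target critics.
    q_next = torch.min(q1_t(next_obs, a_next), q2_t(next_obs, a_next))
    # One-step bootstrapped target, masked at episode termination.
    return reward + gamma * (1.0 - done) * q_next
```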
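
IQL avoids querying the critic on out-of-distribution actions by fitting V(s) to an upper expectile of Q(s, a) over dataset actions. A sketch of the expectile value loss (hypothetical names):

```python
import torch

def iql_value_loss(q_sa: torch.Tensor, v_s: torch.Tensor,
                   tau: float = 0.7) -> torch.Tensor:
    # Asymmetric L2: with tau > 0.5, positive errors (Q above V) are
    # upweighted, pushing V toward the tau-expectile of Q.
    diff = q_sa - v_s
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()
```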
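
DPO trains directly on preference pairs: it increases the policy's implicit reward margin on the chosen response over the rejected one, measured relative to a frozen reference model. A sketch with hypothetical names, where each input is the summed token log-probability of a response:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response, relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Maximize log-sigmoid of the scaled margin difference.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```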
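
GRPO drops the critic entirely: for each prompt it samples a group of completions, scores them with the reward function, and uses the within-group z-score as the advantage. A sketch (hypothetical names):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: (num_prompts, group_size), one scalar reward per completion.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    # Group-relative advantage: normalize each reward within its group.
    return (rewards - mean) / (std + eps)
```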
## All algorithms

### On-policy

- VPG -- Vanilla Policy Gradient
- A2C -- Advantage Actor-Critic
- PPO -- Proximal Policy Optimization
- TRPO -- Trust Region Policy Optimization
### Off-policy

- DQN -- Deep Q-Network
- TD3 -- Twin Delayed DDPG
- SAC -- Soft Actor-Critic
- MPO -- Maximum a Posteriori Policy Optimization
### Distributed

- IMPALA -- Importance Weighted Actor-Learner Architecture

### Model-based

- DreamerV3 -- World-model-based actor-critic

### Multi-agent

- MAPPO -- Multi-Agent Proximal Policy Optimization
- QMIX -- Monotonic Value Function Factorisation
### Offline RL

- CQL -- Conservative Q-Learning
- Cal-QL -- Calibrated Conservative Q-Learning
- IQL -- Implicit Q-Learning
- TD3+BC -- TD3 with Behavioral Cloning
- BC -- Behavioral Cloning
- AWR -- Advantage Weighted Regression
- Decision Transformer -- RL via Sequence Modeling
- Diffusion Policy -- Diffusion-based action generation

### LLM post-training

- GRPO -- Group Relative Policy Optimization
- DPO -- Direct Preference Optimization