Algorithm Reference

rlox implements a broad set of reinforcement learning algorithms spanning model-free on-policy, model-free off-policy, model-based, multi-agent, distributed, offline, and LLM post-training paradigms.

Taxonomy

graph TD
    A[RL Algorithms] --> B[Model-Free]
    A --> C[Model-Based]
    A --> G[Offline RL]
    A --> H[LLM Post-Training]
    B --> D[On-Policy]
    B --> E[Off-Policy]
    B --> F[Multi-Agent]
    B --> I[Distributed]
    D --> VPG[<a href='vpg/'>VPG</a>]
    D --> A2C_node[<a href='a2c/'>A2C</a>]
    D --> PPO_node[<a href='ppo/'>PPO</a>]
    D --> TRPO_node[<a href='trpo/'>TRPO</a>]
    E --> DQN_node[<a href='dqn/'>DQN</a>]
    E --> TD3_node[<a href='td3/'>TD3</a>]
    E --> SAC_node[<a href='sac/'>SAC</a>]
    E --> MPO_node[<a href='mpo/'>MPO</a>]
    F --> MAPPO_node[<a href='mappo/'>MAPPO</a>]
    F --> QMIX_node[<a href='qmix/'>QMIX</a>]
    I --> IMPALA_node[<a href='impala/'>IMPALA</a>]
    C --> Dreamer_node[<a href='dreamer/'>DreamerV3</a>]
    G --> CQL_node[<a href='cql/'>CQL</a>]
    G --> CalQL_node[<a href='calql/'>Cal-QL</a>]
    G --> IQL_node[<a href='iql/'>IQL</a>]
    G --> TD3BC_node[<a href='td3bc/'>TD3+BC</a>]
    G --> BC_node[<a href='bc/'>BC</a>]
    G --> AWR_node[<a href='awr/'>AWR</a>]
    G --> DT_node[<a href='dt/'>Decision Transformer</a>]
    G --> Diff_node[<a href='diffusion/'>Diffusion Policy</a>]
    H --> GRPO_node[<a href='grpo/'>GRPO</a>]
    H --> DPO_node[<a href='dpo/'>DPO</a>]

    style A fill:#e8eaf6,stroke:#3949ab
    style B fill:#e3f2fd,stroke:#1976d2
    style C fill:#fff3e0,stroke:#f57c00
    style D fill:#e8f5e9,stroke:#388e3c
    style E fill:#fce4ec,stroke:#c62828
    style F fill:#f3e5f5,stroke:#7b1fa2
    style G fill:#e0f7fa,stroke:#00838f
    style H fill:#fff8e1,stroke:#f9a825
    style I fill:#fbe9e7,stroke:#d84315

Comparison Table

| Algorithm | Action Space | Policy Type | Data Efficiency | Stability | Complexity |
| --- | --- | --- | --- | --- | --- |
| VPG | Discrete / Continuous | Stochastic | Low | Low | Minimal |
| A2C | Discrete / Continuous | Stochastic | Low | Medium | Low |
| PPO | Discrete / Continuous | Stochastic | Low | High | Low |
| TRPO | Discrete / Continuous | Stochastic | Low | High | Medium |
| DQN | Discrete only | Value-based | Medium | Medium | Low |
| TD3 | Continuous only | Deterministic | High | High | Medium |
| SAC | Continuous | Stochastic | High | High | Medium |
| MPO | Continuous | Stochastic | High | High | High |
| IMPALA | Discrete / Continuous | Stochastic | Medium | Medium | High |
| DreamerV3 | Discrete / Continuous | Learned model | Very high | Medium | High |
| MAPPO | Discrete / Continuous | Stochastic (CTDE) | Low | High | Medium |
| QMIX | Discrete only | Value decomposition | Medium | Medium | Medium |
| CQL | Continuous | Stochastic (offline) | N/A (offline) | High | Medium |
| Cal-QL | Continuous | Stochastic (offline) | N/A (offline) | High | Medium |
| IQL | Continuous | Deterministic (offline) | N/A (offline) | High | Low |
| TD3+BC | Continuous | Deterministic (offline) | N/A (offline) | High | Low |
| BC | Discrete / Continuous | Supervised | N/A (offline) | High | Minimal |
| AWR | Discrete / Continuous | Stochastic | Medium | Medium | Low |
| Decision Transformer | Discrete / Continuous | Sequence model | N/A (offline) | High | Medium |
| Diffusion Policy | Continuous | Diffusion | N/A (offline) | High | High |
| GRPO | Token sequences | Stochastic (LLM) | N/A | Medium | Medium |
| DPO | Token sequences | Stochastic (LLM) | N/A | High | Low |

Choosing an algorithm

Start with PPO. It works across discrete and continuous action spaces, is stable, and requires minimal tuning (its clipped objective is sketched after this list). Branch out from there:

  • Continuous control with sample efficiency constraints -- use SAC or TD3
  • Principled off-policy with KL constraints -- use MPO
  • Discrete actions with replay -- use DQN (with Double + Dueling extensions)
  • Multi-agent cooperative tasks -- use MAPPO or QMIX
  • Pixel observations or complex dynamics -- use DreamerV3
  • Large-scale distributed training -- use IMPALA
  • Formal trust-region guarantees -- use TRPO
  • Offline RL (fixed dataset, no interaction):
    • Start with IQL or TD3+BC for simplicity (the TD3+BC actor update is sketched below)
    • Use CQL or Cal-QL for stronger value conservatism
    • Use BC when data is expert-quality
    • Use Decision Transformer for large datasets with return conditioning
    • Use Diffusion Policy for multimodal action distributions
    • Use AWR for a simple advantage-weighted approach
  • LLM post-training:
    • Use DPO when you have pairwise preference data (its loss is sketched below)
    • Use GRPO for reward-based optimization without a critic (its group-relative advantage step is sketched below)
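
The clipped surrogate objective is what makes PPO the safe default above. Below is a minimal PyTorch sketch of it; the function name, argument layout, and the 0.2 clipping default are illustrative assumptions, not rlox's API.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss (to be minimized), per Schulman et al. (2017)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # negate to maximize the surrogate
```

Clipping removes any incentive to push the probability ratio outside [1 - ε, 1 + ε], which is where PPO's stability comes from without TRPO's second-order machinery.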
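TD3+BC is suggested as an offline starting point because the change relative to TD3 is essentially one line in the actor update: a behavioral-cloning term weighted against the critic. A hedged sketch, assuming actor and critic are ordinary callables returning actions and Q-values (the names and the alpha = 2.5 default follow the TD3+BC paper, not rlox identifiers):

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, states, dataset_actions, alpha=2.5):
    """Actor objective from TD3+BC: maximize Q while imitating dataset actions."""
    pi = actor(states)
    q = critic(states, pi)
    lam = alpha / q.abs().mean().detach()  # adaptive weight balancing RL against BC
    return -lam * q.mean() + F.mse_loss(pi, dataset_actions)
```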
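DPO's appeal for preference data is that the whole algorithm collapses to a supervised loss over chosen/rejected pairs. A sketch under the assumption that per-response log-probabilities have already been summed over tokens (names and the beta = 0.1 default are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over preference pairs."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)        # implicit reward, preferred response
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)  # implicit reward, rejected response
    return -F.logsigmoid(chosen - rejected).mean()
```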
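GRPO avoids a learned critic by scoring each sampled response against the other responses drawn for the same prompt. A sketch of that group-relative advantage step (the tensor layout is an assumption):

```python
import torch

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: normalize rewards within each prompt's group of samples.

    rewards: tensor of shape [num_prompts, group_size], one scalar reward per sampled response.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

These advantages then weight a PPO-style clipped objective over the response tokens, so no value network is required.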

All algorithms

On-policy

  • VPG
  • A2C
  • PPO
  • TRPO

Off-policy

  • DQN
  • TD3
  • SAC
  • MPO

Distributed

  • IMPALA

Model-based

  • DreamerV3

Multi-agent

  • MAPPO
  • QMIX

Offline RL

  • CQL
  • Cal-QL
  • IQL
  • TD3+BC
  • BC
  • AWR
  • Decision Transformer

Policy as Diffusion

  • Diffusion Policy

LLM Post-Training

  • GRPO
  • DPO