rlox — Rust-Accelerated Reinforcement Learning

The Polars architecture pattern applied to RL: Rust data plane + Python control plane.


Why rlox?

RL frameworks like Stable-Baselines3 and TorchRL run the whole training loop in Python, including environment stepping, buffer management, and advantage computation. This works, but Python interpreter overhead can become the bottleneck long before your GPU does.

rlox moves the compute-heavy, latency-sensitive work (environment stepping, buffers, GAE) to Rust while keeping training logic, configs, and neural networks in Python via PyTorch.

Result: 3-50x faster than SB3/TorchRL on data-plane operations, with the same Python API you're used to.
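
To make "data-plane operation" concrete, here is a minimal NumPy sketch of Generalized Advantage Estimation (GAE) for a single trajectory. It is a plain-Python reference for the kind of loop rlox moves to Rust, not rlox's implementation; the function name `gae_reference` and its signature are assumptions made for illustration.

```python
import numpy as np


def gae_reference(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Reference GAE for one trajectory (hypothetical helper, not the rlox API).

    rewards, values, dones: 1-D sequences of equal length T.
    last_value: value estimate for the state after the final step.
    Returns (advantages, returns), both arrays of length T.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)

    n = len(rewards)
    advantages = np.zeros(n)
    next_value = float(last_value)
    gae = 0.0

    # Walk the trajectory backwards, accumulating discounted TD errors.
    for t in reversed(range(n)):
        not_done = 1.0 - dones[t]  # stop bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]

    returns = advantages + values
    return advantages, returns
```

The backward loop is inherently sequential per trajectory, which is why it is slow in the interpreter and a natural candidate for a compiled implementation that batches across environments.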

Quick Start

pip install rlox

from rlox import Trainer

trainer = Trainer("ppo", env="CartPole-v1", seed=42)
metrics = trainer.train(total_timesteps=50_000)
print(f"Mean reward: {metrics['mean_reward']:.1f}")

Or from the command line:

python -m rlox train --algo ppo --env CartPole-v1 --timesteps 100000

Architecture

graph TD
    subgraph Python[Python Control Plane]
        A[PPO / SAC / DQN / TD3 / A2C / MAPPO / DreamerV3 / IMPALA]
        B[PyTorch Policies & Networks]
        C[VecNormalize, Callbacks, YAML Configs, Dashboard]
    end

    subgraph Rust[Rust Data Plane — PyO3]
        D[VecEnv — Rayon Parallel Stepping]
        E[ReplayBuffer — Ring / Mmap / Priority]
        F[GAE / V-trace — Batched + Rayon]
        G[KL / GRPO — f32 + Rayon]
    end

    A --> B
    B <-->|zero-copy| D
    B <-->|zero-copy| E
    A --> F
    A --> G
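
As an illustration of the "Ring" variant of the replay buffer named in the diagram, the sketch below shows the general ring-buffer technique in NumPy: preallocated storage, a write cursor that wraps around, and uniform sampling over the filled region. The class `RingReplayBuffer` and its methods are hypothetical and do not mirror the rlox `ReplayBuffer` API.

```python
import numpy as np


class RingReplayBuffer:
    """Fixed-capacity buffer that overwrites the oldest entry when full.

    Hypothetical illustration of the ring-buffer technique, not rlox code.
    """

    def __init__(self, capacity, obs_dim, seed=0):
        self.capacity = capacity
        # Preallocate once so push() never allocates.
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.pos = 0   # next write index
        self.size = 0  # number of valid entries
        self.rng = np.random.default_rng(seed)

    def push(self, obs, reward):
        self.obs[self.pos] = obs
        self.rewards[self.pos] = reward
        self.pos = (self.pos + 1) % self.capacity  # wrap around
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        # Uniform sampling with replacement over the filled region.
        idx = self.rng.integers(0, self.size, size=batch_size)
        return self.obs[idx], self.rewards[idx]


buf = RingReplayBuffer(capacity=3, obs_dim=2)
for i in range(5):
    buf.push(np.full(2, i, dtype=np.float32), float(i))
# Only the three most recent transitions (rewards 2, 3, 4) remain.
```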

What's in the Docs

| Guide | Who it's for | What you'll learn |
|---|---|---|
| RL Introduction | New to RL | Key concepts with rlox code examples |
| Getting Started | New to rlox | Install, first training run, basic API |
| Python Guide | All users | Complete API reference with examples |
| Examples | All users | Copy-paste code for every algorithm |
| Custom Components | Intermediate | Custom networks, collectors, exploration, losses |
| Migrating from SB3 | SB3 users | Side-by-side API comparison |
| LLM Post-Training | LLM practitioners | DPO, GRPO, OnlineDPO, BestOfN |
| API Reference | All users | Auto-generated from docstrings |
| Benchmarks | Researchers | Performance comparison vs SB3/TRL |
| Math Reference | Researchers | GAE, V-trace, GRPO, DPO derivations |
| Rust Guide | Contributors | Crate architecture, extending in Rust |

Benchmark Highlights

| Component | vs SB3 | vs TorchRL |
|---|---|---|
| GAE (32K steps) | 147x vs NumPy | 1,700x |
| Buffer push (10K) | 9.7x | 148x |
| E2E rollout (256x2048) | 3.9x | 53x |
| GRPO advantages | 35x vs NumPy | 34x vs PyTorch |
| KL divergence (f32) | 2-9x vs TRL | |
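
For readers unfamiliar with the "GRPO advantages" row, the sketch below shows the group-relative normalization that GRPO is generally described as using: each completion's reward is standardized against the other completions sampled for the same prompt. This is a NumPy baseline of the sort the benchmark compares against, not rlox's Rust kernel. The function name, the use of population standard deviation, and the `eps` value are assumptions.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (hypothetical helper, not the rlox API).

    rewards: array-like of shape (num_prompts, group_size), one row per prompt.
    Returns an array of the same shape with per-row zero-mean advantages.
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)  # population std (assumption)
    # eps guards against division by zero when all rewards in a group match.
    return (rewards - mean) / (std + eps)
```

A group whose completions all receive the same reward yields zero advantages, so it contributes no gradient signal.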

Algorithms

  • On-policy: PPO, A2C, IMPALA, MAPPO — multi-env via RolloutCollector
  • Off-policy: SAC, TD3, DQN (Double, Dueling, PER, N-step) — multi-env via OffPolicyCollector
  • Offline RL: TD3+BC, IQL, CQL, BC — Rust-accelerated OfflineDatasetBuffer
  • Model-based: DreamerV3
  • LLM post-training: GRPO, DPO, OnlineDPO, BestOfN
  • Hybrid: HybridPPO — Candle inference + PyTorch training (180K SPS)

All algorithms support custom networks, exploration strategies, and collectors via protocol-based injection. See the SB3 migration guide for switching from Stable-Baselines3.
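
"Protocol-based injection" here most likely refers to structural typing: a component is accepted if it provides the expected methods, with no base class to inherit from. The sketch below illustrates the pattern with Python's `typing.Protocol`. Every name in it (`ExplorationStrategy`, `EpsilonGreedy`, `ToyAgent`, `select_action`) is hypothetical, and the actual rlox protocols may differ; see the Custom Components guide for the real interfaces.

```python
import random
from typing import Protocol, runtime_checkable


@runtime_checkable
class ExplorationStrategy(Protocol):
    """Hypothetical protocol: anything with a matching select_action method."""

    def select_action(self, greedy_action: int, num_actions: int, step: int) -> int: ...


class EpsilonGreedy:
    """Satisfies the protocol structurally; note there is no inheritance."""

    def __init__(self, epsilon: float, seed: int = 0) -> None:
        self.epsilon = epsilon
        self._rng = random.Random(seed)

    def select_action(self, greedy_action: int, num_actions: int, step: int) -> int:
        if self._rng.random() < self.epsilon:
            return self._rng.randrange(num_actions)  # explore
        return greedy_action  # exploit


class ToyAgent:
    """Stand-in for an algorithm that accepts an injected strategy."""

    def __init__(self, exploration: ExplorationStrategy) -> None:
        # runtime_checkable protocols only verify that the method exists.
        if not isinstance(exploration, ExplorationStrategy):
            raise TypeError("exploration must provide select_action()")
        self.exploration = exploration

    def act(self, greedy_action: int, num_actions: int, step: int = 0) -> int:
        return self.exploration.select_action(greedy_action, num_actions, step)


agent = ToyAgent(EpsilonGreedy(epsilon=0.0))
action = agent.act(greedy_action=1, num_actions=4)
```

The benefit of this design is that user code does not need to import or subclass framework classes to plug in a custom component.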