rlox Benchmark Results

Three-framework performance comparison: rlox (Rust/PyO3) vs TorchRL (PyTorch) vs Stable-Baselines3 (Python/NumPy).

Measured on Apple M4 (14 cores), macOS 26.2, Python 3.12.7, PyTorch 2.10.0. All speedups are reported with bootstrap 95% confidence intervals (10,000 resamples). Last updated: 2026-03-29.
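As an illustration of how a bootstrap confidence interval on a speedup can be computed, here is a minimal sketch. It assumes a percentile bootstrap over the ratio of median timings, with each timing array resampled independently. The function name `bootstrap_speedup_ci` and the choice of the median are assumptions made for this example; the exact procedure used by the benchmark suite is described in the Setup & Methodology report.

```python
import numpy as np


def bootstrap_speedup_ci(baseline_times, candidate_times,
                         n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for median(baseline) / median(candidate).

    Hypothetical helper for illustration only; not part of rlox.
    """
    rng = np.random.default_rng(seed)
    baseline = np.asarray(baseline_times, dtype=np.float64)
    candidate = np.asarray(candidate_times, dtype=np.float64)

    # Draw resample indices with replacement, one row per bootstrap replicate.
    b_idx = rng.integers(0, baseline.size, size=(n_resamples, baseline.size))
    c_idx = rng.integers(0, candidate.size, size=(n_resamples, candidate.size))

    # Speedup of each replicate: ratio of the resampled medians.
    ratios = np.median(baseline[b_idx], axis=1) / np.median(candidate[c_idx], axis=1)

    # The alpha/2 and 1 - alpha/2 quantiles bound the (1 - alpha) interval.
    lo, hi = np.quantile(ratios, [alpha / 2, 1 - alpha / 2])
    point = np.median(baseline) / np.median(candidate)
    return point, lo, hi
```

Called with two arrays of per-iteration timings, the function returns the point estimate followed by the lower and upper bounds of the interval. A ratio of medians is less sensitive to occasional slow iterations than a ratio of means, which is why it is used in this sketch.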

Summary

Infrastructure Operations

Category	vs SB3	vs TorchRL	vs NumPy
GAE (2048 steps)	—	1,635x	142x
Buffer push (CartPole)	4.0x	60.8x	—
Buffer sample (batch=1024)	6.1x	6.5x	—
E2E rollout (256×2048)	3.0x	40.4x	—
Env stepping (256 envs)	—	120x	6.7x (Gym)
GRPO advantages (256×16)	—	—	34x
Token KL (128 tokens)	—	—	4.7x

TRL Comparison (LLM Post-Training Primitives)

Category	vs TRL-style CPU	vs NumPy
GRPO advantages (16×4)	14.2x	13.0x
GRPO advantages (256×16)	6.7x	2.4x
GRPO advantages (1024×32)	4.0x	1.3x
Token KL Schulman (B=1, T=128)	5.5x	2.5x
Token KL Schulman (B=1, T=8192)	2.7x	1.5x
Token KL Schulman (B=32, T=2048)	0.6x	1.2x
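For readers unfamiliar with the two primitives in the table above, the following NumPy sketch shows what they compute. It is a reference formulation, not rlox's implementation or API. The function names, the population standard deviation (`ddof=0`), the `eps` value, and the per-sequence mean reduction are all assumptions made for this example.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages.

    rewards: array of shape (n_prompts, group_size), one row per prompt.
    Each reward is standardized against the other completions of its prompt.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)  # ddof=0 is an assumption here
    return (rewards - mean) / (std + eps)


def schulman_kl(policy_logp, ref_logp):
    """Schulman's k3 KL estimator, averaged over tokens.

    Inputs have shape (B, T) and hold per-token log-probabilities.
    Per token: exp(d) - d - 1 with d = ref_logp - policy_logp,
    which is non-negative and zero when the two policies agree.
    """
    d = np.asarray(ref_logp, dtype=np.float64) - np.asarray(policy_logp, dtype=np.float64)
    # expm1(d) - d equals exp(d) - 1 - d, with better precision near d = 0.
    return (np.expm1(d) - d).mean(axis=-1)
```

In this notation, a row such as "GRPO advantages (256×16)" corresponds to a `rewards` array of shape `(256, 16)`, and "Token KL Schulman (B=32, T=2048)" to inputs of shape `(32, 2048)`.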

Neural Network Backends (Burn vs Candle vs PyTorch)

Category	Burn	Candle	PyTorch
DQN TD step (batch=64)	191 us	98 us	738 us
PPO step (batch=64)	1,885 us	328 us	1,440 us
SAC sample (batch=1)	91 us	14 us	52 us
Critic step (batch=256)	2,090 us	3,453 us	2,325 us

Detailed Reports

Report	What it measures	Frameworks
Setup & Methodology	Hardware, software versions, statistical methods, fairness constraints	—
GAE Computation	Generalized Advantage Estimation at various trajectory lengths	rlox, TorchRL, NumPy
Buffer Operations	Push throughput and sample latency	rlox, TorchRL, SB3
End-to-End Rollout	Full pipeline: step + store + GAE	rlox, SB3, TorchRL
Environment Stepping	Single-step latency, vectorized throughput scaling	rlox, Gymnasium, SB3, TorchRL
LLM Operations	GRPO advantages, token-level KL divergence	rlox, NumPy, PyTorch
NN Backends	Inference and training step latency	Burn, Candle, PyTorch
TRL Comparison	GRPO advantages, Schulman KL vs TRL-style PyTorch	rlox, PyTorch (TRL-style), NumPy

Key Findings

GAE is the standout result: rlox is 142x faster than the NumPy baseline, which runs the sequential backward scan as a Python loop and pays interpreter overhead on every iteration. Against TorchRL the gap is 1,635x, due to per-element TensorDict metadata overhead.
Buffer speedups shrink as observation size grows: At CartPole scale (obs=4), rlox is 61x faster than TorchRL on push because per-call overhead dominates. At Atari-sized observations (obs=28,224), the gap narrows to 1.8x as memcpy dominates. At that scale SB3 edges ahead of rlox (0.8x) because its pre-allocated NumPy arrays avoid reallocation.
End-to-end gains are smaller than component gains: Individual speedups (env stepping: ~7x, buffer: ~4x, GAE: ~142x) combine to 3.0x vs SB3 at the largest configuration. The GAE advantage is diluted because env stepping dominates wall-clock time.
Env stepping scales with parallelism: rlox reaches 2.7M steps/s at 512 envs (8.2x vs Gymnasium). At low env counts (4), Rayon thread pool overhead makes rlox slower than sequential Python. The crossover is at ~16 envs.
Tail latency matters: rlox's buffer sample p99 is 17.0us at batch=1024, vs 109us for TorchRL and 75us for SB3. A pre-allocated ring buffer combined with a deterministic ChaCha8 RNG eliminates GC pressure.
Small-array operations favor rlox: GRPO advantages (34x) and token KL at short sequences (4.7x) win because each Python function call to NumPy/PyTorch costs ~500ns in dispatch overhead. rlox's PyO3 boundary crossing costs ~100ns.
Candle excels at low-latency inference: At batch=1, Candle is 5-6x faster than Burn and 3-4x faster than PyTorch for inference. At larger batches (256+), Burn's GEMM-optimized NdArray backend catches up and often wins for training steps.
TRL-style GRPO is 4-14x faster on CPU: rlox's batched Rust API beats TRL's vectorized reshape + repeat_interleave approach by 4-14x. The gap narrows at larger batch sizes as compute overtakes dispatch overhead. However, in a full LLM training step, advantage computation is <0.01% of wall-clock time, so the value lies in low-latency serving and CPU-only deployments.
Batched KL crossover at large tensors: For single sequences (B=1), rlox's Schulman KL is 2.7-5.5x faster than TRL-style PyTorch. At B=32, T=2048 (about 65K elements), PyTorch's SIMD-vectorized tensor ops win (rlox runs at 0.6x) because rlox pays 32 per-sequence PyO3 crossings.
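To make the GAE finding concrete, here is a sketch of the sequential backward scan written as a plain Python loop, the style of baseline that pays interpreter overhead on every step. It follows the standard GAE recurrence. The function name, argument layout, and default `gamma` and `lam` values are assumptions for illustration and are not taken from the rlox or benchmark source.

```python
import numpy as np


def gae_backward_scan(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Reference GAE over one trajectory of length T.

    rewards, values, dones: 1-D sequences of length T.
    dones[t] is truthy if the episode ended after step t.
    last_value: value estimate for the state after the final step.
    Returns (advantages, returns), each of shape (T,).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = rewards.shape[0]
    advantages = np.zeros(T, dtype=np.float64)

    next_value = float(last_value)
    running = 0.0
    # Each step depends on the result of the step after it, so the loop
    # cannot be replaced by a single vectorized NumPy expression.
    for t in range(T - 1, -1, -1):
        not_done = 1.0 - float(dones[t])
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, also reset at boundaries.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]

    return advantages, advantages + values
```

For a 2048-step trajectory this loop executes 2048 interpreted iterations, each doing a handful of scalar operations, so interpreter overhead rather than arithmetic accounts for most of the runtime.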

Reproducing

# Python benchmarks (infrastructure ops)
python benchmarks/run_all.py                   # Full suite
python benchmarks/bench_gae.py                 # Just GAE
python benchmarks/bench_buffer_ops.py          # Just buffers
python benchmarks/bench_e2e_rollout.py         # Just E2E
python benchmarks/bench_llm_ops.py             # Just LLM ops
python benchmarks/bench_env_stepping.py        # Just env stepping
python benchmarks/bench_nn_backends.py         # PyTorch NN baseline
python benchmarks/bench_trl_comparison.py      # rlox vs TRL-style ops

# Rust benchmarks (NN backends)
cargo bench -p rlox-bench --bench nn_backends  # Burn vs Candle

Raw JSON data is written to benchmark_results/ (gitignored). Criterion HTML reports are in target/criterion/.