Benchmark Setup & Methodology

Hardware

Component	Specification
CPU	Apple M4 (arm64), 14 cores
OS	macOS 26.2 (Darwin 25.2.0)
RAM	Unified memory architecture

Software Versions

Package	Version
Python	3.12.7
NumPy	2.4.2
PyTorch	2.10.0 (CPU only, no CUDA)
TorchRL	0.11.1
Stable-Baselines3	2.7.1
tensordict	0.11.0
Gymnasium	1.2.3
rlox	1.0.0 (Rust via PyO3 0.23.5, maturin --release)

Statistical Methodology

Timing

Clock: time.perf_counter_ns(), a monotonic clock that reports integer nanoseconds (the effective resolution depends on the platform)
No GC manipulation: We do not disable garbage collection. GC pauses are part of the real performance profile.
CPU: All benchmarks run on CPU. No GPU benchmarks.

Warmup & Repetitions

Benchmark Type	Warmup	Measurement Reps	Rationale
Single-step latency	200	1000	Amortize PyO3/JIT init
Vectorized env stepping	5 rounds	50 (20 for subprocess)	Amortize Rayon/process warmup
Buffer push	2 rounds	10	Amortize initial allocations
Buffer sample	10	100	Amortize cache priming
GAE computation	10	100	Amortize numpy/torch alloc
End-to-end rollout	1 full rollout	10	Amortize everything
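The clock and the warmup/repetition scheme above can be sketched as a small harness. This is a minimal illustration, not the code in benchmarks/; the function name run_benchmark and the stand-in workload are hypothetical. Garbage collection is left enabled, as stated under Timing.

```python
import gc
import time
from typing import Callable, List


def run_benchmark(fn: Callable[[], object], warmup: int, reps: int) -> List[int]:
    """Return per-call wall-clock durations in nanoseconds.

    Hypothetical helper: warmup calls run but are not recorded,
    and the garbage collector is never disabled.
    """
    for _ in range(warmup):
        fn()

    samples: List[int] = []
    for _ in range(reps):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples


if __name__ == "__main__":
    # Stand-in workload (assumption); a real run would time an env step,
    # a buffer sample, a GAE call, and so on.
    def workload() -> int:
        return sum(range(1_000))

    # Warmup and repetition counts from the "Single-step latency" row.
    samples = run_benchmark(workload, warmup=200, reps=1000)
    print(f"collected {len(samples)} samples, GC enabled: {gc.isenabled()}")
```

Keeping the raw per-call samples, rather than a running mean, is what allows medians, percentiles, and bootstrap intervals to be computed afterwards.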

Summary Statistics

Primary metric: Median (robust to GC pauses and context switches)
Dispersion: IQR (p25–p75), p99 (tail latency)
Comparison: Bootstrap 95% confidence interval on the speedup ratio
    Draw 10,000 bootstrap resamples (with replacement) from each of the two timing distributions
    Take the median of each resample and compute the ratio of the two medians
    Report the [p2.5, p97.5] interval of the resulting ratio distribution
Significance: A CI lower bound > 1.0 means rlox is faster at the 95% confidence level
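The summary statistics and the bootstrap comparison can be sketched with NumPy as follows. This is an illustrative sketch, not the project's implementation: the function names are hypothetical, and the ratio is assumed to be the baseline median divided by the rlox median, so that values above 1.0 favor rlox.

```python
import numpy as np


def summarize(samples_ns):
    """Median, IQR bounds (p25, p75), and p99 of a timing array."""
    x = np.asarray(samples_ns, dtype=np.float64)
    p25, p50, p75, p99 = np.percentile(x, [25, 50, 75, 99])
    return {"median": p50, "p25": p25, "p75": p75, "p99": p99}


def bootstrap_speedup_ci(baseline_ns, rlox_ns, n_resamples=10_000, seed=42):
    """95% bootstrap CI for median(baseline) / median(rlox).

    Assumption: the speedup ratio is baseline time over rlox time.
    Timings are assumed strictly positive, so the division is safe.
    """
    rng = np.random.default_rng(seed)
    base = np.asarray(baseline_ns, dtype=np.float64)
    ours = np.asarray(rlox_ns, dtype=np.float64)

    ratios = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample each distribution independently, with replacement.
        b = rng.choice(base, size=base.size, replace=True)
        o = rng.choice(ours, size=ours.size, replace=True)
        ratios[i] = np.median(b) / np.median(o)

    lo, hi = np.percentile(ratios, [2.5, 97.5])
    return float(lo), float(hi)
```

With two timing arrays in hand, `lo, hi = bootstrap_speedup_ci(sb3_ns, rlox_ns)` gives the interval, and `lo > 1.0` corresponds to the significance criterion above (the array names here are placeholders).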

Fairness Constraints

Same env dynamics: rlox has a native CartPole; TorchRL and SB3 use Gymnasium's CartPole-v1. Both implement identical physics (validated via correctness tests).
Same observation shapes: CartPole = (4,) float32 across all frameworks.
Same batch sizes: Buffer sample sizes and GAE trajectory lengths are identical.
Deterministic seeding: seed=42 everywhere. Buffer sampling uses deterministic ChaCha8 RNG.
CPU only: TorchRL can offload to GPU; we use device="cpu" for fair comparison.
Exclude one-time setup: Env creation and buffer allocation are not included in timing.
Idiomatic API: Each framework is used via its recommended API (TorchRL with TensorDict, SB3 with its ReplayBuffer API).

GIL Considerations

Framework	GIL Behavior
rlox	Releases GIL during all Rust computation (Rayon threads run freely)
TorchRL	Releases GIL in C++ PyTorch kernels; TensorDict operations hold GIL
SB3	Mostly holds the GIL (Python-level logic plus NumPy; NumPy can release it inside some C loops)
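One rough way to observe the difference between GIL-releasing and GIL-holding code is to compare serial and threaded wall-clock time for the same call. The sketch below uses only the standard library; thread_speedup and both workloads are hypothetical stand-ins and do not call rlox, TorchRL, or SB3. A result near the thread count suggests the GIL is released during the call, while a result near 1 suggests it is held (on a standard GIL build of CPython).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def thread_speedup(fn, n_threads=4):
    """Serial time divided by threaded time for n_threads calls of fn."""
    start = time.perf_counter_ns()
    for _ in range(n_threads):
        fn()
    serial_ns = time.perf_counter_ns() - start

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        start = time.perf_counter_ns()
        futures = [pool.submit(fn) for _ in range(n_threads)]
        for future in futures:
            future.result()
        threaded_ns = time.perf_counter_ns() - start

    return serial_ns / threaded_ns


def releases_gil():
    # time.sleep releases the GIL while waiting, so threads overlap.
    time.sleep(0.05)


def holds_gil():
    # Pure-Python bytecode runs under the GIL, so threads take turns.
    total = 0
    for i in range(200_000):
        total += i
    return total


if __name__ == "__main__":
    print(f"GIL released: {thread_speedup(releases_gil):.2f}x")
    print(f"GIL held:     {thread_speedup(holds_gil):.2f}x")
```

The same helper could be pointed at a framework call, for example a buffer sample, to check where it falls between the two extremes.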

Reproducing

# Install dependencies
python3 -m venv .venv && source .venv/bin/activate
pip install maturin numpy gymnasium stable-baselines3 torchrl

# Build rlox
maturin develop --release

# Run full suite
python benchmarks/run_all.py

# Or individual benchmarks
python benchmarks/bench_buffer_ops.py
python benchmarks/bench_gae.py
python benchmarks/bench_llm_ops.py
python benchmarks/bench_e2e_rollout.py
python benchmarks/bench_env_stepping.py

Raw JSON results are written to benchmark_results/ (gitignored). Each file contains full timing arrays, system info, and comparison data.