
Benchmark: Environment Stepping

Single-step latency and vectorized throughput — the foundation of every RL training loop. Compares rlox (Rust/Rayon) against Gymnasium, TorchRL, and Stable-Baselines3.

What is Measured

Single-Step Latency

One CartPole step + reset-on-done. Measures per-call overhead without vectorization.

| Framework | Implementation |
|---|---|
| rlox | `CartPole.step(1)` — native Rust, PyO3 boundary |
| Gymnasium | `gym.make("CartPole-v1").step(1)` — Python/C |
| TorchRL | `GymEnv("CartPole-v1").step(td)` — TensorDict wrapper over Gymnasium |
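The measurement pattern can be sketched in pure Python. This is a hedged stand-in, not the actual `bench_env_stepping.py` harness: `StubCartPole` and `bench_single_step` are hypothetical names, and the stub env does no real physics — it only exists so the timing loop (step, reset-on-done, per-call sampling) is runnable on its own.

```python
import statistics
import time

class StubCartPole:
    """Hypothetical stand-in env; the real benchmark steps
    rlox / Gymnasium / TorchRL CartPole environments."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return (0.0, 0.0, 0.0, 0.0)

    def step(self, action):
        self.t += 1
        obs = (0.0, 0.0, 0.0, 0.0)
        done = self.t >= 500          # CartPole-v1 episode cap
        return obs, 1.0, done

def bench_single_step(env, iters=10_000):
    """Median and IQR of per-call latency in ns, reset-on-done included."""
    env.reset()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        _, _, done = env.step(1)
        if done:
            env.reset()
        samples.append(time.perf_counter_ns() - t0)
    q1, med, q3 = statistics.quantiles(samples, n=4)
    return med, q3 - q1

median_ns, iqr_ns = bench_single_step(StubCartPole())
```

Reporting the median with the IQR (rather than the mean) is what the results tables below use; it keeps one-off scheduler hiccups from skewing the numbers.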

Vectorized Throughput

Step N environments in lockstep for 100 batch-steps. Measures parallelism scaling.

| Framework | Implementation |
|---|---|
| rlox VecEnv | Rayon thread pool, true parallelism across cores |
| Gymnasium SyncVectorEnv | Sequential Python loop |
| Gymnasium AsyncVectorEnv | Multiprocessing with shared memory |
| SB3 DummyVecEnv | Sequential Python loop |
| SB3 SubprocVecEnv | Multiprocessing with pipes |
| TorchRL SerialEnv | Sequential with TensorDict wrapping |
| TorchRL ParallelEnv | Multiprocessing (crashes at >1 env, see note) |
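The sequential baselines (SyncVectorEnv, DummyVecEnv, SerialEnv) all reduce to the same lockstep loop, sketched here with a hypothetical stub env rather than the real benchmark code — `StubEnv` and `bench_lockstep` are illustrative names, and the throughput formula (N envs × 100 batch-steps ÷ elapsed time) matches how the steps/s figures below are derived:

```python
import time

class StubEnv:
    """Hypothetical stand-in for a CartPole env."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        done = self.t >= 500          # episode cap triggers reset-on-done
        return 0.0, 1.0, done

def bench_lockstep(num_envs, batch_steps=100):
    """Env-steps/s for a sequential lockstep loop (SyncVectorEnv-style)."""
    envs = [StubEnv() for _ in range(num_envs)]
    for env in envs:
        env.reset()
    t0 = time.perf_counter()
    for _ in range(batch_steps):
        for env in envs:              # one pass over all envs = one batch-step
            _, _, done = env.step(0)
            if done:
                env.reset()
    elapsed = time.perf_counter() - t0
    return num_envs * batch_steps / elapsed

throughput = bench_lockstep(16)
```

rlox's VecEnv replaces the inner `for env in envs` loop with a Rayon parallel iterator; the multiprocessing variants replace it with IPC round-trips to worker processes.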

Results

Single-Step Latency

| Framework | Median | IQR |
|---|---|---|
| rlox | 292 ns | 42 ns |
| Gymnasium | 2,375 ns | 125 ns |
| TorchRL | 52,834 ns | 8,968 ns |

| Comparison | Speedup | 95% CI |
|---|---|---|
| vs Gymnasium | 8.1x | [8.1, 8.1] |
| vs TorchRL | 180.9x | [179.7, 182.4] |

Vectorized Throughput

Times are total milliseconds for 100 batch-steps.

| Num Envs | rlox (ms) | rlox (steps/s) | Gym Sync (ms) | vs Gym Sync | vs SB3 Dummy | vs TorchRL Serial |
|---|---|---|---|---|---|---|
| 1 | 0.07 | 1,472,852 | 0.60 | 8.9x | 9.8x | 153x |
| 4 | 3.61 | 110,859 | 1.49 | 0.4x | 0.6x | 16x |
| 16 | 2.10 | 762,124 | 4.86 | 2.3x | 2.9x | 43x |
| 64 | 4.44 | 1,441,739 | 18.12 | 4.1x | 5.0x | 80x |
| 128 | 5.44 | 2,353,969 | 37.37 | 6.9x | 8.6x | 136x |
| 256 | 12.44 | 2,058,154 | 82.79 | 6.7x | n/a | 120x |
| 512 | 19.11 | 2,679,404 | 156.12 | 8.2x | n/a | n/a |

Bridge Overhead

| Mode | Median |
|---|---|
| Native rlox CartPole | 334 ns |
| GymEnv bridge (rlox wrapping Gymnasium) | 2,980 ns |
| Bridge overhead | 2,646 ns (~2.6 µs) |
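The overhead row is simply the difference of the two medians — the extra cost the bridge adds to every step:

```python
native_ns = 334      # native rlox CartPole median step latency
bridge_ns = 2_980    # rlox GymEnv bridge median step latency
overhead_ns = bridge_ns - native_ns
print(overhead_ns)   # 2646 ns, i.e. ~2.6 µs added per step by the bridge
```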

Analysis

Why rlox loses at 4 envs

At 4 envs, rlox (3.61 ms) is slower than Gymnasium sync (1.49 ms). This is the Rayon thread pool startup cost: for only 4 lightweight CartPole steps (~37 ns each), the overhead of dispatching to Rayon threads and synchronizing exceeds the work itself. The crossover happens at ~16 envs, where the parallel work justifies the thread pool overhead.
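A toy cost model makes the crossover concrete. All constants here are assumptions for illustration (a ~30 µs dispatch cost per batch-step, 10 cores, ~6 µs per Python env step inferred from the 1-env Gym Sync row), not measurements from rlox internals:

```python
def batch_cost_ns(n_envs, dispatch_ns=30_000, rlox_step_ns=37,
                  cores=10, gym_step_ns=6_000):
    """Toy cost model for one batch-step (all constants assumed).

    rlox: fixed thread-pool dispatch plus work split across cores.
    Gym sync: no dispatch cost, but every step pays Python overhead.
    """
    rlox = dispatch_ns + n_envs * rlox_step_ns / cores
    gym_sync = n_envs * gym_step_ns
    return rlox, gym_sync

# The sequential loop wins while the fixed dispatch cost dominates;
# the thread pool wins once per-batch work outgrows it.
small = batch_cost_ns(4)    # rlox cost > gym cost
large = batch_cost_ns(64)   # rlox cost < gym cost
```

Under these assumed constants the model reproduces the measured ordering: sequential wins at 4 envs, the pool wins well before 64.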

Scaling behavior

rlox throughput scales from 111K steps/s (4 envs) to 2.7M steps/s (512 envs) — a 24x increase. This is near-linear scaling across the M4's cores. Gymnasium sync plateaus at ~330K steps/s regardless of env count because it's sequential.
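The 24x figure follows directly from the throughput column of the table above:

```python
# rlox throughputs from the vectorized table (env-steps/s)
rlox_steps_per_s = {4: 110_859, 16: 762_124, 64: 1_441_739,
                    128: 2_353_969, 512: 2_679_404}
scaling = rlox_steps_per_s[512] / rlox_steps_per_s[4]
print(round(scaling))   # 24 — the ~24x increase from 4 to 512 envs
```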

TorchRL overhead

TorchRL SerialEnv is 120-153x slower than rlox across all scales. The TensorDict metadata wrapping per step adds ~100 µs of overhead for a ~37 ns computation. TorchRL ParallelEnv crashes at >1 env with a tensor size mismatch error (TorchRL bug, not rlox).

Multiprocessing (SB3 SubprocVecEnv, Gymnasium AsyncVectorEnv)

Subprocess-based parallelism has high fixed overhead from IPC (pipes, shared memory). At 1 env, SB3 SubprocVecEnv is 163x slower than rlox. At 64 envs, the gap narrows to 7.3x as the parallel work amortizes IPC cost. rlox's in-process Rayon parallelism avoids IPC entirely.
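The amortization effect can be modeled with a fixed IPC round-trip per batch-step spread over the envs in that batch. The constants are assumptions for illustration, not measured values:

```python
def subproc_cost_per_step_ns(n_envs, ipc_ns=100_000, step_ns=3_000):
    """Toy model (assumed constants): one fixed IPC round-trip per
    batch-step, amortized over the envs stepped in that batch."""
    return ipc_ns / n_envs + step_ns

at_1 = subproc_cost_per_step_ns(1)    # IPC dominates: 103,000 ns/step
at_64 = subproc_cost_per_step_ns(64)  # amortized: ~4,563 ns/step
```

As `n_envs` grows, the fixed IPC term shrinks per env-step, which is why the gap to rlox narrows at scale; rlox never pays this term at all because Rayon stays in-process.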

Source: benchmarks/bench_env_stepping.py