Benchmark: End-to-End Rollout Collection
This is the most realistic benchmark in the suite. It measures the full PPO-style rollout pipeline: reset M environments, step each of them N times, store all transitions, and compute GAE.
What is Measured
Each framework executes the complete pipeline:
| Framework | Pipeline |
|---|---|
| rlox | VecEnv.reset_all() → loop N: VecEnv.step_all(actions) + ExperienceTable.push() per env → compute_gae() |
| SB3 | DummyVecEnv.reset() → loop N: env.step(actions) + collect obs/rewards → Python GAE loop |
| TorchRL | SerialEnv.reset() → loop N: env.step(td) + collect TensorDicts → generalized_advantage_estimate() |
All use a dummy policy (action=0) to isolate infrastructure cost from neural network inference.
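The pipeline shape shared by all three frameworks can be sketched in plain Python. This is a minimal illustration, not the benchmark code and not any framework's API: `ToyEnv`, `collect_rollout`, `gae_reference`, and `run_pipeline` are hypothetical names, the toy environment stands in for CartPole, and the value estimates are set to zero because the benchmark uses no network.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for CartPole: fixed-length episodes, reward 1 per step."""

    def __init__(self, episode_len=10, seed=0):
        self.episode_len = episode_len
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.random()  # scalar observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        # Auto-reset on episode end, as vectorized env wrappers typically do.
        obs = self.reset() if done else self.rng.random()
        return obs, 1.0, done


def collect_rollout(envs, n_steps):
    """Reset M envs, step them N times, and store every transition."""
    obs = [env.reset() for env in envs]
    obs_buf, rewards, dones = [], [], []  # each indexed as [step][env]
    for _ in range(n_steps):
        actions = [0] * len(envs)  # dummy policy: action = 0
        results = [env.step(a) for env, a in zip(envs, actions)]
        obs_buf.append(obs)
        obs = [r[0] for r in results]
        rewards.append([r[1] for r in results])
        dones.append([r[2] for r in results])
    return obs_buf, rewards, dones


def gae_reference(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Reverse-time GAE for one env; dones[t] marks an episode end after step t."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages


def run_pipeline(num_envs, n_steps):
    """Reset, collect, then compute GAE per env (zero values: no network here)."""
    envs = [ToyEnv(seed=i) for i in range(num_envs)]
    _, rewards, dones = collect_rollout(envs, n_steps)
    advantages = []
    for e in range(num_envs):
        env_rewards = [rewards[t][e] for t in range(n_steps)]
        env_dones = [dones[t][e] for t in range(n_steps)]
        values = [0.0] * n_steps
        advantages.append(gae_reference(env_rewards, values, env_dones, 0.0))
    return advantages


advantages = run_pipeline(num_envs=4, n_steps=8)
print(len(advantages), len(advantages[0]))
```

Every stage here is a Python-level loop, which is the kind of per-step overhead the benchmark is designed to expose.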
Configurations
| Config | Envs (M) | Steps (N) | Total Transitions |
|---|---|---|---|
| Small | 16 | 128 | 2,048 |
| Medium | 64 | 512 | 32,768 |
| Large | 256 | 2,048 | 524,288 |
Results
Raw Timings & Throughput
| Config | rlox | SB3 | TorchRL | rlox throughput |
|---|---|---|---|---|
| 16 × 128 (2K) | 5.8 ms | 10.0 ms | 126.5 ms | 353,698 trans/s |
| 64 × 512 (33K) | 60.5 ms | 132.8 ms | 1,722.1 ms | 541,695 trans/s |
| 256 × 2048 (524K) | 680.7 ms | 2,048.3 ms | 27,501.5 ms | 770,168 trans/s |
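The throughput column is total transitions (M × N) divided by wall-clock time. The short sketch below recomputes it from the rounded timings in the table; the helper name `throughput` is hypothetical, and the results differ slightly from the published column, presumably because that column was derived from unrounded timings.

```python
def throughput(num_envs, num_steps, elapsed_ms):
    """Transitions per second for an M x N rollout that took elapsed_ms."""
    total_transitions = num_envs * num_steps
    return total_transitions / (elapsed_ms / 1000.0)


# rlox timings copied from the table above (milliseconds).
rlox_ms = {(16, 128): 5.8, (64, 512): 60.5, (256, 2048): 680.7}

for (m, n), ms in rlox_ms.items():
    print(f"{m} x {n}: {throughput(m, n, ms):,.0f} trans/s")
```

The same function applied to the SB3 and TorchRL timings for the largest configuration gives roughly 256K and 19K transitions per second.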
Speedup
| Config | vs SB3 | 95% CI | vs TorchRL | 95% CI |
|---|---|---|---|---|
| 16 × 128 | 1.7x | [1.7, 1.8] | 21.9x | [21.4, 23.3] |
| 64 × 512 | 2.2x | [2.2, 2.2] | 28.5x | [28.0, 28.8] |
| 256 × 2048 | 3.0x | [3.0, 3.0] | 40.4x | [40.1, 40.6] |
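This page does not state how the confidence intervals were computed. One common approach for a ratio of timings is a percentile bootstrap over repeated runs, sketched below as an assumption rather than a description of the benchmark harness. The function name `bootstrap_speedup_ci` and the sample timings are hypothetical.

```python
import random
import statistics


def bootstrap_speedup_ci(baseline_ms, candidate_ms, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for median(baseline) / median(candidate)."""
    rng = random.Random(seed)
    point = statistics.median(baseline_ms) / statistics.median(candidate_ms)
    ratios = []
    for _ in range(n_boot):
        # Resample each set of runs independently, with replacement.
        b = rng.choices(baseline_ms, k=len(baseline_ms))
        c = rng.choices(candidate_ms, k=len(candidate_ms))
        ratios.append(statistics.median(b) / statistics.median(c))
    ratios.sort()
    lo = ratios[int((alpha / 2) * n_boot)]
    hi = ratios[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Hypothetical per-run timings in milliseconds, for illustration only.
sb3_runs = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
rlox_runs = [5.8, 5.7, 5.9, 5.8, 5.8, 5.7, 5.9, 5.8]

point, lo, hi = bootstrap_speedup_ci(sb3_runs, rlox_runs)
print(f"{point:.2f}x [{lo:.2f}, {hi:.2f}]")
```

Tight intervals such as [3.0, 3.0] would indicate very low run-to-run variance at the largest configuration.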
Analysis
Why rlox advantage grows with scale
At 16 envs, rlox is 1.7x faster than SB3. At 256 envs, it's 3.0x. The scaling comes from:
- Environment stepping: rlox VecEnv uses a Rayon thread pool (true parallelism across cores). SB3 DummyVecEnv is sequential Python, so stepping 256 CartPole envs costs 256× the single-step cost.
- Transition storage: rlox pushes transitions with a single PyO3 boundary crossing per env-step. SB3 creates Python dicts, copies numpy arrays, and does per-element validation.
- GAE computation: rlox computes GAE in ~540 µs for 524K steps. SB3's Python loop takes ~470 ms for the same data, which is about 23% of its 2,048.3 ms total at the largest configuration.
- Compounding: The per-stage advantages combine. At 256×2048, the env stepping advantage (~7x), the buffer push advantage (~4x for small obs), and the GAE advantage (~142x) yield an overall 3.0x. The per-stage factors do not multiply; each is weighted by that stage's share of total runtime.
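How per-stage speedups combine can be checked with an Amdahl-style calculation: the overall speedup is the reciprocal of the sum of each stage's baseline time fraction divided by its speedup. The sketch below uses only figures quoted on this page (SB3's ~470 ms GAE loop out of 2,048.3 ms, and the ~142x GAE advantage); the function name `overall_speedup` is hypothetical, and the scenario of accelerating GAE alone is an illustration, not a measured result.

```python
def overall_speedup(stages):
    """stages: list of (fraction_of_baseline_time, stage_speedup) pairs.

    Fractions must sum to 1. Returns baseline_time / new_time.
    """
    total_fraction = sum(f for f, _ in stages)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    return 1.0 / sum(f / s for f, s in stages)


# Share of SB3's largest-config runtime spent in the GAE loop, from the text.
gae_fraction = 470.0 / 2048.3

# Illustration: speed up GAE by 142x and leave everything else unchanged.
only_gae = overall_speedup([(gae_fraction, 142.0), (1.0 - gae_fraction, 1.0)])
print(f"GAE alone: {only_gae:.2f}x overall")
```

Under these figures, a 142x faster GAE on its own would give only about 1.3x end to end, which is why the large per-stage factors do not translate directly into the overall number.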
TorchRL overhead
TorchRL is 40.4x slower than rlox at the largest scale. The TensorDict abstraction wraps every observation, reward, done flag, and action in typed tensor metadata. For lightweight environments like CartPole, the metadata cost far exceeds the actual computation.
Peak throughput
rlox reaches 770K transitions/second at 256 envs × 2048 steps. This is near the theoretical limit for CartPole on this hardware when including storage and GAE.
At the same configuration, SB3 plateaus at ~256K trans/s and TorchRL at ~19K trans/s.
Source: benchmarks/bench_e2e_rollout.py