Benchmark: GAE Computation

Generalized Advantage Estimation (GAE) is the core advantage calculation used by PPO and most on-policy RL algorithms. GAE is inherently sequential: it is a backward scan in which each step depends on the result of the step after it. The speedup therefore comes from eliminating Python interpreter overhead rather than from parallelism.
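The data dependency can be written as a recurrence. This is a sketch in the notation of this page, assuming d_t is the done flag at step t, V_T stands for last_value, and \lambda corresponds to lam:

```latex
\begin{aligned}
\delta_t &= r_t + \gamma\, V_{t+1}\,(1 - d_t) - V_t \\
A_t &= \delta_t + \gamma \lambda\,(1 - d_t)\, A_{t+1}, \qquad A_T = 0
\end{aligned}
```

Because A_t needs A_{t+1}, the T steps cannot be evaluated independently of one another.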

What is Measured

Each framework computes GAE over a single trajectory of length T:

- rlox: compute_gae(rewards, values, dones, last_value, gamma, lam), a Rust backward loop that accepts NumPy arrays
- TorchRL: generalized_advantage_estimate(gamma, lam, values, next_values, rewards, dones, terminated), a PyTorch C++ kernel called via TensorDict
- NumPy loop: a Python loop of the form for t in reversed(range(n)), identical to the CleanRL and SB3 internals

Parameters: gamma=0.99, lam=0.95, dones sampled at ~5% rate, seed=42.
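A minimal sketch of how inputs matching these parameters could be generated. The helper name make_inputs, the standard-normal rewards and values, and the float64 dtype are assumptions for illustration; the benchmark script may build its inputs differently.

```python
import numpy as np

GAMMA = 0.99
LAM = 0.95


def make_inputs(n_steps, done_rate=0.05, seed=42):
    """Build one synthetic trajectory (hypothetical helper, not from the benchmark)."""
    rng = np.random.default_rng(seed)
    rewards = rng.standard_normal(n_steps)
    values = rng.standard_normal(n_steps)
    # Each step ends an episode independently with probability done_rate.
    dones = (rng.random(n_steps) < done_rate).astype(np.float64)
    last_value = float(rng.standard_normal())
    return rewards, values, dones, last_value


rewards, values, dones, last_value = make_inputs(2048)
print(rewards.shape, dones.mean())
```

Fixing the seed keeps every framework working on the same trajectory, so timing differences are not caused by different inputs.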

Correctness

All implementations were validated against the NumPy reference to within rtol=1e-6, atol=1e-10 before benchmarking.
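A check of this kind might look like the sketch below. It compares a backward-scan reference against an independent O(n^2) forward summation using the stated tolerances. The function names are hypothetical, and the convention that dones[t] flags termination at step t is an assumption; neither rlox nor TorchRL is called here.

```python
import numpy as np


def gae_reference(rewards, values, dones, last_value, gamma, lam):
    """Backward-scan GAE; assumes dones[t] flags termination at step t."""
    n = len(rewards)
    advantages = np.zeros(n)
    next_value, last_gae = last_value, 0.0
    for t in reversed(range(n)):
        non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        last_gae = delta + gamma * lam * non_terminal * last_gae
        advantages[t] = last_gae
        next_value = values[t]
    return advantages


def gae_bruteforce(rewards, values, dones, last_value, gamma, lam):
    """Independent O(n^2) check: sum discounted TD errors forward from each t."""
    n = len(rewards)
    next_values = np.append(values[1:], last_value)
    non_terminal = 1.0 - dones
    deltas = rewards + gamma * next_values * non_terminal - values
    advantages = np.zeros(n)
    for t in range(n):
        weight = 1.0
        for k in range(t, n):
            advantages[t] += weight * deltas[k]
            weight *= gamma * lam * non_terminal[k]
            if weight == 0.0:  # episode boundary: later steps contribute nothing
                break
    return advantages


rng = np.random.default_rng(42)
n = 512
rewards = rng.standard_normal(n)
values = rng.standard_normal(n)
dones = (rng.random(n) < 0.05).astype(np.float64)
last_value = float(rng.standard_normal())

expected = gae_reference(rewards, values, dones, last_value, 0.99, 0.95)
actual = gae_bruteforce(rewards, values, dones, last_value, 0.99, 0.95)
# Raises AssertionError if any element differs beyond the tolerances.
np.testing.assert_allclose(actual, expected, rtol=1e-6, atol=1e-10)
```

The absolute tolerance matters for advantages close to zero, where a purely relative comparison would be too strict.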

Results

Raw Timings

Trajectory Length	rlox (median)	NumPy Loop (median)	TorchRL (median)
128 steps	0.7 us	30.4 us	423.3 us
512 steps	1.2 us	135.4 us	1,600.4 us
2,048 steps	3.9 us	557.4 us	6,402.6 us
8,192 steps	16.6 us	2,166.2 us	25,712.9 us
32,768 steps	54.7 us	8,664.0 us	103,448.2 us

Speedup vs NumPy Loop (Python)

Trajectory Length	Speedup	95% CI
128 steps	45.5x	[45.4, 45.7]
512 steps	116.0x	[115.7, 116.2]
2,048 steps	142.3x	[140.7, 148.6]
8,192 steps	130.6x	[119.9, 136.1]
32,768 steps	158.4x	[157.3, 158.6]
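
The source does not say how the 95% intervals were computed. One common choice is a percentile bootstrap over the ratio of medians. The sketch below assumes that approach, uses a hypothetical helper name, and runs on synthetic timings rather than benchmark data.

```python
import numpy as np


def speedup_with_ci(baseline_times, candidate_times, n_boot=2000, seed=0):
    """Ratio of medians plus a percentile-bootstrap 95% interval."""
    rng = np.random.default_rng(seed)
    baseline_times = np.asarray(baseline_times, dtype=np.float64)
    candidate_times = np.asarray(candidate_times, dtype=np.float64)
    point = np.median(baseline_times) / np.median(candidate_times)
    ratios = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each set of timings with replacement, then recompute the ratio.
        b = rng.choice(baseline_times, size=baseline_times.size, replace=True)
        c = rng.choice(candidate_times, size=candidate_times.size, replace=True)
        ratios[i] = np.median(b) / np.median(c)
    lo, hi = np.percentile(ratios, [2.5, 97.5])
    return float(point), float(lo), float(hi)


# Synthetic timings in microseconds, for illustration only (not benchmark data).
rng = np.random.default_rng(1)
baseline = 500.0 + rng.normal(0.0, 5.0, size=200)
candidate = 4.0 + rng.normal(0.0, 0.05, size=200)
point, lo, hi = speedup_with_ci(baseline, candidate)
print(f"speedup {point:.1f}x, 95% CI [{lo:.1f}, {hi:.1f}]")
```

Medians are used rather than means so that occasional slow runs (for example, from scheduler noise) do not distort the ratio.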

Speedup vs TorchRL

Trajectory Length	Speedup	95% CI
128 steps	634.7x	[632.1, 636.4]
512 steps	1,371.4x	[1366.9, 1375.5]
2,048 steps	1,634.6x	[1615.6, 1707.4]
8,192 steps	1,550.6x	[1422.9, 1617.6]
32,768 steps	1,890.9x	[1876.8, 1898.7]

Analysis

Why rlox is 140x faster than Python loops

The GAE backward scan is:

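next_value, last_gae = last_value, 0.0
for t in reversed(range(n)):
    non_terminal = 1.0 - dones[t]  # assumes dones[t] flags termination at step t
    delta = rewards[t] + gamma * next_value * non_terminal - values[t]
    last_gae = delta + gamma * lam * non_terminal * last_gae
    advantages[t] = last_gae
    next_value = values[t]  # becomes V(t+1) for the next (earlier) step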

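In Python, each iteration pays for:

- Python bytecode dispatch (~50ns)
- Float object creation for intermediate results
- Array indexing overhead (bounds checking, type dispatch)

In Rust, the same loop compiles to ~5 instructions with no allocation. At 2,048 steps, bytecode dispatch alone accounts for 2,048 × ~50ns ≈ 100us. The measured gap is larger (557.4us for the NumPy loop vs 3.9us for rlox, or about 270ns of Python overhead per step) because object creation and array indexing add to the dispatch cost.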
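
The per-step cost of the Python loop can be measured directly. This is a sketch, assuming dones[t] flags termination at step t. The helper names and the repeat count are illustrative, and absolute numbers will vary by machine.

```python
import time

import numpy as np


def gae_numpy_loop(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Pure-Python backward scan over NumPy arrays (the slow path being measured)."""
    n = len(rewards)
    advantages = np.zeros(n)
    next_value, last_gae = last_value, 0.0
    for t in reversed(range(n)):
        non_terminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * non_terminal - values[t]
        last_gae = delta + gamma * lam * non_terminal * last_gae
        advantages[t] = last_gae
        next_value = values[t]
    return advantages


def median_time_us(fn, *args, repeats=50):
    """Median wall-clock time of fn(*args) in microseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e6)
    return float(np.median(samples))


n = 2048
rng = np.random.default_rng(42)
rewards = rng.standard_normal(n)
values = rng.standard_normal(n)
dones = (rng.random(n) < 0.05).astype(np.float64)

total_us = median_time_us(gae_numpy_loop, rewards, values, dones, 0.0)
print(f"{total_us:.1f} us total, {total_us / n * 1e3:.0f} ns per step")
```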

Why TorchRL is 1635x slower

TorchRL's generalized_advantage_estimate operates on TensorDict objects. Each step in the computation involves:

- TensorDict metadata validation
- PyTorch tensor operation dispatch (even for scalar ops)
- TensorDict key lookups

This per-element overhead dominates the actual arithmetic. TorchRL's GAE is designed for composability within its TensorDict ecosystem, not for raw computation speed.

Scaling behavior

The rlox vs NumPy speedup grows from 46x (128 steps) to 158x (32,768 steps), though not monotonically (it dips to 131x at 8,192 steps), because:

- At small T, rlox's fixed PyO3 boundary-crossing overhead is a larger fraction of total time
- At large T, both converge to their steady-state per-step cost, and the ratio settles roughly between 130x and 160x
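
The amortization argument can be checked against the reported medians by dividing each timing by its trajectory length. The sketch below does only that arithmetic on the values from the Raw Timings table. Those values are rounded, so the results are approximate.

```python
# Median timings in microseconds, copied from the Raw Timings table.
steps = [128, 512, 2048, 8192, 32768]
rlox_us = [0.7, 1.2, 3.9, 16.6, 54.7]
numpy_us = [30.4, 135.4, 557.4, 2166.2, 8664.0]

# Convert each total to nanoseconds per step.
rlox_ns_per_step = [t * 1e3 / n for t, n in zip(rlox_us, steps)]
numpy_ns_per_step = [t * 1e3 / n for t, n in zip(numpy_us, steps)]

for n, r, p in zip(steps, rlox_ns_per_step, numpy_ns_per_step):
    print(f"{n:>6} steps: rlox {r:5.2f} ns/step, NumPy loop {p:6.1f} ns/step")
```

On these figures the rlox per-step cost falls from about 5.5 ns at 128 steps to about 1.7 ns at 32,768 steps, while the NumPy loop stays between roughly 240 and 270 ns per step. That pattern is consistent with a fixed call overhead being amortized on the rlox side.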

Source: benchmarks/bench_gae.py