# Benchmark: Buffer Operations
Replay buffer push throughput and sample latency — fundamental operations in every RL training loop.
## What is Measured

### Push Throughput

Insert 10,000 transitions into a buffer, one at a time (per-transition push, the common RL pattern).
| Framework | API | Buffer Type |
|---|---|---|
| rlox `ExperienceTable` | `table.push(obs, action, reward, terminated, truncated)` | Flat `Vec<f32>` columnar store |
| TorchRL `ReplayBuffer` | `rb.add(td)` where `td` is a `TensorDict` | `LazyTensorStorage` |
| SB3 `ReplayBuffer` | `buf.add(obs, next_obs, action, reward, done, infos)` | Pre-allocated NumPy arrays |
Two observation sizes are tested:

- `obs_dim=4` (CartPole) — overhead-dominated; stress-tests per-call cost
- `obs_dim=28,224` (Atari 84×84×4) — memcpy-dominated; tests memory bandwidth
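To make the measurement concrete, here is an illustrative sketch of the push-throughput pattern (hedged: this is not the actual `bench_buffer_ops.py` harness; it mimics the SB3-style pre-allocated-array storage with plain NumPy so it runs standalone):

```python
import time

import numpy as np


def push_throughput(obs_dim: int, n: int = 10_000) -> float:
    """Time n single-transition pushes into SB3-style pre-allocated arrays."""
    # Pre-allocated storage, one slot per transition (the SB3 pattern).
    obs = np.zeros((n, obs_dim), dtype=np.float32)
    actions = np.zeros(n, dtype=np.int64)
    rewards = np.zeros(n, dtype=np.float32)
    dones = np.zeros(n, dtype=bool)

    o = np.ones(obs_dim, dtype=np.float32)
    start = time.perf_counter()
    for i in range(n):
        obs[i] = o          # per-row copy: the memcpy that dominates at large obs_dim
        actions[i] = 0
        rewards[i] = 1.0
        dones[i] = False
    elapsed = time.perf_counter() - start
    return n / elapsed      # transitions per second


print(f"obs_dim=4: {push_throughput(4):,.0f} trans/s")
```

At `obs_dim=4` the loop overhead dominates; at `obs_dim=28,224` the per-row copy does, which is exactly the regime split the two test sizes probe.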
### Sample Latency

Draw a random batch from a full buffer (100,000 transitions, `obs_dim=4`).
| Framework | API |
|---|---|
| rlox `ReplayBuffer` | `buf.sample(batch_size, seed)` — ChaCha8 RNG, contiguous memory |
| TorchRL `ReplayBuffer` | `rb.sample()` — torch-based indexing |
| SB3 `ReplayBuffer` | `buf.sample(batch_size)` — `np.random.choice` + row indexing |
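The SB3-style sampling path can be sketched in a few lines (hedged: an illustrative NumPy-only stand-in for the benchmark, not the actual harness), including the median/p99 measurement:

```python
import time

import numpy as np


def sample_latency(capacity: int = 100_000, obs_dim: int = 4,
                   batch_size: int = 256, reps: int = 1_000):
    """Median/p99 latency of SB3-style sampling: np.random.choice + row indexing."""
    obs = np.random.rand(capacity, obs_dim).astype(np.float32)  # full buffer
    times = np.empty(reps)
    for r in range(reps):
        t0 = time.perf_counter()
        idx = np.random.choice(capacity, size=batch_size)  # random indices
        batch = obs[idx]                                   # fancy-index row gather
        times[r] = time.perf_counter() - t0
    return np.percentile(times, 50), np.percentile(times, 99)


median, p99 = sample_latency()
print(f"median={median * 1e6:.1f} µs  p99={p99 * 1e6:.1f} µs")
```

The fancy-index gather `obs[idx]` allocates a fresh batch array on every call, which is one source of the tail-latency spread discussed in the analysis below.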
## Results

### Push Throughput

Total wall-clock time for 10,000 single-transition pushes:
| obs_dim | rlox | TorchRL | SB3 | rlox throughput |
|---|---|---|---|---|
| 4 | 3.59 ms | 218.67 ms | 14.24 ms | 2,782,205 trans/s |
| 28,224 | 136.01 ms | 248.69 ms | 109.19 ms | 73,522 trans/s |
### Push Speedup
| obs_dim | vs TorchRL | 95% CI | vs SB3 | 95% CI |
|---|---|---|---|---|
| 4 | 60.8x | [59.6, 62.3] | 4.0x | [3.9, 4.0] |
| 28,224 | 1.8x | [1.7, 2.0] | 0.8x | [0.7, 0.9] |
### Sample Latency
| Batch Size | rlox (median) | rlox (p99) | TorchRL (median) | TorchRL (p99) | SB3 (median) | SB3 (p99) |
|---|---|---|---|---|---|---|
| 32 | 1.5 µs | 1.8 µs | 17.2 µs | 24.0 µs | 17.2 µs | 24.8 µs |
| 64 | 1.5 µs | 2.1 µs | 20.1 µs | 26.4 µs | 20.6 µs | 35.0 µs |
| 256 | 4.2 µs | 6.2 µs | 22.2 µs | 36.3 µs | 29.3 µs | 39.4 µs |
| 1,024 | 10.1 µs | 17.0 µs | 65.0 µs | 109.4 µs | 61.0 µs | 75.0 µs |
### Sample Speedup
| Batch Size | vs TorchRL | 95% CI | vs SB3 | 95% CI |
|---|---|---|---|---|
| 32 | 11.2x | [10.9, 11.6] | 11.2x | [10.8, 11.6] |
| 64 | 13.4x | [12.9, 13.6] | 13.8x | [13.2, 14.0] |
| 256 | 5.3x | [5.0, 5.6] | 7.0x | [6.7, 7.4] |
| 1,024 | 6.5x | [6.0, 6.8] | 6.1x | [5.8, 6.3] |
## Analysis

### Push: Why 61x faster than TorchRL for small observations
Each TorchRL `rb.add(td)` call:
1. Validates the TensorDict schema
2. Converts/copies tensor data into the storage backend
3. Updates internal metadata (indices, counters)
Each rlox `table.push()` call crosses the PyO3 boundary once, copies the observation via `extend_from_slice` (a single memcpy), and increments a counter. No Python object creation, no schema validation.
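The columnar layout described above can be sketched as a Python analogue (hedged: `ColumnarTable` is a hypothetical illustration of the flat per-column layout, not rlox's actual Rust implementation):

```python
from array import array


class ColumnarTable:
    """Python analogue of a flat columnar transition store.

    Assumption: mirrors the layout described in the text; the real rlox
    store is Rust, with Vec<f32> columns and extend_from_slice pushes.
    """

    def __init__(self, obs_dim: int):
        self.obs_dim = obs_dim
        self.obs = array("f")        # one flat f32 column, like Vec<f32>
        self.rewards = array("f")
        self.terminated = array("b")
        self.len = 0

    def push(self, obs, reward, terminated):
        # One contiguous append per column; in Rust this is a single
        # memcpy (extend_from_slice) plus a counter increment.
        self.obs.extend(obs)
        self.rewards.append(reward)
        self.terminated.append(terminated)
        self.len += 1


t = ColumnarTable(obs_dim=4)
t.push([0.1, 0.2, 0.3, 0.4], reward=1.0, terminated=False)
print(t.len, len(t.obs))  # 1 4
```

The key property is that no per-transition Python objects or per-field validation exist on the hot path: a push is a handful of contiguous appends.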
### Push: Why SB3 wins at Atari-sized observations
At `obs_dim=28,224`, each push copies ~110 KB of data (28,224 × 4 bytes). The memcpy dominates total time, and both rlox and SB3 hit the same memory-bandwidth ceiling. SB3's pre-allocated NumPy arrays avoid reallocation, while rlox's `Vec<f32>` may trigger occasional reallocations during growth.
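A quick back-of-envelope check of the numbers above (the bandwidth figure is derived here from the table, not reported by the benchmark):

```python
obs_dim = 28_224                  # 84 * 84 * 4 stacked frames
bytes_per_push = obs_dim * 4      # f32 = 4 bytes per element
print(bytes_per_push)             # 112896 bytes, ~110 KiB per transition

# At the measured ~73,522 trans/s, the sustained copy rate works out to:
bandwidth_gb_s = bytes_per_push * 73_522 / 1e9
print(f"{bandwidth_gb_s:.1f} GB/s")  # 8.3 GB/s
```

A sustained rate in the GB/s range is consistent with a single-threaded memcpy being the bottleneck rather than per-call overhead.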
### Sample: Predictable tail latency
rlox's p99 latency is remarkably low (17.0 µs at batch=1,024 vs 75–109 µs for TorchRL/SB3). This comes from:

- Pre-allocated ring buffer — no heap allocation during sampling
- ChaCha8 RNG — deterministic, cache-friendly random number generation
- Contiguous memory layout — sequential reads from flat arrays, no pointer chasing
Source: `benchmarks/bench_buffer_ops.py`