
Benchmark: Buffer Operations

Replay buffer push throughput and sample latency — fundamental operations in every RL training loop.

What is Measured

Push Throughput

Insert 10,000 transitions into a buffer, one at a time (per-transition push, the common RL pattern).

| Framework | API | Buffer Type |
|---|---|---|
| rlox ExperienceTable | table.push(obs, action, reward, terminated, truncated) | Flat Vec<f32> columnar store |
| TorchRL ReplayBuffer | rb.add(td), where td is a TensorDict | LazyTensorStorage |
| SB3 ReplayBuffer | buf.add(obs, next_obs, action, reward, done, infos) | Pre-allocated NumPy arrays |

Two observation sizes are tested:

- obs_dim=4 (CartPole): overhead-dominated, stress-tests per-call cost
- obs_dim=28,224 (Atari 84×84×4): memcpy-dominated, tests memory bandwidth
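The per-transition push pattern described above can be sketched as a small timing harness. This is only an illustration: `PreallocatedBuffer` and `time_pushes` are hypothetical names, the buffer is a plain NumPy stand-in rather than any of the three benchmarked implementations, and the real harness in benchmarks/bench_buffer_ops.py may be structured differently.

```python
import time

import numpy as np


class PreallocatedBuffer:
    """Hypothetical stand-in: fixed-capacity columnar store backed by NumPy arrays."""

    def __init__(self, capacity, obs_dim):
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.terminated = np.zeros(capacity, dtype=np.bool_)
        self.truncated = np.zeros(capacity, dtype=np.bool_)
        self.capacity = capacity
        self.pos = 0   # next write index
        self.size = 0  # number of valid rows

    def push(self, obs, action, reward, terminated, truncated):
        i = self.pos
        self.obs[i] = obs  # one row copy per call
        self.actions[i] = action
        self.rewards[i] = reward
        self.terminated[i] = terminated
        self.truncated[i] = truncated
        self.pos = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)


def time_pushes(obs_dim, n_transitions=10_000):
    """Time n_transitions single-row pushes; returns (elapsed seconds, buffer)."""
    buf = PreallocatedBuffer(n_transitions, obs_dim)
    obs = np.random.default_rng(0).standard_normal(obs_dim).astype(np.float32)
    start = time.perf_counter()
    for _ in range(n_transitions):
        buf.push(obs, 0, 1.0, False, False)
    elapsed = time.perf_counter() - start
    return elapsed, buf
```

Calling `time_pushes(4)` and `time_pushes(28_224)` would mirror the two observation sizes. Note that the larger case fills roughly 1.1 GB of observation storage at the default 10,000 transitions.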

Sample Latency

Draw a random batch from a full buffer (100,000 transitions, obs_dim=4).

| Framework | API |
|---|---|
| rlox ReplayBuffer | buf.sample(batch_size, seed): ChaCha8 RNG, contiguous memory |
| TorchRL ReplayBuffer | rb.sample(): torch-based indexing |
| SB3 ReplayBuffer | buf.sample(batch_size): np.random.choice + row indexing |
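A minimal sketch of how median and p99 sample latency could be collected, assuming a NumPy array stands in for the buffer storage. The function name `sample_latency_us` and the trial count are illustrative choices, not taken from the benchmark script, and indices are drawn with replacement via `rng.integers`.

```python
import time

import numpy as np


def sample_latency_us(obs, batch_size, n_trials=1_000, seed=0):
    """Return (median, p99) latency in microseconds for random row sampling."""
    rng = np.random.default_rng(seed)
    n = obs.shape[0]
    latencies = np.empty(n_trials)
    for t in range(n_trials):
        start = time.perf_counter_ns()
        idx = rng.integers(0, n, size=batch_size)  # uniform, with replacement
        batch = obs[idx]                           # row gather (allocates the batch)
        latencies[t] = (time.perf_counter_ns() - start) / 1_000.0
    return float(np.median(latencies)), float(np.percentile(latencies, 99))
```

For the setup described above, `obs` would be a full array of shape (100_000, 4), and the function would be called once per batch size.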

Results

Push Throughput

| obs_dim | rlox | TorchRL | SB3 | rlox throughput |
|---|---|---|---|---|
| 4 | 3.59 ms | 218.67 ms | 14.24 ms | 2,782,205 trans/s |
| 28,224 | 136.01 ms | 248.69 ms | 109.19 ms | 73,522 trans/s |

Push Speedup

| obs_dim | vs TorchRL | 95% CI | vs SB3 | 95% CI |
|---|---|---|---|---|
| 4 | 60.8x | [59.6, 62.3] | 4.0x | [3.9, 4.0] |
| 28,224 | 1.8x | [1.7, 2.0] | 0.8x | [0.7, 0.9] |
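This page does not state how the 95% intervals were computed. One common choice, shown here purely as an assumption, is a percentile bootstrap over the ratio of median timings. The function name `speedup_with_ci` and its parameters are hypothetical.

```python
import numpy as np


def speedup_with_ci(baseline_times, candidate_times, n_boot=10_000, seed=0):
    """Ratio of medians (baseline / candidate) with a 95% percentile-bootstrap CI."""
    rng = np.random.default_rng(seed)
    baseline = np.asarray(baseline_times, dtype=float)
    candidate = np.asarray(candidate_times, dtype=float)
    point = np.median(baseline) / np.median(candidate)

    # Resample each set of timings independently, with replacement.
    b_idx = rng.integers(0, baseline.size, size=(n_boot, baseline.size))
    c_idx = rng.integers(0, candidate.size, size=(n_boot, candidate.size))
    ratios = np.median(baseline[b_idx], axis=1) / np.median(candidate[c_idx], axis=1)

    lo, hi = np.percentile(ratios, [2.5, 97.5])
    return float(point), float(lo), float(hi)
```

A speedup above 1.0 means the candidate is faster than the baseline. An interval that lies entirely below 1.0, as in the SB3 column at obs_dim=28,224, means the candidate is slower.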

Sample Latency

| Batch Size | rlox (median) | rlox (p99) | TorchRL (median) | TorchRL (p99) | SB3 (median) | SB3 (p99) |
|---|---|---|---|---|---|---|
| 32 | 1.5 us | 1.8 us | 17.2 us | 24.0 us | 17.2 us | 24.8 us |
| 64 | 1.5 us | 2.1 us | 20.1 us | 26.4 us | 20.6 us | 35.0 us |
| 256 | 4.2 us | 6.2 us | 22.2 us | 36.3 us | 29.3 us | 39.4 us |
| 1,024 | 10.1 us | 17.0 us | 65.0 us | 109.4 us | 61.0 us | 75.0 us |

Sample Speedup

| Batch Size | vs TorchRL | 95% CI | vs SB3 | 95% CI |
|---|---|---|---|---|
| 32 | 11.2x | [10.9, 11.6] | 11.2x | [10.8, 11.6] |
| 64 | 13.4x | [12.9, 13.6] | 13.8x | [13.2, 14.0] |
| 256 | 5.3x | [5.0, 5.6] | 7.0x | [6.7, 7.4] |
| 1,024 | 6.5x | [6.0, 6.8] | 6.1x | [5.8, 6.3] |

Analysis

Push: Why 61x faster than TorchRL for small observations

Each TorchRL rb.add(td) call:

1. Validates the TensorDict schema
2. Converts/copies tensor data into the storage backend
3. Updates internal metadata (indices, counters)

Each rlox table.push() call crosses the PyO3 boundary once, copies the observation via extend_from_slice (a single memcpy), and increments a counter. No Python object creation, no schema validation.

Push: Why SB3 wins at Atari-sized observations

At obs_dim=28,224, each push copies 112,896 bytes of observation data (28,224 × 4 bytes, about 110 KiB). The memcpy dominates total time, so per-call overhead no longer separates the two and both rlox and SB3 approach the same memory bandwidth ceiling. SB3's pre-allocated NumPy arrays avoid reallocation, while rlox's Vec<f32> may trigger occasional reallocations during growth.
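As a rough sanity check on the memory-bandwidth argument, the measured times can be converted into an effective copy rate. The sketch below counts only the observation bytes written once per push. It ignores the other fields and any additional copies a framework makes, so it is a lower bound on actual memory traffic. The helper name is hypothetical.

```python
OBS_DIM = 28_224
BYTES_PER_F32 = 4
N_PUSHES = 10_000

bytes_per_push = OBS_DIM * BYTES_PER_F32  # 112,896 bytes
total_bytes = bytes_per_push * N_PUSHES   # 1,128,960,000 bytes


def effective_bandwidth_gb_s(total_bytes, elapsed_ms):
    """Observation bytes copied per second, in GB/s (1 GB = 1e9 bytes)."""
    return total_bytes / (elapsed_ms / 1_000.0) / 1e9


# Timings taken from the Push Throughput table for obs_dim=28,224.
rlox_gb_s = effective_bandwidth_gb_s(total_bytes, 136.01)
sb3_gb_s = effective_bandwidth_gb_s(total_bytes, 109.19)

print(f"rlox: {rlox_gb_s:.2f} GB/s, SB3: {sb3_gb_s:.2f} GB/s")
```

This works out to roughly 8.3 GB/s for rlox and 10.3 GB/s for SB3. Both are in a range where memory throughput, rather than call overhead, is a plausible limiting factor.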

Sample: Predictable tail latency

rlox's p99 latency stays low as the batch grows: 17.0 us at batch size 1,024, versus 75.0 us for SB3 and 109.4 us for TorchRL. This comes from:

- Pre-allocated ring buffer: no heap allocation during sampling
- ChaCha8 RNG: deterministic, cache-friendly random number generation
- Contiguous memory layout: sequential reads from flat arrays, no pointer chasing
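The ring-buffer layout and seeded sampling can be illustrated with a small NumPy sketch. This is not rlox's implementation: the class `RingBuffer` is hypothetical, NumPy's default generator (PCG64) stands in for ChaCha8, and NumPy fancy indexing allocates the returned batch, so the sketch shows only the fixed-capacity layout and seed determinism, not the allocation-free sampling path.

```python
import numpy as np


class RingBuffer:
    """Hypothetical fixed-capacity buffer over flat, contiguous arrays."""

    def __init__(self, capacity, obs_dim):
        # All storage is allocated once, up front.
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.capacity = capacity
        self.pos = 0   # next write index
        self.size = 0  # number of valid rows

    def push(self, obs, reward):
        self.obs[self.pos] = obs
        self.rewards[self.pos] = reward
        self.pos = (self.pos + 1) % self.capacity  # wrap, overwriting oldest
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, seed):
        if self.size == 0:
            raise ValueError("cannot sample from an empty buffer")
        rng = np.random.default_rng(seed)  # same seed gives the same indices
        idx = rng.integers(0, self.size, size=batch_size)
        return self.obs[idx], self.rewards[idx]
```

Because the generator is re-created from `seed` on every call, two calls with the same seed and buffer contents return identical batches, which makes sampling reproducible across runs.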

Source: benchmarks/bench_buffer_ops.py