# rlox Convergence Benchmark Profiling Report

**Date:** 2026-03-28
**Platform:** macOS Darwin 25.2.0 (Apple Silicon), Python 3.12, PyTorch 2.10.0
**Build:** Release (Rust extension compiled via maturin)
## 1. End-to-End Algorithm Timing

### PPO on CartPole-v1 (50,000 timesteps, 8 envs)
| Metric | Value |
|---|---|
| Total wall time | 2.66 s |
| Init time | 0.43 s |
| Training time | 2.23 s |
| Steps per second (SPS) | 18,796 |
Phase breakdown (`TimingCallback`):
| Phase | Time (s) | Share |
|---|---|---|
| env_step (collect) | 1.479 | 66.4% |
| gradient_update (SGD) | 0.674 | 30.2% |
| gae_compute | 0.076 | 3.4% |
| Total tracked | 2.229 | 100% |
### SAC on Pendulum-v1 (20,000 timesteps)
| Metric | Value |
|---|---|
| Total wall time | 94.42 s |
| Init time | 0.29 s |
| SPS | 212 |
| Learning starts | 1,000 steps (random exploration) |
| Effective training steps | 19,000 (with gradient updates) |
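The reported throughput figures can be reproduced from the tables above; SPS here appears to be total timesteps divided by total wall time (init plus training), a small sanity-check worth making explicit:

```python
# Sanity-check the reported throughput: SPS = total timesteps / total wall time.
def steps_per_second(timesteps: int, wall_time_s: float) -> float:
    return timesteps / wall_time_s

ppo_sps = steps_per_second(50_000, 2.66)   # ~18,797, matching the reported 18,796
sac_sps = steps_per_second(20_000, 94.42)  # ~212, matching the reported 212

print(f"PPO: {ppo_sps:,.0f} SPS, SAC: {sac_sps:,.0f} SPS")
```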
## 2. Component-Level Micro-Benchmarks

### Rust Primitives (all times in microseconds)
| Operation | Latency |
|---|---|
| VecEnv.step_all(64 envs) | 48.4 us |
| ReplayBuffer.push(single) | 0.1 us |
| ReplayBuffer.push_batch(8) | 0.3 us |
| ReplayBuffer.sample(256) | 3.7 us |
| compute_gae_batched(8x128) | 34.7 us |
| compute_batch_token_kl_schulman_f32(32x2048) | 60.6 us |
Verdict: All Rust primitives are extremely fast. The Rust data-plane is not a bottleneck.
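The actual benchmark harness is not shown in this report; a minimal sketch of the kind of warmup-then-average loop that produces numbers like these (the `bench` helper below is hypothetical) looks like:

```python
import time

def bench(fn, *args, warmup=100, iters=10_000):
    """Return the mean latency of fn(*args) in microseconds."""
    for _ in range(warmup):              # warm caches and any lazy-init paths
        fn(*args)
    t0 = time.perf_counter_ns()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter_ns() - t0) / iters / 1_000  # ns -> us

# e.g. bench(buffer.sample, 256) would produce the ReplayBuffer.sample(256) row
```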
### PyTorch Operations (PPO, batch=8 envs)
| Operation | Per-step latency |
|---|---|
| policy.get_action_and_logprob (8 obs) | 123 us |
| policy.get_value (8 obs) | 19 us |
| VecEnv.step_all (8 envs) | 96 us |
| Tensor conversion (numpy<->torch) | 2 us |
### PyTorch Operations (SAC, single env)
| Operation | Latency |
|---|---|
| Gym env.step (Pendulum) | 15.2 us |
| actor.sample (single obs) | 49.8 us |
| ReplayBuffer.push | 0.5 us |
| ReplayBuffer.sample(256) | 4.0 us |
| Full _update (1 gradient step) | 4,790 us |
SAC `_update()` internal breakdown:
| Component | Latency | Share |
|---|---|---|
| Actor fwd+bwd+step | 2,070 us | 44.0% |
| Critic fwd+bwd+step (both) | 1,683 us | 35.8% |
| Target Q (no_grad) | 709 us | 15.1% |
| Polyak soft update | 197 us | 4.2% |
| Alpha update | 39 us | 0.8% |
| Sample + tensor convert | 8 us | 0.2% |
| Total | 4,708 us | 100% |
## 3. FFI Crossing Analysis

### PPO: Per Rollout (8 envs x 128 steps = 1,024 timesteps)
| FFI Call | Count | Notes |
|---|---|---|
| VecEnv.step_all | 128 | 1 per step |
| policy.get_action_and_logprob | 128 | PyTorch, not FFI per se |
| policy.get_value | 129 | +1 for bootstrap |
| compute_gae_batched | 1 | Single batched call |
| Total Rust FFI calls | 129 | step_all + GAE |
| Total Python<->Rust boundary crossings | ~129 | Negligible overhead at ~2 us each |
PPO's design is already efficient: environment stepping is batched (all 8 envs in one step_all call), and GAE is a single batched FFI call.
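To quantify "negligible": at roughly 2 us per crossing, the total FFI cost per rollout is well under one percent of the ~62.7 ms rollout time reported in Section 4:

```python
crossings = 128 + 1            # step_all calls + one batched GAE call
overhead_us = crossings * 2    # ~2 us per Python<->Rust boundary crossing
rollout_ms = 62.7              # total per-rollout time from Section 4

share = overhead_us / 1_000 / rollout_ms
print(f"FFI overhead: {overhead_us} us (~{share:.1%} of the rollout)")
# FFI overhead: 258 us (~0.4% of the rollout)
```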
### SAC: Per Training Step (after `learning_starts`)
| FFI Call | Count | Notes |
|---|---|---|
| gym.env.step | 1 | Python Gymnasium, not Rust |
| ReplayBuffer.push | 1 | ~0.5 us |
| ReplayBuffer.sample | 1 | ~4 us |
| Total Rust FFI calls | 2 | push + sample |
SAC's bottleneck is purely PyTorch gradient computation, not FFI.
## 4. Time Spent: Rust vs Python vs PyTorch

### PPO (per rollout of 1,024 timesteps)
| Layer | Time (ms) | Share |
|---|---|---|
| Rust (VecEnv.step_all + GAE) | 12.3 | 19.6% |
| PyTorch (policy forward, no grad) | 18.2 | 29.0% |
| PyTorch (SGD: forward+backward+step) | 28.5 | 45.5% |
| Python (tensor conversions, bookkeeping) | 3.7 | 5.9% |
| Total | 62.7 | 100% |
### SAC (per training step)
| Layer | Time (us) | Share |
|---|---|---|
| Rust (buffer push + sample) | 4.5 | 0.1% |
| Python Gymnasium (env.step) | 15.2 | 0.3% |
| PyTorch inference (actor.sample, no grad) | 49.8 | 1.0% |
| PyTorch training (_update: fwd+bwd+step) | 4,790 | 98.6% |
| Total | ~4,860 | 100% |
## 5. Bottleneck Analysis

### PPO Bottlenecks
- **Policy forward passes during collection (29%)** -- The collector calls `get_action_and_logprob` and `get_value` separately in a Python loop, 128 times per rollout. Each call processes only 8 observations (the env batch). The actor forward takes 123 us per call, which is dominated by PyTorch dispatch overhead on such tiny tensors (8x4 input through a 64-hidden MLP).
- **SGD phase (45.5%)** -- 4 epochs x 2 minibatches = 8 gradient steps per rollout. Each minibatch (512 samples) takes ~3.56 ms. This is healthy PyTorch compute and hard to improve without algorithmic changes.
- **VecEnv stepping is already fast** -- 96 us per `step_all(8)` is good, but throughput could improve further by collecting multiple steps per FFI call.
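The first bottleneck motivates fusing the two forward passes. Since the report doesn't show rlox's policy class, here is a hypothetical shared-trunk sketch (all names are illustrative) that returns action, log-prob, and value from a single trunk forward, rather than two separate calls:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class SharedTrunkPolicy(nn.Module):
    """Hypothetical fused actor-critic: one trunk forward instead of two."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.pi_head = nn.Linear(hidden, n_actions)
        self.v_head = nn.Linear(hidden, 1)

    @torch.no_grad()
    def act(self, obs):
        h = self.trunk(obs)                        # shared features, computed once
        dist = Categorical(logits=self.pi_head(h))
        action = dist.sample()
        return action, dist.log_prob(action), self.v_head(h).squeeze(-1)

policy = SharedTrunkPolicy()
action, logprob, value = policy.act(torch.randn(8, 4))  # one pass for the 8-env batch
```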
### SAC Bottlenecks
- **PyTorch gradient computation (98.6%)** -- The `_update` function dominates. Within it, the actor backward pass (44%) and the critic backward passes (36%) are the main costs. This is expected for an off-policy algorithm doing 3 backward passes per step.
- **Single-env stepping** -- SAC uses a single Gymnasium (Python) environment. This is not a bottleneck because the gradient step is 300x slower than the env step, but vectorizing SAC data collection could help with future scaling.
- **Polyak update (4.2% of `_update`)** -- Iterates over all parameter pairs in Python. Could be batched with `torch._foreach_lerp_` or moved to a fused Rust kernel.
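A minimal sketch of the batched Polyak update, assuming the standard formulation `target <- (1 - tau) * target + tau * source` (note `torch._foreach_lerp_` is a private API whose signature may shift between PyTorch versions):

```python
import torch

def polyak_update(target_params, source_params, tau=0.005):
    # target <- (1 - tau) * target + tau * source, fused across all parameter
    # tensors in one call instead of a Python loop over (source, target) pairs.
    with torch.no_grad():
        torch._foreach_lerp_(target_params, source_params, tau)

# Toy check with one parameter tensor:
target = [torch.zeros(3)]
source = [torch.ones(3)]
polyak_update(target, source, tau=0.005)  # each target entry becomes 0.005
```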
## 6. Optimization Recommendations

### High Impact
| # | Optimization | Target | Estimated Impact |
|---|---|---|---|
| 1 | Fuse actor+critic forward in collector | PPO collect | Combine get_action_and_logprob + get_value into a single shared-trunk forward pass. Saves ~19 us/step (half the value forward). ~10% wall-time reduction on PPO. |
| 2 | torch.compile the PPO policy | PPO collect + SGD | Apply torch.compile(policy, mode='reduce-overhead'). Eliminates PyTorch dispatch overhead on the small MLPs. On 8-obs batches, dispatch overhead dominates compute. ~15-25% wall-time reduction on PPO. |
| 3 | torch.compile SAC networks | SAC _update | Compile actor, critic1, critic2 with torch.compile. The 256-batch forward/backward passes will benefit from kernel fusion. ~20-30% wall-time reduction on SAC (estimated 40-80 SPS improvement). |
| 4 | Batch multi-step collection in Rust | PPO collect | Instead of calling step_all 128 times from Python, add a collect_n_steps(n) method to VecEnv that returns all 128 steps' data in one FFI call. Eliminates 127 Python<->Rust round-trips and the per-step Python loop overhead. ~5-10% wall-time reduction on PPO. |
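Applying recommendations 2 and 3 is a small code change. A sketch (the network here is a stand-in for rlox's actual policy), with an eager fallback since the inductor backend is not available on every platform:

```python
import torch
import torch.nn as nn

# Stand-in for the PPO policy MLP (4 obs -> 64 hidden -> 2 actions).
net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))

# 'reduce-overhead' targets exactly the per-call dispatch cost that
# dominates on tiny 8-obs batches. Fall back to eager if compilation
# is unsupported on this platform.
try:
    compiled = torch.compile(net, mode="reduce-overhead")
    out = compiled(torch.randn(8, 4))
except Exception:
    out = net(torch.randn(8, 4))
```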
### Medium Impact
| # | Optimization | Target | Estimated Impact |
|---|---|---|---|
| 5 | Use `torch._foreach` ops for Polyak update | SAC _update | Replace the `for sp, tp in zip(...)` loop with `torch._foreach_lerp_`. Saves Python loop overhead across ~10 parameter tensors. ~3% of SAC `_update` time. |
| 6 | Vectorize SAC data collection | SAC env step | Run N parallel envs and batch-insert into the replay buffer. Would improve SPS at high timestep counts by amortizing the single-env overhead. Marginal at 20k steps, significant at 1M+. |
| 7 | Eliminate `.tolist()` in discrete action conversion | PPO collect | The collector calls `.astype(np.uint32).tolist()` to convert actions for VecEnv. VecEnv already accepts numpy arrays. Remove the `.tolist()` call. <1% improvement (0.5 us/step), but free. |
| 8 | Pre-allocate rollout tensors | PPO collect | The collector builds lists of tensors and calls torch.stack at the end. Pre-allocating a (n_steps, n_envs, ...) tensor and writing into it avoids N small allocations + one stack op. ~2-3% of collect time. |
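Recommendation 8 can be sketched as follows, with hypothetical dimensions matching the PPO configuration (128 steps, 8 envs, 4-dim observations):

```python
import torch

n_steps, n_envs, obs_dim = 128, 8, 4

# Current pattern: N small allocations, plus a copy in torch.stack at the end.
obs_list = [torch.randn(n_envs, obs_dim) for _ in range(n_steps)]
stacked = torch.stack(obs_list)

# Pre-allocated pattern: one allocation, each step writes into its slot.
obs_buf = torch.empty(n_steps, n_envs, obs_dim)
for t in range(n_steps):
    obs_buf[t] = torch.randn(n_envs, obs_dim)  # in the collector: that step's obs

assert stacked.shape == obs_buf.shape  # same layout, fewer allocations
```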
### Low Impact / Future Work
| # | Optimization | Target | Notes |
|---|---|---|---|
| 9 | Move PPO SGD inner loop to Rust | PPO SGD | Would eliminate Python overhead in the minibatch loop. Diminishing returns since PyTorch backward is the actual cost. |
| 10 | CUDA/MPS backend | All | Moving to GPU would transform the profile entirely. The tiny-batch overhead that dominates CPU PPO disappears on GPU. |
| 11 | Increase n_envs for PPO | PPO collect | Going from 8 to 32+ envs amortizes per-step PyTorch dispatch overhead across more data. This is a hyperparameter trade-off, not a code change. |
## 7. Summary
The Rust data-plane primitives (VecEnv, ReplayBuffer, GAE, KL divergence) are all extremely fast, typically sub-100-microsecond. They represent less than 1% of SAC wall time and about 20% of PPO wall time (dominated by env stepping, which is genuinely fast).
The primary bottleneck in both algorithms is PyTorch forward/backward computation:

- PPO: 74.5% of time is in PyTorch (29% inference during collection + 45.5% SGD)
- SAC: 99.6% of time is in PyTorch (98.6% gradient updates + 1% inference)
The highest-ROI optimizations are:
1. `torch.compile` on all networks (an estimated 15-30% speedup for minimal code change)
2. Fusing actor+critic forward passes in the PPO collector
3. Batching multi-step collection into a single Rust FFI call
No Rust-side optimizations are needed at this scale. The Rust layer is already 10-100x faster than the PyTorch layer it serves.