Convergence Benchmark Results

27/32 Experiments Complete — v6 Re-benchmark Pending

Benchmark v5 ran on GCP and completed 27 of the 32 planned experiments. Still missing: TD3 Hopper-v4, TD3 Walker2d-v4, SAC Walker2d-v4 (SB3), A2C Acrobot-v1, DQN Acrobot-v1. Six convergence bugs were identified during this run and fixed in v0.3.0/v1.0.0 (see Known Issues below). A v6 re-benchmark will validate these fixes with multi-seed runs and IQM statistics.

Methodology

  • Frameworks: rlox v0.2.3 vs Stable-Baselines3 (v6 will use rlox v1.0.0)
  • Hardware: GCP e2-standard-8 (8 vCPU, 32 GB RAM), CPU-only
  • Environments: CartPole-v1, Pendulum-v1, HalfCheetah-v4, Hopper-v4, Walker2d-v4, Acrobot-v1, MountainCar-v0
  • Algorithms: PPO, SAC, TD3, DQN, A2C
  • Evaluation: 30 episodes every 10K steps, deterministic policy
  • Seeds: Single seed (multi-seed planned for next run)
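
The evaluation protocol above (30 deterministic episodes every 10K steps) can be sketched as follows. This is a minimal illustration, not the benchmark harness itself: `ToyEnv`, `evaluate`, and `eval_points` are hypothetical names, and the environment is a stand-in that only mimics the Gymnasium `reset`/`step` signature.

```python
from statistics import mean


class ToyEnv:
    """Stand-in environment with a Gymnasium-style reset/step signature."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self, seed=None):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        terminated = False
        truncated = self.t >= self.horizon  # time limit, not a true terminal state
        return float(self.t), reward, terminated, truncated, {}


def evaluate(policy, env, n_episodes=30, seed=0):
    """Mean undiscounted return of a deterministic policy over n_episodes."""
    returns = []
    for episode in range(n_episodes):
        obs, _ = env.reset(seed=seed + episode)
        done, total = False, 0.0
        while not done:
            action = policy(obs)  # deterministic: no action sampling
            obs, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return mean(returns)


def eval_points(total_steps, eval_every=10_000):
    """Training-step counts at which an evaluation is triggered."""
    return list(range(eval_every, total_steps + 1, eval_every))


print(evaluate(lambda obs: 1, ToyEnv()))
print(len(eval_points(100_000)))
```

With a 100K-step budget this schedule yields ten evaluation points, each averaging 30 episodes.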

Convergence Results

Algorithm | Environment | Steps | rlox Return | SB3 Return | Winner
PPO | CartPole-v1 | 100K | 453.9 | 420.9 | rlox
PPO | Acrobot-v1 | 500K | -88.5 | -118.1 | rlox
PPO | HalfCheetah-v4 | 1M | 4225.6 | 3142.5 | rlox
PPO | Hopper-v4 | 1M | 628.1 | 3577.5 | SB3
PPO | Walker2d-v4 | 2M | 5007.4 | 4384.3 | rlox
SAC | Pendulum-v1 | 50K | -168.5 | -167.1 | Tie
SAC | HalfCheetah-v4 | 1M | 11468.1 | 10562.7 | rlox
SAC | Hopper-v4 | 1M | 3290.6 | 3170.2 | rlox
SAC | Walker2d-v4 | 2M | 4978.0 | -- | rlox*
TD3 | Pendulum-v1 | 50K | -162.7 | -169.4 | Tie
TD3 | HalfCheetah-v4 | 1M | 10400.4 | 9899.3 | rlox
DQN | CartPole-v1 | 100K | 164.8 | 195.8 | SB3
DQN | MountainCar-v0 | 500K | -178.7 | -109.5 | SB3
A2C | CartPole-v1 | 100K | 53.8 | 500.0 | SB3

* The SB3 experiment for this pair has not completed yet, so the rlox result is uncontested rather than a measured win.

Speed Comparison

Algorithm | Environment | rlox SPS | SB3 SPS | Speedup
PPO | CartPole-v1 | 1,691 | 687 | 2.46x
PPO | Acrobot-v1 | 2,520 | 1,306 | 1.93x
PPO | HalfCheetah-v4 | 800 | 437 | 1.83x
PPO | Hopper-v4 | 1,237 | 770 | 1.61x
PPO | Walker2d-v4 | 931 | 762 | 1.22x
SAC | Pendulum-v1 | 46 | 42 | 1.11x
SAC | HalfCheetah-v4 | 42 | 63 | 0.68x
SAC | Hopper-v4 | 77 | 66 | 1.18x
SAC | Walker2d-v4 | 75 | -- | --
TD3 | Pendulum-v1 | 76 | 65 | 1.17x
TD3 | HalfCheetah-v4 | 117 | 101 | 1.16x
DQN | CartPole-v1 | 462 | 642 | 0.72x
DQN | MountainCar-v0 | 479 | 634 | 0.76x
A2C | CartPole-v1 | 2,028 | 489 | 4.15x
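
For reading the table: SPS is steps per second, and the speedup column is the ratio of rlox SPS to SB3 SPS. The short sketch below recomputes two rows and converts SPS into wall-clock time for a given step budget. The helper names are illustrative. Ratios recomputed from the rounded SPS values can differ from the published column in the last digit, presumably because the published ratios were taken before rounding.

```python
def speedup(rlox_sps, sb3_sps):
    """Ratio of rlox throughput to SB3 throughput (above 1.0 means rlox is faster)."""
    return rlox_sps / sb3_sps


def wall_clock_hours(total_steps, sps):
    """Hours needed to run total_steps at a sustained rate of sps steps per second."""
    return total_steps / sps / 3600


# PPO CartPole-v1 and A2C CartPole-v1 rows from the table.
print(f"{speedup(1691, 687):.2f}x")
print(f"{speedup(2028, 489):.2f}x")

# SAC HalfCheetah-v4: 1M steps at 42 SPS (rlox) and 63 SPS (SB3).
print(f"{wall_clock_hours(1_000_000, 42):.1f} h vs {wall_clock_hours(1_000_000, 63):.1f} h")
```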

Known Issues (Fixed in v0.3.0 / v1.0.0)

All six convergence bugs identified during v5 benchmarking have been fixed. The v6 re-benchmark will validate these fixes.

Bug | Fix (v0.3.0) | Affected Results
Truncation bootstrap missing | V(terminal_obs) bootstrap for truncated episodes | PPO Hopper (628 vs 3577)
Scalar obs normalization | Per-dimension Welford stats via RunningStatsVec | All MuJoCo envs
Raw reward normalization | Return-based std (SB3 convention) | All normalized envs
Train/collect obs mismatch | Consistent normalization via VecNormalize wrapper | All normalized envs
A2C advantage normalization | Default changed to False for small batches | A2C CartPole (54 vs 500)
log_std init = -0.5 | Changed to 0.0 (std=1.0, matching SB3) | All continuous envs
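
The truncation bootstrap entry is easiest to see with a small numeric example. The sketch below assumes the common convention of folding gamma * V(terminal_obs) into the reward of a truncated step, which is how SB3's on-policy rollout collection handles time limits. The function names and all values are made up for illustration, and this is not rlox's actual implementation.

```python
def bootstrap_truncated(rewards, truncated, terminal_values, gamma):
    """Add gamma * V(terminal_obs) to the reward of every truncated step."""
    return [
        r + gamma * v if trunc else r
        for r, trunc, v in zip(rewards, truncated, terminal_values)
    ]


def discounted_returns(rewards, dones, gamma):
    """Backward pass over one rollout; the running return resets at episode ends."""
    returns, running = [], 0.0
    for r, done in zip(reversed(rewards), reversed(dones)):
        running = r + (0.0 if done else gamma * running)
        returns.append(running)
    return returns[::-1]


# Hypothetical 3-step episode cut off by a time limit at the last step.
rewards = [1.0, 1.0, 1.0]
dones = [False, False, True]        # episode boundary (here: a truncation)
truncated = [False, False, True]
terminal_values = [0.0, 0.0, 10.0]  # critic estimate V(terminal_obs), made up
gamma = 0.5

without_fix = discounted_returns(rewards, dones, gamma)
with_fix = discounted_returns(
    bootstrap_truncated(rewards, truncated, terminal_values, gamma), dones, gamma
)
print(without_fix)
print(with_fix)
```

Without the bootstrap, a time-limit cutoff is treated like a true terminal state and every return target in the episode is biased low, which matters most in environments such as Hopper where good episodes routinely hit the time limit.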

Pre-fix notes (v5 results above)

  • PPO Hopper gap (628 vs 3577): Truncation bootstrap + normalization bugs. Fixed.
  • A2C CartPole instability (54 vs 500): Advantage normalization default. Fixed.
  • DQN underperformance: DQN results lag behind SB3 on both CartPole and MountainCar; under investigation.
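
One of the normalization bugs behind the PPO Hopper gap was the use of scalar observation statistics instead of per-dimension ones. A minimal sketch of per-dimension running statistics with Welford's algorithm is shown below. `PerDimRunningStats` is a hypothetical class written for illustration and does not reproduce the `RunningStatsVec` API.

```python
class PerDimRunningStats:
    """Running mean and variance per observation dimension (Welford's algorithm)."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations per dimension

    def update(self, obs):
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        """Population variance per dimension."""
        return [m2 / self.count for m2 in self.m2]

    def normalize(self, obs, eps=1e-8):
        """Center and scale each dimension with its own statistics."""
        return [
            (x - m) / (v + eps) ** 0.5
            for x, m, v in zip(obs, self.mean, self.variance())
        ]


# Two dimensions on very different scales, as is typical for MuJoCo observations.
stats = PerDimRunningStats(dim=2)
stats.update([1.0, 100.0])
stats.update([3.0, 300.0])
print(stats.mean, stats.variance())
print(stats.normalize([3.0, 300.0]))
```

With a single scalar mean and standard deviation, the large-scale dimension would dominate the statistics and the small-scale dimension would be squashed toward a constant.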

Candle Hybrid Collection Benchmark

Measured on Apple M-series, CartPole-v1, PPO (n_steps=128, n_epochs=4, hidden=64):

n_envs | Hybrid SPS | Standard SPS | Speedup | Collection %
4 | 32,460 | 18,779 | 1.73x | 45.6%
8 | 40,020 | 23,037 | 1.74x | 41.2%
16 | 47,863 | 32,204 | 1.49x | 30.7%
32 | 53,721 | 42,748 | 1.26x | 23.4%

The speedup is strongest at lower env counts (about 1.7x at 4-8 envs), where per-step Python dispatch overhead (~113 µs) dominates. With more envs, PyTorch's BLAS amortizes that overhead, narrowing the gap to 1.26x at 32 envs.

The Candle hybrid approach removes Python dispatch overhead from collection, which shifts the bottleneck toward the PyTorch training backward pass.
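
The amortization effect can be checked against the table by converting SPS into time per environment step. The sketch below is only a rough reading of the published numbers: SPS covers the whole training loop, so the difference between the two columns is not a direct measurement of dispatch overhead. The helper names are illustrative.

```python
# (n_envs, hybrid SPS, standard SPS) copied from the table above.
ROWS = [
    (4, 32_460, 18_779),
    (8, 40_020, 23_037),
    (16, 47_863, 32_204),
    (32, 53_721, 42_748),
]


def us_per_env_step(sps):
    """Average microseconds spent per environment step at a given SPS."""
    return 1e6 / sps


def saved_us(hybrid_sps, standard_sps):
    """Microseconds saved per environment step by the hybrid path."""
    return us_per_env_step(standard_sps) - us_per_env_step(hybrid_sps)


for n_envs, hybrid, standard in ROWS:
    print(
        f"{n_envs:>2} envs: {saved_us(hybrid, standard):5.1f} us saved per env step, "
        f"speedup {hybrid / standard:.2f}x"
    )
```

The time saved per environment step shrinks as n_envs grows, which is consistent with a fixed per-call cost being spread over more environments.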

Info

A v6 re-benchmark is planned using rlox v1.0.0 with all six convergence fixes applied. This will include multi-seed runs (5 seeds) with IQM statistics and learning curve plots. Results will be uploaded to gs://rkox-bench-results/convergence-*/.
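
For reference, the interquartile mean (IQM) is the mean of the middle 50% of scores, which makes it less sensitive to a single outlier seed than the plain mean. The sketch below is one simple way to compute it. It assumes the same truncation behaviour as `scipy.stats.trim_mean` with `proportiontocut=0.25`, which drops `int(0.25 * n)` values from each end. Whether the v6 analysis uses exactly this convention is an assumption, and the seed returns are hypothetical.

```python
from statistics import mean


def iqm(scores):
    """Interquartile mean: drop the lowest and highest 25% of scores, average the rest."""
    ordered = sorted(scores)
    cut = int(len(ordered) * 0.25)  # number of values trimmed from each end
    return mean(ordered[cut:len(ordered) - cut])


# Hypothetical final returns from 5 seeds, one of which collapsed.
seed_returns = [3400.0, 3550.0, 3600.0, 3700.0, 600.0]
print(f"mean = {mean(seed_returns):.1f}, IQM = {iqm(seed_returns):.1f}")
```

With 5 seeds this trims one value from each end and averages the middle three.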