Convergence Benchmark Results

27/32 Experiments Complete — v6 Re-benchmark Pending

Benchmark v5 ran on GCP and completed 27 of the 32 planned experiments. The five missing runs are TD3 Hopper-v4, TD3 Walker2d-v4, SAC Walker2d-v4 (SB3), A2C Acrobot-v1, and DQN Acrobot-v1. Six convergence bugs were identified and fixed in v0.3.0/v1.0.0 (see Known Issues below). A v6 re-benchmark will validate these fixes with multi-seed runs and IQM (interquartile mean) statistics.

Methodology

Frameworks: rlox v0.2.3 vs Stable-Baselines3 (v6 will use rlox v1.0.0)
Hardware: GCP e2-standard-8 (8 vCPU, 32 GB RAM), CPU-only
Environments: CartPole-v1, Pendulum-v1, HalfCheetah-v4, Hopper-v4, Walker2d-v4, Acrobot-v1, MountainCar-v0
Algorithms: PPO, SAC, TD3, DQN, A2C
Evaluation: 30 episodes every 10K steps, deterministic policy
Seeds: Single seed (multi-seed planned for next run)
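
The evaluation protocol listed above can be sketched roughly as follows. This is a minimal illustration and not the benchmark harness itself: `evaluate`, `eval_checkpoints`, and the toy `CountdownEnv` are hypothetical names, and the environment is assumed to follow the Gymnasium-style `reset`/`step` signature.

```python
from statistics import mean


class CountdownEnv:
    """Toy stand-in environment (hypothetical): reward 1.0 per step, ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.horizon
        # Gymnasium-style 5-tuple: obs, reward, terminated, truncated, info
        return self.t, 1.0, terminated, False, {}


def evaluate(policy, env, n_episodes=30):
    """Mean undiscounted return over `n_episodes` with a deterministic policy."""
    returns = []
    for _ in range(n_episodes):
        obs, _info = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy(obs)  # deterministic: same obs always gives same action
            obs, reward, terminated, truncated, _info = env.step(action)
            episode_return += reward
            done = terminated or truncated
        returns.append(episode_return)
    return mean(returns)


def eval_checkpoints(total_steps, eval_every=10_000):
    """Training-step counts at which an evaluation would be triggered."""
    return list(range(eval_every, total_steps + 1, eval_every))


score = evaluate(lambda obs: 0, CountdownEnv(horizon=5))
checkpoints = eval_checkpoints(100_000)
```

Under this assumed schedule, a 100K-step run yields 10 evaluation points of 30 episodes each.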

Convergence Results

Algorithm	Environment	Steps	rlox Return	SB3 Return	Winner
PPO	CartPole-v1	100K	453.9	420.9	rlox
PPO	Acrobot-v1	500K	-88.5	-118.1	rlox
PPO	HalfCheetah-v4	1M	4225.6	3142.5	rlox
PPO	Hopper-v4	1M	628.1	3577.5	SB3
PPO	Walker2d-v4	2M	5007.4	4384.3	rlox
SAC	Pendulum-v1	50K	-168.5	-167.1	Tie
SAC	HalfCheetah-v4	1M	11468.1	10562.7	rlox
SAC	Hopper-v4	1M	3290.6	3170.2	rlox
SAC	Walker2d-v4	2M	4978.0	--	rlox*
TD3	Pendulum-v1	50K	-162.7	-169.4	Tie
TD3	HalfCheetah-v4	1M	10400.4	9899.3	rlox
DQN	CartPole-v1	100K	164.8	195.8	SB3
DQN	MountainCar-v0	500K	-178.7	-109.5	SB3
A2C	CartPole-v1	100K	53.8	500.0	SB3

* The SB3 experiment for this pair has not completed, so no head-to-head comparison is available yet.

Speed Comparison

Algorithm	Environment	rlox SPS	SB3 SPS	Speedup
PPO	CartPole-v1	1,691	687	2.46x
PPO	Acrobot-v1	2,520	1,306	1.93x
PPO	HalfCheetah-v4	800	437	1.83x
PPO	Hopper-v4	1,237	770	1.61x
PPO	Walker2d-v4	931	762	1.22x
SAC	Pendulum-v1	46	42	1.11x
SAC	HalfCheetah-v4	42	63	0.68x
SAC	Hopper-v4	77	66	1.18x
SAC	Walker2d-v4	75	--	--
TD3	Pendulum-v1	76	65	1.17x
TD3	HalfCheetah-v4	117	101	1.16x
DQN	CartPole-v1	462	642	0.72x
DQN	MountainCar-v0	479	634	0.76x
A2C	CartPole-v1	2,028	489	4.15x
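
The Speedup column reads as rlox SPS divided by SB3 SPS, so values below 1.0 mean SB3 was faster. A small sketch of that calculation using two rows from the table (the function name is illustrative, and recomputing from the rounded SPS values shown can differ from the table in the last digit):

```python
def speedup(rlox_sps, sb3_sps):
    """Ratio of rlox throughput to SB3 throughput, both in steps per second."""
    if sb3_sps <= 0:
        raise ValueError("sb3_sps must be positive")
    return rlox_sps / sb3_sps


ppo_cartpole = round(speedup(1691, 687), 2)  # PPO CartPole-v1 row
dqn_cartpole = round(speedup(462, 642), 2)   # DQN CartPole-v1 row, SB3 is faster here
```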

Known Issues (Fixed in v0.3.0 / v1.0.0)

All six convergence bugs identified during v5 benchmarking have been fixed. The v6 re-benchmark will validate these fixes.

Bug	Fix (v0.3.0)	Affected Results
Truncation bootstrap missing	V(terminal_obs) bootstrap for truncated episodes	PPO Hopper (628 vs 3577)
Scalar obs normalization	Per-dimension Welford stats via `RunningStatsVec`	All MuJoCo envs
Raw reward normalization	Return-based std (SB3 convention)	All normalized envs
Train/collect obs mismatch	Consistent normalization via `VecNormalize` wrapper	All normalized envs
A2C advantage normalization	Default changed to False for small batches	A2C CartPole (54 vs 500)
log_std init = -0.5	Changed to 0.0 (std=1.0, matching SB3)	All continuous envs
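
Per-dimension Welford statistics, as named in the table, keep a separate running mean and variance for each observation component instead of one scalar pair for the whole vector. The sketch below is a generic pure-Python illustration of that idea; `RunningStatsSketch` is a hypothetical class and should not be read as the `RunningStatsVec` implementation.

```python
import math


class RunningStatsSketch:
    """Hypothetical per-dimension running mean/variance using Welford's update."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations, per dimension
        self.eps = eps

    def update(self, obs):
        """Fold one observation vector into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def variance(self):
        """Population variance per dimension (1.0 before any data is seen)."""
        if self.count == 0:
            return [1.0] * len(self.mean)
        return [m2 / self.count for m2 in self.m2]

    def normalize(self, obs):
        """Scale each component by its own mean and standard deviation."""
        return [
            (x - m) / math.sqrt(v + self.eps)
            for x, m, v in zip(obs, self.mean, self.variance())
        ]


stats = RunningStatsSketch(dim=2)
for obs in ([1.0, 10.0], [3.0, 30.0]):
    stats.update(obs)
normalized = stats.normalize([3.0, 30.0])
```

With dimensions on very different scales, as in the two-component example, a single scalar mean and standard deviation would leave one component dominating the other after normalization.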

Pre-fix notes (v5 results above)

PPO Hopper gap (628 vs 3577): Truncation bootstrap + normalization bugs. Fixed.
A2C CartPole instability (54 vs 500): Advantage normalization default. Fixed.
DQN underperformance: DQN results lag behind SB3 on both CartPole and MountainCar; under investigation.
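
The truncation bootstrap behind the PPO Hopper fix distinguishes a time-limit truncation from a true terminal state: when an episode is cut off, the final reward is augmented with the discounted critic estimate V(terminal_obs), so the return target does not treat the cut-off as a zero-value state. A minimal sketch, with hypothetical names and the critic value passed in as a plain number:

```python
def bootstrapped_reward(reward, terminated, truncated, terminal_value, gamma=0.99):
    """Return the reward to store for the last step of an episode.

    `terminal_value` stands for V(terminal_obs) from the critic (an assumed input).
    Only truncated, non-terminated episodes receive the bootstrap term.
    """
    if truncated and not terminated:
        return reward + gamma * terminal_value
    return reward


cut_off = bootstrapped_reward(1.0, terminated=False, truncated=True, terminal_value=10.0)
fell_over = bootstrapped_reward(1.0, terminated=True, truncated=False, terminal_value=10.0)
```

Without the bootstrap term, both cases would store the same reward, and the value targets near the time limit would be biased low.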

Candle Hybrid Collection Benchmark

Measured on Apple M-series, CartPole-v1, PPO (n_steps=128, n_epochs=4, hidden=64):

n_envs	Hybrid SPS	Standard SPS	Speedup	Collection %
4	32,460	18,779	1.73x	45.6%
8	40,020	23,037	1.74x	41.2%
16	47,863	32,204	1.49x	30.7%
32	53,721	42,748	1.26x	23.4%

The speedup is largest at low env counts (about 1.7x at 4 to 8 envs), where per-step Python dispatch overhead (~113 µs) dominates. With more envs, PyTorch's BLAS amortizes that overhead across a larger batch, narrowing the gap to 1.26x at 32 envs.

The Candle hybrid approach removes Python dispatch overhead from collection, which shifts the bottleneck toward the PyTorch training backward pass.

Info

A v6 re-benchmark is planned using rlox v1.0.0 with all six convergence fixes applied. This will include multi-seed runs (5 seeds) with IQM statistics and learning curve plots. Results will be uploaded to gs://rkox-bench-results/convergence-*/.
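
For reference, the interquartile mean (IQM) discards the lowest and highest 25% of scores and averages the rest, which makes it less sensitive to a single outlier seed than the plain mean. The sketch below is a simple pure-Python version that truncates the cut to a whole number of samples; the function name and the five scores are made up for illustration and are not benchmark results.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of `scores` (cut rounded down to whole samples)."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    cut = int(len(ordered) * 0.25)  # samples dropped from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Illustrative per-seed returns with one outlier (not real data)
seed_returns = [100.0, 2.0, 3.0, 4.0, 1.0]
iqm = interquartile_mean(seed_returns)
plain_mean = sum(seed_returns) / len(seed_returns)
```

With five seeds this drops one score from each end and averages the remaining three, so the outlier in the example moves the plain mean but not the IQM.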