[{"content":"Today we\u0026rsquo;re open-sourcing rlox, a reinforcement learning framework that applies the Polars architecture pattern to RL: a Rust data plane for the heavy lifting, a Python control plane for everything else.\nThe Problem If you\u0026rsquo;ve trained RL agents with Stable-Baselines3 or TorchRL, you\u0026rsquo;ve probably noticed something frustrating: your GPU sits idle while Python loops through environment steps, shuffles replay buffers, and computes advantages. The GIL turns embarrassingly parallel work into a serial bottleneck.\nThis isn\u0026rsquo;t a Python problem per se — it\u0026rsquo;s an architecture problem. Polars solved the same issue for DataFrames by pushing compute-intensive operations into Rust while keeping the user-facing API in Python. We asked: can the same pattern work for RL?\nThe Polars Pattern Before diving into rlox\u0026rsquo;s architecture, here\u0026rsquo;s the pattern it borrows from. Polars doesn\u0026rsquo;t try to make Python faster — it moves the work out of Python entirely:\ngraph LR subgraph Traditional[\"Traditional RL (SB3 / TorchRL)\"] direction TB P1[Python: env.step] P2[Python: buffer.push] P3[Python: compute GAE] P4[Python: sample batch] P5[Python: optimizer.step] P1 --\u003e P2 --\u003e P3 --\u003e P4 --\u003e P5 end subgraph Polars[\u0026quot;rlox (Polars Pattern)\u0026quot;] direction TB R1[Rust: env.step ∥ Rayon] R2[Rust: buffer.push zero-copy] R3[Rust: compute GAE] R4[Rust: sample batch] PY[Python: optimizer.step] R1 --\u0026gt; R2 --\u0026gt; R3 --\u0026gt; R4 --\u0026gt; PY end Python only runs where it adds value: neural network training via PyTorch. 
Everything else is Rust.

The Architecture

The full system has three layers connected by PyO3:

graph TB
  subgraph Python["Python Control Plane"]
    API[Researcher API<br/>train / evaluate / sweep]
    Torch[PyTorch<br/>Autograd & Models]
    HF[HuggingFace<br/>Transformers & Datasets]
    WB[W&B / MLflow<br/>Logging]
  end
  subgraph Rust["Rust Data Plane (rlox-core)"]
    ENV[Environment Engine<br/>parallel stepping via Rayon]
    BUF[Experience Store<br/>ring, mmap, priority buffers]
    LOOP[Training Orchestrator<br/>GAE, V-trace, GRPO, batching]
    SER[Serialization<br/>zero-copy Arrow/numpy]
    DIST[Distribution Layer<br/>gRPC workers, pipeline]
  end
  subgraph Envs["Environment Backends"]
    GYM[Gymnasium<br/>via PyO3 bridge]
    LLM_ENV[LLM Generation<br/>vLLM / TGI / SGLang]
    CUSTOM[Custom Rust Envs<br/>CartPole built-in]
  end
  API -->|PyO3 FFI| ENV
  API -->|PyO3 FFI| BUF
  API -->|PyO3 FFI| LOOP
  Torch <-->|zero-copy tensors| SER
  HF <-->|tokenized batches| SER
  ENV --> GYM
  ENV --> LLM_ENV
  ENV --> CUSTOM
  ENV -->|transitions| BUF
  BUF -->|batches| LOOP
  LOOP -->|grads request| Torch
  LOOP <-->|distributed sync| DIST

The boundary is deliberate. Everything above the line is where researchers spend their time — algorithm logic, hyperparameter tuning, experiment configs. Everything below is plumbing that should be fast and invisible.
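"Zero-copy" is the same mechanism NumPy uses for views: Python receives a pointer into memory that something else owns, rather than a duplicate of the bytes. A small illustration of the idea, with a bytearray standing in for a Rust-owned buffer (a sketch of the concept, not rlox's actual FFI code):

```python
import numpy as np

# A bytearray stands in for a buffer owned on the Rust side of the FFI.
backing = bytearray(8 * 4)                       # room for 4 float64 values
view = np.frombuffer(backing, dtype=np.float64)  # zero-copy: no bytes duplicated

view[:] = [1.0, 2.0, 3.0, 4.0]                   # writes land in the shared memory
assert bytes(backing[:8]) == np.float64(1.0).tobytes()  # visible via the original buffer

# By contrast, np.array(...) copies: mutating the copy leaves the source alone.
copied = np.array(view)
copied[0] = 99.0
assert view[0] == 1.0
```

This is why the batch handed back to PyTorch costs nothing to "transfer": the tensor is just a view over memory the Rust data plane already filled.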
PyO3 connects the two with zero-copy where possible.

Data Flow: One Training Step

Here’s what happens during a single PPO training iteration:

sequenceDiagram
  participant P as Python (PPOTrainer)
  participant R as Rust (rlox-core)
  participant E as Environments (Rayon)
  participant T as PyTorch
  P->>R: collect_rollout(policy)
  R->>E: step_all(actions) [parallel]
  E-->>R: obs, rewards, dones
  R->>R: buffer.push(transitions)
  R->>R: compute_gae(rewards, values)
  R-->>P: RolloutBatch (zero-copy)
  P->>T: forward + backward pass
  T-->>P: gradients
  P->>P: optimizer.step()
  P->>P: log metrics, callbacks

The critical insight: Rust handles steps 2-6 (the data plane) as a single fused operation. There’s no Python interpreter overhead between env stepping, buffer storage, and advantage computation — it’s one Rust call that returns a ready-to-train batch.

Crate Architecture

The Rust side is organized as a multi-crate workspace, each crate with a single responsibility:

graph TB
  subgraph Workspace["rlox workspace"]
    CORE[rlox-core<br/>envs, buffers, GAE,<br/>V-trace, GRPO, pipeline]
    NN[rlox-nn<br/>ActorCritic, QFunction,<br/>StochasticPolicy traits]
    BURN[rlox-burn<br/>Burn Autodiff NdArray]
    CANDLE[rlox-candle<br/>Candle CPU inference]
    GRPC[rlox-grpc<br/>tonic gRPC workers]
    PY[rlox-python<br/>PyO3 bindings]
  end
  NN --> BURN
  NN --> CANDLE
  CORE --> NN
  CORE --> GRPC
  PY --> CORE
  PY --> NN

What’s Fast and Why

We benchmarked rlox against SB3 and TorchRL on an Apple M4, with bootstrap 95% confidence intervals (10,000 resamples). Every result reported below is statistically significant.

GAE: 140-1,700x faster

Generalized Advantage Estimation is a sequential backward scan — the kind of workload where Python’s interpreter overhead dominates.
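The scan computes delta_t = r_t + gamma * (1 - done_t) * V_{t+1} - V_t, then A_t = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}, walking the trajectory backward. A NumPy reference implementation of that recurrence (a sketch for clarity; rlox's Rust version is what the benchmarks measure, and this is not its actual code):

```python
import numpy as np

def compute_gae_reference(rewards, values, dones, last_value,
                          gamma=0.99, lam=0.95):
    """Pure-NumPy GAE: the sequential backward scan moved into Rust by rlox.

    rewards, values, dones are 1-D arrays of length T; last_value is V(s_T).
    Returns (advantages, returns).
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in range(T - 1, -1, -1):           # backward: A_t depends on A_{t+1}
        nonterminal = 1.0 - dones[t]         # zero out bootstrap across episode ends
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + values            # regression targets for the critic
    return advantages, returns
```

Each iteration depends on the previous one, so the loop can't be vectorized away in NumPy — which is exactly why compiling the scan pays off so disproportionately here.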
rlox runs it as a tight Rust loop:

| Trajectory | rlox | NumPy Loop | TorchRL | vs NumPy | vs TorchRL |
|---|---|---|---|---|---|
| 128 steps | 0.7 us | 34 us | 453 us | 51x | 679x |
| 2,048 steps | 4.0 us | 558 us | 6,798 us | 139x | 1,700x |
| 32,768 steps | 60 us | 8,906 us | 108,441 us | 147x | 1,791x |

Buffers: 10-148x faster

Replay buffers are the RL equivalent of DataFrame append + sample. rlox uses pre-allocated ring buffers with ChaCha8 RNG:

| Operation | rlox | TorchRL | SB3 | vs TorchRL | vs SB3 |
|---|---|---|---|---|---|
| Push 10K transitions | 1.5 ms | 229 ms | 15 ms | 148x | 9.7x |
| Sample batch=1024 | 9.2 us | 96 us | 75 us | 10x | 8.1x |

End-to-End: 3.9-53x faster

The advantages compound across the pipeline — step, store, compute GAE:

| Config | rlox | SB3 | TorchRL | vs SB3 | vs TorchRL |
|---|---|---|---|---|---|
| 256 envs x 2048 steps | 539 ms | 2,080 ms | 28,432 ms | 3.9x | 53x |

Convergence: Same Rewards, Faster Wall-Clock

Raw throughput doesn’t matter if the agent doesn’t learn. We ran PPO and A2C with identical hyperparameters (rl-zoo3 defaults), 5 seeds each:

| Algorithm | Environment | rlox Wall-clock | SB3 Wall-clock | Speedup |
|---|---|---|---|---|
| PPO | CartPole-v1 | 1.6s | 5.2s | 3.3x |
| A2C | CartPole-v1 | 1.8s | 2.1s | 1.2x |
| PPO | Acrobot-v1 | 6.4s | 9.1s | 1.4x |

Both frameworks converge to the same reward thresholds — rlox just gets there faster because the data plane isn’t waiting on Python.

Training Throughput (Steps Per Second)

On-policy algorithms (PPO, A2C) show 1.6-2.5x SPS improvements thanks to Rust GAE. Off-policy algorithms (SAC, TD3) are bottlenecked by single-env stepping and NN updates, as expected.

Learning Curves

PPO on CartPole-v1 — rlox converges to the same reward, 3.3x faster wall-clock.

PPO on Acrobot-v1 — both converge to ~-83, rlox reaches threshold 1.4x faster.

A2C on CartPole-v1 — matched convergence, rlox 2.5x faster throughput.

Performance Profile (Agarwal et al., 2021)

Aggregated across all environments. On the on-policy subset (PPO, A2C), rlox matches SB3’s convergence while training 1.4-3.3x faster.

Beyond Classic RL: LLM Post-Training

rlox isn’t just for CartPole.
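As a concrete taste of the LLM post-training side, GRPO's group-relative advantages score each sampled completion against the mean and standard deviation of the other completions drawn for the same prompt. A NumPy sketch of that normalization step (illustrative only; the function name and shapes here are ours, not rlox's API):

```python
import numpy as np

def grpo_advantages(rewards, group_size, eps=1e-8):
    """Group-relative advantages: normalize each completion's reward against
    the other samples drawn for the same prompt. Sketch, not rlox's code."""
    r = np.asarray(rewards, dtype=np.float64).reshape(-1, group_size)
    mean = r.mean(axis=1, keepdims=True)         # per-prompt baseline
    std = r.std(axis=1, keepdims=True)           # per-prompt scale
    return ((r - mean) / (std + eps)).reshape(-1)

# Two prompts, four completions each; 0/1 rewards from e.g. an answer verifier.
rewards = [1.0, 0.0, 0.0, 1.0,   # prompt 0
           0.0, 0.0, 0.0, 1.0]   # prompt 1
adv = grpo_advantages(rewards, group_size=4)
```

Because the baseline comes from the group itself, no learned value model is needed — the only heavy part is doing this (plus the token-level KL) fast across thousands of completions, which is the piece that lives in Rust.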
We built first-class support for LLM post-training:

- GRPO and DPO with Rust-accelerated advantage computation (35x faster than NumPy/PyTorch)
- Token-level KL divergence computed in Rust
- Sequence packing for efficient batching
- vLLM, TGI, and SGLang inference backends with a unified factory interface
- Multi-GPU training via PyTorch DDP composition

    from rlox.algorithms import GRPO

    def math_reward(completions, prompts):
        return [1.0 if verify_answer(c) else 0.0 for c in completions]

    grpo = GRPO(model=my_llm, ref_model=ref_llm, reward_fn=math_reward)
    grpo.train(prompts, n_epochs=3)

The Rust Crate Ecosystem

rlox is a multi-crate Rust workspace, published on crates.io:

- rlox-core — environments, buffers, GAE, V-trace, GRPO, pipeline orchestration
- rlox-nn — RL algorithm traits (ActorCritic, QFunction, StochasticPolicy)
- rlox-burn — Burn backend for pure-Rust training
- rlox-candle — Candle backend for low-latency CPU inference

You can use these crates independently in Rust projects without Python at all.

Getting Started

    pip install rlox

Train PPO on CartPole:

    from rlox.trainers import PPOTrainer

    trainer = PPOTrainer(env="CartPole-v1", seed=42)
    metrics = trainer.train(total_timesteps=50_000)
    print(f"Mean reward: {metrics['mean_reward']:.1f}")

Or use the Rust primitives directly for maximum control:

    import rlox

    advantages, returns = rlox.compute_gae(
        rewards, values, dones, last_value, gamma=0.99, lam=0.95
    )

    env = rlox.VecEnv(n=256, seed=42, env_id="CartPole-v1")
    result = env.step_all(actions)

What’s Next

- More convergence benchmarks across MuJoCo and Atari environments
- GPU-accelerated environment stepping
- Broader LLM post-training coverage (online DPO, RLAIF pipelines)
- Community-contributed Rust environments

Links

- GitHub: github.com/riserally/rlox
- PyPI: pypi.org/project/rlox
- crates.io: crates.io/crates/rlox-core
- Docs: riserally.github.io/rlox
- License: MIT or Apache 2.0

We’d love to
hear from you — open an issue, start a discussion, or try pip install rlox and let us know what you think.

Citation

If you use rlox in your research, please cite:

    @software{kowalinski2026rlox,
      author  = {Kowalinski, Wojciech},
      title   = {rlox: Rust-Accelerated Reinforcement Learning},
      year    = {2026},
      url     = {https://github.com/riserally/rlox},
      version = {1.0.0},
      license = {MIT OR Apache-2.0}
    }
","permalink":"https://riserally.github.io/rlox/blog/posts/introducing-rlox/","summary":"\u003cp\u003eToday we\u0026rsquo;re open-sourcing \u003cstrong\u003erlox\u003c/strong\u003e, a reinforcement learning framework that applies the \u003ca href=\"https://pola.rs/\"\u003ePolars architecture pattern\u003c/a\u003e to RL: a Rust data plane for the heavy lifting, a Python control plane for everything else.\u003c/p\u003e\n\u003ch2 id=\"the-problem\"\u003eThe Problem\u003c/h2\u003e\n\u003cp\u003eIf you\u0026rsquo;ve trained RL agents with Stable-Baselines3 or TorchRL, you\u0026rsquo;ve probably noticed something frustrating: your GPU sits idle while Python loops through environment steps, shuffles replay buffers, and computes advantages. The GIL turns embarrassingly parallel work into a serial bottleneck.\u003c/p\u003e","title":"Introducing rlox: Rust-Accelerated Reinforcement Learning"}]