References

This document lists the academic papers and software projects that rlox implements or builds upon. References are numbered to match citations in the Mathematical Reference.

Core Algorithms

[1] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal Policy Optimization Algorithms," arXiv preprint arXiv:1707.06347, 2017. - Clipped surrogate objective (PPO-Clip) - Implemented in: rlox.algorithms.ppo.PPO, rlox.losses.PPOLoss
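
The clipped surrogate objective named in [1] can be illustrated for a single sample as below. This is a minimal sketch, not rlox's `PPOLoss`; the function name `ppo_clip_objective_sketch` and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import math


def ppo_clip_objective_sketch(new_logp, old_logp, advantage, clip_eps=0.2):
    """Per-sample PPO-Clip objective (to be maximised); hypothetical helper."""
    # Probability ratio between the new and the old policy.
    ratio = math.exp(new_logp - old_logp)
    # Clip the ratio to the interval [1 - eps, 1 + eps].
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Take the pessimistic (lower) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops growing once the ratio exceeds `1 + clip_eps`; with a negative advantage, it stops improving once the ratio falls below `1 - clip_eps`.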

[2] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-Dimensional Continuous Control Using Generalized Advantage Estimation," in Proc. ICLR, 2016. - GAE: exponentially-weighted average of multi-step TD errors - Implemented in: rlox_core::training::gae::compute_gae, rlox.compute_gae
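
The GAE recursion from [2] can be sketched in pure Python as follows. This is an illustrative reimplementation under stated assumptions, not the code behind `rlox.compute_gae`; the name `gae_sketch`, its argument order, and the defaults for `gamma` and `lam` are hypothetical.

```python
def gae_sketch(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one trajectory; hypothetical helper."""
    n = len(rewards)
    advantages = [0.0] * n
    running = 0.0
    # Walk backwards so each step reuses the accumulated advantage of the next.
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, reset at episode ends.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = gae_sketch([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                      [False, False, True], gamma=1.0, lam=1.0)
print(adv)  # [3.0, 2.0, 1.0]
```

Setting `lam=0.0` reduces the estimate to the one-step TD error, while `lam=1.0` recovers the Monte Carlo return minus the value baseline.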

[3] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor," in Proc. ICML, 2018, pp. 1861--1870. - Entropy-regularised RL, twin Q-networks, squashed Gaussian policy - Implemented in: rlox.algorithms.sac.SAC

[4] S. Fujimoto, H. van Hoof, and D. Meger, "Addressing Function Approximation Error in Actor-Critic Methods," in Proc. ICML, 2018, pp. 1587--1596. - TD3: twin critics, delayed policy updates, target policy smoothing - Implemented in: rlox.algorithms.td3.TD3

[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529--533, 2015. - DQN: deep Q-learning with experience replay and target networks - Implemented in: rlox.algorithms.dqn.DQN

DQN Extensions (Rainbow Components)

[6] H. van Hasselt, A. Guez, and D. Silver, "Deep Reinforcement Learning with Double Q-learning," in Proc. AAAI, 2016, pp. 2094--2100. - Double DQN: decoupled action selection and evaluation - Implemented in: rlox.algorithms.dqn.DQN (double_dqn=True)

[7] Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, "Dueling Network Architectures for Deep Reinforcement Learning," in Proc. ICML, 2016, pp. 1995--2003. - Dueling architecture: separate value and advantage streams - Implemented in: rlox.networks.DuelingQNetwork, rlox.algorithms.dqn.DQN (dueling=True)

[8] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized Experience Replay," in Proc. ICLR, 2016. - Sum-tree based proportional prioritisation with importance-sampling correction - Implemented in: rlox_core::buffer::priority::PrioritizedReplayBuffer, rlox.PrioritizedReplayBuffer

Distributed and Off-Policy Correction

[9] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu, "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures," in Proc. ICML, 2018, pp. 1407--1416. - V-trace: clipped importance weight off-policy correction - Implemented in: rlox_core::training::vtrace::compute_vtrace, rlox.compute_vtrace

LLM Post-Training

[10] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," in Proc. NeurIPS, 2023. - DPO: bypasses reward modelling by directly optimising from preferences - Implemented in: rlox.algorithms.dpo.DPO, rlox_core::llm::ops::DPOPair
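
The DPO objective from [10] for a single preference pair can be sketched as below. It assumes the caller already has summed sequence log-probabilities under the policy and the frozen reference model; the function name `dpo_loss_sketch`, its parameters, and the default `beta=0.1` are hypothetical and do not reflect rlox's `DPO` class.

```python
import math


def dpo_loss_sketch(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)); hypothetical helper."""
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference model the margin is zero and the loss is `log(2)`; the loss decreases as the policy shifts probability toward the chosen response.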

[11] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo, "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models," arXiv preprint arXiv:2402.03300, 2024. - GRPO: group-relative policy optimization, eliminates learned value baseline - Implemented in: rlox_core::llm::ops::compute_group_advantages, rlox.algorithms.grpo.GRPO
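
The group-relative baseline in [11] replaces a learned value function with statistics of the rewards sampled for one prompt. A minimal sketch, assuming population standard deviation and a small `eps` for numerical safety (both are assumptions, and `group_advantages_sketch` is a hypothetical name, not the signature of `compute_group_advantages`):

```python
from statistics import fmean, pstdev


def group_advantages_sketch(rewards, eps=1e-8):
    """Normalise rewards within one group of completions for the same prompt."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption for this sketch
    # Each completion is scored relative to its siblings, not a critic.
    return [(r - mean) / (std + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that group contributes no policy gradient.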

Multi-Agent RL

[17] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, "The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games," in Proc. NeurIPS, 2022. - MAPPO: PPO with centralized critic for multi-agent cooperation - Implemented in: rlox.algorithms.mappo.MAPPO - https://arxiv.org/abs/2103.01955

Model-Based RL

[18] D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, "Mastering Diverse Domains through World Models," arXiv preprint arXiv:2301.04104, 2023. - DreamerV3: world model with actor-critic in latent space - Implemented in: rlox.algorithms.dreamer.DreamerV3 - https://arxiv.org/abs/2301.04104

Offline RL

[19] S. Fujimoto and S. S. Gu, "A Minimalist Approach to Offline Reinforcement Learning," in Proc. NeurIPS, 2021. - TD3+BC: TD3 with behavioral cloning regularization - Implemented in: rlox.algorithms.td3_bc.TD3BC - https://arxiv.org/abs/2106.06860

[20] I. Kostrikov, A. Nair, and S. Levine, "Offline Reinforcement Learning with Implicit Q-Learning," in Proc. ICLR, 2022. - IQL: avoids OOD actions via expectile regression on the value function - Implemented in: rlox.algorithms.iql.IQL - https://arxiv.org/abs/2110.06169

[21] A. Kumar, A. Zhou, G. Tucker, and S. Levine, "Conservative Q-Learning for Offline Reinforcement Learning," in Proc. NeurIPS, 2020. - CQL: penalizes Q-values for out-of-distribution actions - Implemented in: rlox.algorithms.cql.CQL - https://arxiv.org/abs/2006.04779

Imitation Learning

[22] M. Bain and C. Sammut, "A Framework for Behavioural Cloning," in Machine Intelligence 15, pp. 103--129, 1995. - Behavioral cloning: supervised learning on expert demonstrations - Implemented in: rlox.algorithms.bc.BC

LLM Alignment (Additional)

[23] S. Guo, B. Zhang, T. Liu, T. Liu, M. Khalman, et al., "Direct Language Model Alignment from Online AI Feedback," arXiv preprint arXiv:2402.04792, 2024. - Online DPO: extends DPO to the online generation setting - Implemented in: rlox.algorithms.online_dpo.OnlineDPO - https://arxiv.org/abs/2402.04792

[24] R. Nakano, J. Hilton, S. Balaji, J. Wu, et al., "WebGPT: Browser-assisted question-answering with human feedback," arXiv preprint arXiv:2112.09332, 2021. - Best-of-N sampling as an alignment baseline - Implemented in: rlox.algorithms.best_of_n.BestOfN - https://arxiv.org/abs/2112.09332

[25] L. Gao, J. Schulman, and J. Hilton, "Scaling Laws for Reward Model Overoptimization," in Proc. ICML, 2023. - Analysis of best-of-N vs RL fine-tuning overoptimization - https://arxiv.org/abs/2210.10760

Evaluation Methodology

[26] R. Agarwal, M. Schwarzer, P. S. Castro, A. Courville, and M. G. Bellemare, "Deep Reinforcement Learning at the Edge of the Statistical Precipice," in Proc. NeurIPS, 2021. - IQM, performance profiles, stratified bootstrap CIs for RL evaluation - Implemented in: rlox.evaluation
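
The interquartile mean (IQM) recommended in [26] discards the lowest and highest quarter of scores and averages the rest. The sketch below is illustrative only and is not the `rlox.evaluation` API; the name `iqm_sketch` is hypothetical, and dropping `floor(n / 4)` scores from each end when `n` is not divisible by 4 is an assumption.

```python
from statistics import fmean


def iqm_sketch(scores):
    """Mean of the middle 50% of scores; hypothetical helper."""
    ordered = sorted(scores)
    n = len(ordered)
    cut = n // 4  # number of scores dropped from each tail
    return fmean(ordered[cut:n - cut])


print(iqm_sketch([0, 1, 2, 3, 4, 5, 6, 100]))  # 3.5
```

In the usage line the outlier `100` does not affect the result, which is why IQM is more robust than the mean while using more of the data than the median.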

Implementation References

[27] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning," in Proc. ICML, 2016, pp. 1928--1937. - A3C (asynchronous advantage actor-critic); A2C is its synchronous variant - Implemented in: rlox.algorithms.a2c.A2C

[28] M. Andrychowicz, A. Raichuk, P. Stanczyk, M. Orsini, S. Girgin, R. Marinier, L. Hussenot, M. Geist, O. Pietquin, M. Michalski, S. Gelly, and O. Bachem, "What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study," arXiv preprint arXiv:2006.05990, 2020. - Empirical analysis of on-policy RL implementation details (orthogonal init, advantage normalisation, etc.) - Used for: initialisation and training practice choices in rlox.policies

Architecture Inspiration

[29] Polars contributors, "Polars: Blazingly fast DataFrames," https://pola.rs, 2024. - Architecture pattern: Rust data plane + Python control plane via PyO3 - rlox applies this pattern to RL: Rust handles environments, buffers, and numerical computation; Python handles training logic and neural networks

[30] S. Huang, R. F. J. Dossa, C. Ye, J. Braga, D. Chakraborty, K. Mehta, and J. G. M. Araújo, "CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms," Journal of Machine Learning Research, vol. 23, no. 274, pp. 1--18, 2022. - Reference implementations consulted for the hyperparameters and training practices of PPO, A2C, SAC, TD3, and DQN - rlox's default hyperparameters match CleanRL where applicable