Top 10 RL Algorithms (2026): Research Survey

A comprehensive survey of the most widely used reinforcement learning algorithms as of early 2026, covering classic RL, LLM post-training, model-based methods, offline RL, and multi-agent systems.

Algorithm Index

#	Algorithm	Document	Category
1	PPO	Proximal Policy Optimization	Classic RL / LLM
2	SAC	Soft Actor-Critic	Continuous Control
3	DQN/Rainbow	Deep Q-Network + Improvements	Discrete Control
4	TD3	Twin Delayed DDPG	Continuous Control
5	GRPO	Group Relative Policy Optimization	LLM Reasoning
6	DPO	Direct Preference Optimization	LLM Alignment
7	RLHF-PPO	RL from Human Feedback	LLM Alignment
8	Dreamer	World Model RL (v1/v2/v3)	Model-Based
9	Decision Transformer	RL via Sequence Modeling	Offline RL
10	MAPPO	Multi-Agent PPO	Multi-Agent

Taxonomy

Reinforcement Learning Algorithms (2026)
├── Classic RL
│   ├── On-Policy, Model-Free
│   │   ├── PPO (continuous + discrete)
│   │   └── MAPPO (multi-agent)
│   ├── Off-Policy, Model-Free
│   │   ├── DQN / Rainbow (discrete)
│   │   ├── SAC (continuous, max-entropy)
│   │   └── TD3 (continuous, deterministic)
│   ├── Model-Based
│   │   └── Dreamer v1/v2/v3 (world model + imagination)
│   └── Offline / Sequence Modeling
│       └── Decision Transformer
└── LLM Post-Training
    ├── Online RL
    │   ├── RLHF-PPO (reward model + PPO)
    │   └── GRPO (critic-free, group-relative)
    └── Offline / Preference-Based
        └── DPO (supervised from preferences)

Comparative Summary

Algorithm	On/Off-Policy	Actions	Sample Eff.	Memory	Primary Domain
PPO	On	Both	Low	Low	General RL, LLM
SAC	Off	Continuous	High	Medium	Robotics, Control
DQN/Rainbow	Off	Discrete	Medium	Medium	Games, Discrete
TD3	Off	Continuous	High	Medium	Control
GRPO	On	Token seq.	Low	High*	LLM Reasoning
DPO	Offline	Token seq.	N/A	Medium*	LLM Alignment
RLHF-PPO	On	Token seq.	Low	Very High*	LLM Alignment
Dreamer	Both	Both	Very High	Medium	Visual Control
DT	Offline	Both	N/A	Low	Offline RL
MAPPO	On	Both	Low	Low-Med	Multi-Agent

*Memory for LLM algorithms is dominated by model parameters (billions), not buffers.

Pros and Cons

PPO

Pros: Simple, parallelizable, works for both classic RL and LLMs, well-tested in production
Cons: Sample inefficient, implementation-detail sensitive, requires critic (expensive for LLMs)
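
The stability behind PPO's "simple" reputation comes from its clipped surrogate objective [1]. The following is a minimal pure-Python sketch of that objective for a batch of samples; the function name `ppo_clip_objective` and the default `clip_eps=0.2` are illustrative assumptions, and a real implementation would operate on tensors and add value and entropy terms.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    ratios:     pi_new(a|s) / pi_old(a|s) for each sample
    advantages: advantage estimate for each sample
    clip_eps:   clipping range (illustrative default, an assumption)
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Restrict the probability ratio to [1 - eps, 1 + eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the pessimistic (smaller) of the two surrogate terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)
```

With a positive advantage, the objective stops rewarding the policy once the ratio exceeds `1 + clip_eps`, which is what limits the size of each policy update.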

SAC

Pros: Sample efficient (off-policy), automatic exploration (entropy), robust policies
Cons: Continuous actions only in its standard form, replay buffer memory overhead, harder to parallelize than on-policy methods
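
The "automatic exploration (entropy)" point can be made concrete with the soft Bellman target used to train SAC's critics [6, 7]. This is a sketch for a single transition under stated assumptions: the function name `sac_target` and the defaults for `gamma` and `alpha` are illustrative, and the next-state Q-values and log-probability are passed in as plain numbers rather than computed by networks.

```python
def sac_target(reward, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """Soft TD target for one transition.

    q1_next, q2_next: target-critic values at (s', a') with a' sampled from the policy
    logp_next:        log pi(a'|s') for that sampled action
    alpha:            entropy temperature (illustrative default, an assumption)
    """
    # The entropy bonus (-alpha * log pi) raises the value of states
    # where the policy is still uncertain, which encourages exploration.
    soft_value = min(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value
```

Setting `alpha=0` recovers a clipped double-Q target with no entropy term, which is close to the target TD3 uses.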

DQN/Rainbow

Pros: Strong for discrete actions, well-understood, many improvements available
Cons: Discrete only, memory-intensive, slow convergence, base DQN now outdated
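
One of the "many improvements available" is Double Q-learning [13], which Rainbow includes. As a hedged illustration, the sketch below computes a Double DQN target for one transition; the function name `double_dqn_target` is hypothetical, and the Q-values are plain lists indexed by action rather than network outputs.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target for one transition.

    q_online_next: online-network Q-values for each action in s'
    q_target_next: target-network Q-values for each action in s'
    """
    # The online network selects the action...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...and the target network evaluates it, which reduces overestimation
    # compared with taking max(q_target_next) directly.
    return reward + gamma * (1.0 - done) * q_target_next[best_action]
```

For example, with `q_online_next=[1.0, 5.0, 2.0]` and `q_target_next=[10.0, 3.0, 7.0]`, the target bootstraps from `3.0` (action 1) rather than from the target network's own maximum of `10.0`.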

TD3

Pros: Simple, stable, slightly cheaper than SAC, good for deterministic control
Cons: No inherent exploration, generally slightly worse than SAC, noise tuning required
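
The "noise tuning required" caveat refers in part to target policy smoothing, one of TD3's three core techniques alongside clipped double Q-learning and delayed policy updates [21]. The sketch below shows the target computation for a one-dimensional action. The function name `td3_target`, the callable critics, and the default noise values are assumptions for illustration only.

```python
import random

def td3_target(reward, done, next_action, q1_target, q2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0,
               rng=random):
    """TD3 target for one transition with a scalar action.

    next_action:          target actor's action for s'
    q1_target, q2_target: callables mapping an action to a target Q-value
                          (stand-ins for target critics evaluated at s')
    """
    # Target policy smoothing: add clipped Gaussian noise to the target action.
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, noise_std)))
    smoothed = max(-max_action, min(max_action, next_action + noise))
    # Clipped double Q-learning: bootstrap from the smaller critic estimate.
    q_min = min(q1_target(smoothed), q2_target(smoothed))
    return reward + gamma * (1.0 - done) * q_min
```

Both `noise_std` and `noise_clip` are hyperparameters that typically need adjusting to the action scale of the task.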

GRPO

Pros: No critic needed (~33% memory savings), excels at reasoning tasks, works well with verifiable rewards
Cons: Expensive generation (G completions/prompt), noisy with small groups, on-policy
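
GRPO replaces the critic with a baseline computed from the G completions sampled for the same prompt [25]. A minimal sketch of that group-relative advantage follows; the function name `group_relative_advantages` is hypothetical, and the use of the population standard deviation and the `eps` value are assumptions (implementations differ on these details).

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its own group.

    rewards: scalar rewards for the G completions of one prompt
    eps:     small constant to avoid division by zero (an assumption)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some codebases use sample std
    return [(r - mean) / (std + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient. This is one reason small groups give a noisy or sparse learning signal.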

DPO

Pros: Extremely simple (supervised learning), no RL loop, cheap per iteration
Cons: Offline only (base), distribution shift, length exploitation, preference data expensive
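
The "supervised learning" description can be illustrated with the DPO loss for a single preference pair [32]. This sketch assumes the four sequence log-probabilities have already been computed; the function name `dpo_loss` and the default `beta=0.1` are illustrative choices rather than a reference implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy equals the reference model the margin is zero and the loss is `log(2)`; the loss falls as the policy shifts probability toward the chosen response.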

RLHF-PPO

Pros: Proven at scale (ChatGPT, Claude), flexible reward modeling
Cons: 4 models in memory, extreme complexity, being displaced by GRPO/DPO
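
The four models are typically the policy, a frozen reference model, the reward model, and the critic. One common way they interact is through a KL-shaped reward: each token is penalized for drifting from the reference model, and the reward model's score is added at the final token. The sketch below illustrates that scheme; the function name `kl_shaped_rewards` and the default `kl_coef` are assumptions, and individual implementations vary in how they estimate the KL term.

```python
def kl_shaped_rewards(policy_logps, ref_logps, rm_score, kl_coef=0.1):
    """Per-token rewards for one generated response.

    policy_logps: log-prob of each generated token under the policy
    ref_logps:    log-prob of the same tokens under the reference model
    rm_score:     scalar reward-model score for the whole response
    """
    # Per-token KL penalty keeps the policy close to the reference model.
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # The sequence-level reward-model score is credited to the last token.
    rewards[-1] += rm_score
    return rewards
```

PPO then uses the critic to turn these per-token rewards into advantage estimates, which is the part GRPO removes.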

Dreamer

Pros: Excellent sample efficiency (10-100× fewer environment steps than model-free methods), single hyperparameter set (V3), works from pixels
Cons: Compounding model errors, complex implementation, wall-clock can be worse
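
One ingredient that helps DreamerV3 use a single hyperparameter set across domains with very different reward scales is the symlog transform and its inverse, symexp [50]. The sketch below shows both functions on scalars; it illustrates the transform only, not the full algorithm.

```python
import math

def symlog(x):
    """Symmetric log: sign(x) * ln(|x| + 1). Compresses large magnitudes."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.expm1(abs(x)), x)
```

Targets such as rewards and returns can be predicted in symlog space and decoded with `symexp`, so a reward of 1000 and a reward of 1 end up within roughly one order of magnitude of each other.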

Decision Transformer

Pros: Conceptually elegant, leverages Transformer scaling, no TD instabilities
Cons: Cannot stitch trajectories, offline only, underperforms CQL/IQL on many tasks
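
Decision Transformer avoids TD learning by conditioning each action on the return-to-go, the sum of rewards from the current timestep to the end of the trajectory [56]. A minimal sketch of that preprocessing step follows; the function name `returns_to_go` is illustrative.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go for every timestep of one trajectory."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the final step backwards.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg
```

For example, `returns_to_go([1.0, 0.0, 2.0])` gives `[3.0, 2.0, 2.0]`. Because each value comes from a single logged trajectory, the model has no direct mechanism for combining good segments of different trajectories, which is the stitching limitation listed above.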

MAPPO

Pros: Surprisingly effective, simple (just PPO), scales to dozens of agents
Cons: Cooperative only, centralized critic scales poorly, credit assignment hard

When to Use Which

By Domain

Domain	Recommended	Alternative
Continuous control / robotics	SAC	TD3, PPO
Discrete actions / games	DQN/Rainbow	PPO
LLM reasoning (math, code)	GRPO	RLHF-PPO
LLM alignment (preferences)	DPO → GRPO	RLHF-PPO
Visual control from pixels	Dreamer	Rainbow, PPO
Multi-agent cooperative	MAPPO	QMIX
Offline data only	DT / CQL / IQL	DPO (for LLMs)
Sim-to-real robotics	PPO	SAC

By Constraint

Constraint	Best Choice	Why
Minimal env interactions	SAC or Dreamer	Off-policy / model-based reuse
Massive parallelism	PPO / MAPPO	Embarrassingly parallel rollouts
Limited memory	PPO	No replay buffer, small overhead
Simple implementation	DPO (LLM) / PPO (classic)	Fewest moving parts
No reward model available	DPO	Works from preference pairs directly
Verifiable rewards (math/code)	GRPO	Designed for programmatic rewards
Production deployment	PPO or SAC	Best framework support, most battle-tested

Decision Flowchart

Is this an LLM task?
├── Yes
│   ├── Do you have preference pairs? → DPO (+ Online DPO for better results)
│   ├── Do you have verifiable rewards? → GRPO
│   └── Do you need a learned reward model? → RLHF-PPO (or GRPO with an outcome reward model, ORM)
└── No
    ├── Is it multi-agent? → MAPPO
    ├── Discrete actions? → DQN/Rainbow
    ├── Continuous actions?
    │   ├── Need sample efficiency? → SAC (or Dreamer if from pixels)
    │   └── Need massive parallelism? → PPO
    └── Only offline data? → Decision Transformer / CQL / IQL
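
The flowchart can also be written as a small function, which makes the order of the checks explicit. This is a sketch under stated assumptions: the function name `recommend_algorithm` and its parameter names are hypothetical, and checking the offline-only case before the other non-LLM branches is an assumed ordering, since the flowchart lists those branches as siblings.

```python
def recommend_algorithm(is_llm_task,
                        has_preference_pairs=False,
                        has_verifiable_rewards=False,
                        is_multi_agent=False,
                        offline_only=False,
                        discrete_actions=False,
                        needs_sample_efficiency=False,
                        from_pixels=False):
    """Return the flowchart's suggestion as a string."""
    if is_llm_task:
        if has_preference_pairs:
            return "DPO"
        if has_verifiable_rewards:
            return "GRPO"
        return "RLHF-PPO"  # a learned reward model is needed
    # Assumed ordering: an offline-only dataset rules out the online methods.
    if offline_only:
        return "Decision Transformer / CQL / IQL"
    if is_multi_agent:
        return "MAPPO"
    if discrete_actions:
        return "DQN/Rainbow"
    # Continuous actions from here on.
    if needs_sample_efficiency:
        return "Dreamer" if from_pixels else "SAC"
    return "PPO"  # parallel rollouts, wall-clock over sample efficiency
```

For instance, `recommend_algorithm(False, needs_sample_efficiency=True, from_pixels=True)` returns `"Dreamer"`.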

The 2025-2026 Landscape Shift

GRPO has displaced PPO for LLM reasoning. DeepSeek-R1's success (Jan 2025) showed that critic-free RL with verifiable rewards can elicit emergent reasoning. By early 2026, GRPO, its variants (DAPO, Dr. GRPO), and the closely related RLOO are the default.
DPO remains dominant for preference alignment but increasingly combined with online methods. The SFT → DPO → GRPO pipeline is becoming standard.
PPO retains its position for classic RL (robotics, control, games). No algorithm has displaced it for general-purpose on-policy training.
SAC has largely overtaken TD3 as the default off-policy algorithm for continuous control, helped by its automatic entropy tuning.
DreamerV3 demonstrated that a single fixed hyperparameter configuration can work across diverse domains, advancing model-based RL from niche to practical.
Decision Transformer has been more influential conceptually (bridging LLM and RL) than practically. The "foundation model for control" vision is still developing.
Multi-agent RL is growing, with MAPPO as the go-to baseline that is surprisingly hard to beat.

References

[1]  Schulman et al., "Proximal Policy Optimization Algorithms," arXiv:1707.06347, 2017.
[2]  Schulman et al., "Trust Region Policy Optimization," ICML, 2015.
[3]  Huang et al., "The 37 Implementation Details of PPO," ICLR Blog, 2022.
[4]  Espeholt et al., "IMPALA," ICML, 2018.
[5]  Cobbe et al., "Phasic Policy Gradient," ICML, 2021.
[6]  Haarnoja et al., "Soft Actor-Critic," ICML, 2018.
[7]  Haarnoja et al., "SAC Algorithms and Applications," arXiv:1812.05905, 2018.
[8]  Chen et al., "REDQ," ICLR, 2021.
[9]  Hiraoka et al., "DroQ," ICLR, 2022.
[10] Bhatt et al., "CrossQ," ICLR, 2024.
[11] Mnih et al., "Human-level control through deep RL," Nature, 2015.
[12] Hessel et al., "Rainbow," AAAI, 2018.
[13] van Hasselt et al., "Double Q-learning," AAAI, 2016.
[14] Wang et al., "Dueling Networks," ICML, 2016.
[15] Bellemare et al., "Distributional RL," ICML, 2017.
[16] Schaul et al., "Prioritized Experience Replay," ICLR, 2016.
[17] Dabney et al., "QR-DQN," AAAI, 2018.
[18] Dabney et al., "IQN," ICML, 2018.
[19] Badia et al., "Agent57," ICML, 2020.
[20] Schwarzer et al., "BBF," ICML, 2023.
[21] Fujimoto et al., "TD3," ICML, 2018.
[22] Lillicrap et al., "DDPG," ICLR, 2016.
[23] Fujimoto et al., "TD7," NeurIPS, 2023.
[24] Silver et al., "DPG," ICML, 2014.
[25] Shao et al., "DeepSeekMath," arXiv:2402.03300, 2024.
[26] Guo et al., "DeepSeek-R1," arXiv:2501.12948, 2025.
[27] Lightman et al., "Let's Verify Step by Step," ICLR, 2024.
[28] Yu et al., "DAPO," arXiv, 2025.
[29] Liu et al., "Understanding R1-Zero-Like Training," arXiv, 2025.
[30] Ahmadian et al., "Back to Basics: REINFORCE for RLHF," ACL, 2024.
[31] Williams, "REINFORCE," Machine Learning, 1992.
[32] Rafailov et al., "DPO," NeurIPS, 2023.
[33] Tunstall et al., "Zephyr," arXiv:2310.16944, 2023.
[34] Ivison et al., "Tulu 2," arXiv:2311.10702, 2023.
[35] Azar et al., "IPO," AISTATS, 2024.
[36] Ethayarajh et al., "KTO," ICML, 2024.
[37] Hong et al., "ORPO," EMNLP, 2024.
[38] Meng et al., "SimPO," NeurIPS, 2024.
[39] Guo et al., "Online AI Feedback," arXiv:2402.04792, 2024.
[40] Liu et al., "RSO," ICLR, 2024.
[41] Wu et al., "SPPO," ICML, 2024.
[42] Munos et al., "Nash Learning from Human Feedback," ICML, 2024.
[43] Christiano et al., "Deep RL from Human Preferences," NeurIPS, 2017.
[44] Ouyang et al., "InstructGPT," NeurIPS, 2022.
[45] Stiennon et al., "Learning to summarize from human feedback," NeurIPS, 2020.
[46] Bai et al., "Constitutional AI," arXiv:2212.08073, 2022.
[47] Dong et al., "RAFT," TMLR, 2023.
[48] Hafner et al., "DreamerV1," ICLR, 2020.
[49] Hafner et al., "DreamerV2," ICLR, 2021.
[50] Hafner et al., "DreamerV3," arXiv:2301.04104, 2023.
[51] Hafner et al., "PlaNet," ICML, 2019.
[52] Hansen et al., "TD-MPC2," ICLR, 2024.
[53] Micheli et al., "IRIS," ICLR, 2023.
[54] Alonso et al., "DIAMOND," NeurIPS, 2024.
[55] Schrittwieser et al., "MuZero," Nature, 2020.
[56] Chen et al., "Decision Transformer," NeurIPS, 2021.
[57] Janner et al., "Trajectory Transformer," NeurIPS, 2021.
[58] Reed et al., "Gato," arXiv:2205.06175, 2022.
[59] Zheng et al., "Online Decision Transformer," ICML, 2022.
[60] Wu et al., "Elastic Decision Transformer," NeurIPS, 2023.
[61] Yamagata et al., "QDT," arXiv:2209.03993, 2022.
[62] Kumar et al., "CQL," NeurIPS, 2020.
[63] Kostrikov et al., "IQL," ICLR, 2022.
[64] Fujimoto et al., "BCQ," ICML, 2019.
[65] Yu et al., "MAPPO," NeurIPS, 2022.
[66] Rashid et al., "QMIX," ICML, 2018.
[67] Lowe et al., "MADDPG," NeurIPS, 2017.
[68] Wen et al., "MAT," NeurIPS, 2022.
[69] Kuba et al., "HAPPO," ICLR, 2022.

Survey compiled March 2026. Algorithm rankings reflect usage in research publications, open-source frameworks, and documented industry deployments through early 2026.