DPO — Direct Preference Optimization¶

Rafailov et al., "Direct Preference Optimization: Your Language Model is Secretly a Reward Model," NeurIPS, 2023.

Key Idea¶

DPO shows that the optimal policy under the KL-constrained RLHF objective can be expressed in closed form as a function of the reward. Inverting this relationship, DPO reparameterizes the reward in terms of the policy itself, eliminating explicit reward model training or RL optimization. The result is a simple supervised-learning-style loss on preference pairs (chosen vs. rejected). This simplifies the RLHF pipeline from a multi-stage process to a single training step.

Mathematical Formulation¶

Starting from the RLHF objective:

max_π E_{x~D, y~π(·|x)} [ r(x,y) ] - β · D_KL(π || π_ref)

Optimal policy form:

π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y) / β)

Reparameterized reward:

r(x,y) = β · log(π_θ(y|x) / π_ref(y|x)) + β · log Z(x)

DPO loss (substituting into Bradley-Terry model):

L_DPO(θ) = -E_{(x, y_w, y_l)~D} [ log σ(
  β · log(π_θ(y_w|x) / π_ref(y_w|x))
  - β · log(π_θ(y_l|x) / π_ref(y_l|x))
) ]

Where y_w = preferred, y_l = dispreferred, σ = sigmoid.

Implicit reward:

r_θ(x,y) = β · log(π_θ(y|x) / π_ref(y|x))

Properties¶

Offline (trains on fixed preference dataset)
No RL loop, no reward model, no value function
Supervised learning formulation
Requires paired preference data (y_w, y_l per prompt)

Key Hyperparameters¶

Parameter	Typical Value	Notes
`β`	0.1–0.5	KL regularization strength
Learning rate	1e-6 to 5e-7	Very low to prevent forgetting
Batch size	32–128	Preference pairs
Epochs	1–3	Overfitting risk with more
Max seq length	512–2048	Task dependent
Label smoothing	0.0–0.1	Optional regularization

Complexity¶

Time: O(B × L × C_forward) — two forward passes per sample (chosen + rejected)
Memory: Policy + reference model (frozen). No critic, no reward model
Reference model can be offloaded or computed via LoRA difference
Much cheaper per iteration than PPO or GRPO (no generation)

Primary Use Cases¶

LLM alignment from human preferences (the "simpler RLHF")
Instruction following (Zephyr, Tulu)
Safety and harmlessness training
Image generation (diffusion model alignment)

Known Limitations¶

Offline only (base form) — cannot explore beyond dataset distribution
Distribution shift — implicit reward unreliable when policy drifts far
Length exploitation — can learn to prefer longer outputs
Cannot use verifiable rewards directly (unlike GRPO)
Quality ceiling imposed by preference data
Preference data expensive to collect at scale
Bradley-Terry assumption may not hold for complex preferences
Decreases entropy/diversity more than PPO-based methods

Major Variants¶

Variant	Reference	Key Change
IPO	Azar et al., AISTATS 2024	Squared loss, robust to noise
KTO	Ethayarajh et al., ICML 2024	Binary feedback only (no pairs)
ORPO	Hong et al., EMNLP 2024	SFT + alignment in one loss, no ref model
SimPO	Meng et al., NeurIPS 2024	Avg log-prob reward, no ref model
Online DPO	Guo et al., 2024	Generate new pairs with current policy
RSO	Liu et al., ICLR 2024	Rejection sampling for on-policy data
SPPO	Wu et al., ICML 2024	Self-play preference optimization

Relationship to Other Algorithms¶

Derived from same objective as RLHF-PPO — different optimization path, same goal
Competes with GRPO: DPO=offline+preferences; GRPO=online+rewards
Can be combined: SFT → DPO → GRPO is an emerging pipeline
Online DPO variants blur the line between DPO and GRPO

Industry Deployment¶

HuggingFace: Zephyr models, TRL library
Meta: Llama fine-tuning recipes
AI2: Tulu models
Anthropic: Explored alongside RLHF
Default choice for alignment when paired preference data is available
As of 2026, increasingly combined with online methods