IQL -- Implicit Q-Learning¶
Intuition¶
IQL avoids the main pitfall of offline RL -- querying Q-values for out-of-distribution actions -- by never evaluating Q on actions outside the dataset. Instead of using \(\max_a Q(s,a)\) as the target (which requires querying unseen actions), IQL fits a separate value function \(V(s)\) using expectile regression on Q-values of dataset actions. A high expectile (\(\tau > 0.5\)) biases \(V\) toward the upper quantiles of \(Q(s,a)\) without ever computing Q for actions not in the data. This makes IQL remarkably simple and avoids the need for explicit conservatism penalties.
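To see concretely how a high expectile biases \(V\) toward the upper range of Q-values, here is a small numeric sketch in plain Python (illustrative only, not part of rlox). It solves for the \(\tau\)-expectile of a handful of hypothetical Q-values by gradient descent on the asymmetric squared loss:

```python
def expectile(xs, tau, iters=200, lr=0.5):
    """Numerically solve for the tau-expectile: the value v minimizing
    the asymmetric squared loss |tau - 1(x < v)| * (x - v)^2 over samples."""
    v = sum(xs) / len(xs)  # start at the mean (which is the 0.5-expectile)
    for _ in range(iters):
        # Gradient of the asymmetric loss w.r.t. v: deviations above v are
        # weighted by tau, deviations below v by (1 - tau).
        g = sum((tau if x > v else 1.0 - tau) * (v - x) for x in xs) / len(xs)
        v -= lr * g
    return v

# Hypothetical Q-values of the dataset actions at one state.
qs = [0.0, 1.0, 2.0, 10.0]

expectile(qs, 0.5)  # = 3.25, the mean
expectile(qs, 0.9)  # = 7.75, between the mean (3.25) and the max (10.0)
```

With \(\tau = 0.9\) the estimate sits well above the mean but below the maximum, which is exactly the behavior IQL relies on: \(V(s)\) approximates a near-maximal Q-value while only ever looking at actions that appear in the dataset.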
Key Equations¶
Value function via expectile regression:

\[
L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ L_2^\tau\big(Q_{\hat\theta}(s,a) - V_\psi(s)\big) \right]
\]

where the asymmetric loss is:

\[
L_2^\tau(u) = \lvert \tau - \mathbb{1}(u < 0) \rvert \, u^2
\]

Q-function with V as the bootstrap target (no max over actions):

\[
L_Q(\theta) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\left[ \big(r + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2 \right]
\]

Actor extraction via advantage-weighted regression:

\[
L_\pi(\phi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ \exp\big(\beta\,(Q_{\hat\theta}(s,a) - V_\psi(s))\big) \log \pi_\phi(a \mid s) \right]
\]
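A useful sanity check on the expectile objective, noted in the IQL paper: as \(\tau \to 1\), the expectile approaches a maximum over actions in the support of the behavior policy, so the value target implicitly recovers the Bellman optimality backup restricted to dataset actions:

\[
\lim_{\tau \to 1} V_\psi(s) = \max_{\substack{a \,:\, \pi_\beta(a \mid s) > 0}} Q_{\hat\theta}(s,a)
\]

In practice \(\tau\) is kept below 1 (e.g. 0.7) because the limit becomes increasingly sensitive to noise in the Q-estimates.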
Pseudocode¶
algorithm IQL:
    initialize Q1, Q2 (with targets), V, actor pi
    load offline dataset D

    for update = 1 to n_updates do
        sample minibatch (s, a, r, s', done) from D

        # Value update (expectile regression on dataset actions)
        q_target = min(Q1_target(s,a), Q2_target(s,a))
        L_V = expectile_loss(V(s), q_target, tau)
        update V

        # Q-function update (Bellman backup with V as target; no max over actions)
        target = r + gamma * (1 - done) * V(s')
        L_Q = MSE(Q1(s,a), target) + MSE(Q2(s,a), target)
        update Q1, Q2

        # Soft target update for Q
        polyak_update(Q1_target, Q2_target)

        # Actor update (advantage-weighted regression)
        advantage = q_target - V(s)
        weights = clamp(exp(beta * advantage), max=100)
        L_pi = mean(weights * MSE(pi(s), a))
        update pi
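The two non-standard operations in the pseudocode — `expectile_loss` and the clamped exponential weights — can be sketched in a few lines of NumPy. This is a minimal illustration of those steps, not the rlox implementation; names and signatures are chosen for clarity:

```python
import numpy as np

def expectile_loss(v, q, tau):
    """Asymmetric squared error between V(s) and the Q target.

    Residuals where q > v (V under-estimates) are weighted by tau,
    residuals where q < v by (1 - tau). With tau > 0.5 this pushes
    V(s) toward the upper expectiles of Q(s, a) over dataset actions.
    """
    u = q - v
    weight = np.where(u > 0, tau, 1.0 - tau)
    return np.mean(weight * u ** 2)

def awr_weights(advantage, beta, clip=100.0):
    """Exponentiated advantages for the actor regression, clamped to
    keep a few very large advantages from dominating the minibatch."""
    return np.minimum(np.exp(beta * advantage), clip)

# Asymmetry in action: the same-magnitude residual is penalized more
# when V under-estimates the Q target (tau = 0.7).
expectile_loss(0.0, np.array([1.0]), 0.7)  # 0.7
expectile_loss(1.0, np.array([0.0]), 0.7)  # 0.3
```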
Quick Start¶
IQL uses the offline algorithm interface:
from rlox.offline import OfflineDatasetBuffer
from rlox.algorithms.iql import IQL

dataset = OfflineDatasetBuffer.from_d4rl("halfcheetah-medium-v2")

agent = IQL(
    dataset=dataset,
    obs_dim=17,
    act_dim=6,
    expectile=0.7,
    temperature=3.0,
)

metrics = agent.train(n_updates=100_000)
Hyperparameters¶
| Parameter | Default | Description |
|---|---|---|
| `expectile` | 0.7 | Expectile \(\tau\) for value function regression (higher = more optimistic) |
| `temperature` | 3.0 | Inverse temperature \(\beta\) for advantage-weighted actor extraction (higher = sharper weighting) |
| `hidden` | 256 | Hidden layer width |
| `learning_rate` | 3e-4 | Learning rate for all networks |
| `tau` | 0.005 | Soft update rate for the Q target networks |
| `gamma` | 0.99 | Discount factor |
| `batch_size` | 256 | Minibatch size |
When to Use¶
- Use IQL when: you need a simple, stable offline RL algorithm that avoids querying out-of-distribution actions entirely.
- Prefer IQL over CQL when: CQL's conservatism is too aggressive, or you want fewer hyperparameters to tune.
- Do not use IQL when: you need a policy that improves substantially beyond the best behavior in the dataset (CQL or Cal-QL can be more aggressive at stitching trajectories), or you need online fine-tuning (prefer Cal-QL).
References¶
- Kostrikov, I., Nair, A., & Levine, S. (2022). Offline Reinforcement Learning with Implicit Q-Learning. ICLR 2022. arXiv:2110.06169.