Skip to content

IQL -- Implicit Q-Learning

Intuition

IQL avoids the main pitfall of offline RL -- querying Q-values for out-of-distribution actions -- by never evaluating Q on actions outside the dataset. Instead of using \(\max_a Q(s,a)\) as the target (which requires querying unseen actions), IQL fits a separate value function \(V(s)\) using expectile regression on Q-values of dataset actions. A high expectile (\(\tau > 0.5\)) biases \(V\) toward the upper quantiles of \(Q(s,a)\) without ever computing Q for actions not in the data. This makes IQL remarkably simple and avoids the need for explicit conservatism penalties.

Key Equations

Value function via expectile regression:

\[ L_V(\psi) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ L_\tau^2\!\left(Q_{\bar\phi}(s,a) - V_\psi(s)\right) \right] \]

where the asymmetric loss is:

\[ L_\tau^2(u) = |\tau - \mathbf{1}(u < 0)| \cdot u^2 \]

Q-function with V as the bootstrap target (no max over actions):

\[ L_Q(\phi) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left(Q_\phi(s,a) - r - \gamma V_\psi(s')\right)^2 \right] \]

Actor extraction via advantage-weighted regression:

\[ L_\pi(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \exp\!\left(\beta \cdot (Q_{\bar\phi}(s,a) - V_\psi(s))\right) \left\| \pi_\theta(s) - a \right\|^2 \right] \]

Pseudocode

algorithm IQL:
    initialize Q1, Q2 (with targets), V, actor pi
    load offline dataset D

    for update = 1 to n_updates do
        sample minibatch from D

        # Value update (expectile regression)
        q_target = min(Q1_target(s,a), Q2_target(s,a))
        L_V = expectile_loss(V(s), q_target, tau)
        update V

        # Q-function update (Bellman with V target)
        target = r + gamma * (1 - done) * V(s')
        L_Q = MSE(Q1(s,a), target) + MSE(Q2(s,a), target)
        update Q1, Q2

        # Soft target update for Q
        polyak_update(Q1_target, Q2_target)

        # Actor update (advantage-weighted regression)
        advantage = q_target - V(s)
        weights = clamp(exp(beta * advantage), max=100)
        L_pi = mean(weights * MSE(pi(s), a))
        update pi

Quick Start

IQL uses the offline algorithm interface:

from rlox.offline import OfflineDatasetBuffer
from rlox.algorithms.iql import IQL

dataset = OfflineDatasetBuffer.from_d4rl("halfcheetah-medium-v2")
agent = IQL(
    dataset=dataset,
    obs_dim=17,
    act_dim=6,
    expectile=0.7,
    temperature=3.0,
)
metrics = agent.train(n_updates=100_000)

Hyperparameters

Parameter Default Description
expectile 0.7 Expectile \(\tau\) for value function regression (higher = more optimistic)
temperature 3.0 Inverse temperature \(\beta\) for advantage-weighted actor extraction (higher = sharper weighting)
hidden 256 Hidden layer width
learning_rate 3e-4 Learning rate for all networks
tau 0.005 Soft target update rate
gamma 0.99 Discount factor
batch_size 256 Minibatch size

When to Use

  • Use IQL when: you need a simple, stable offline RL algorithm that avoids querying out-of-distribution actions entirely.
  • Prefer IQL over CQL when: CQL's conservatism is too aggressive, or you want fewer hyperparameters to tune.
  • Do not use IQL when: you need to significantly improve beyond the dataset quality (CQL or Cal-QL may be more aggressive in stitching), or you need online fine-tuning (prefer Cal-QL).

References

  • Kostrikov, I., Nair, A., & Levine, S. (2022). Offline Reinforcement Learning with Implicit Q-Learning. ICLR 2022. arXiv:2110.06169.