Skip to content

Cal-QL -- Calibrated Conservative Q-Learning

Intuition

Cal-QL addresses a key limitation of CQL: vanilla CQL applies the same conservative penalty regardless of how close an action is to the training distribution, leading to overly pessimistic Q-values for near-distribution actions. Cal-QL introduces a calibration mechanism that scales the conservative penalty based on the gap between current Q-values and an empirical threshold derived from offline returns. When Q-values are already well-calibrated (close to actual returns), the penalty is reduced. This makes Cal-QL particularly effective for offline pre-training followed by online fine-tuning.

Key Equations

The standard CQL penalty pushes down Q-values for out-of-distribution actions:

\[ L_{\text{CQL}} = \alpha \left( \log \sum_a \exp Q(s,a) - \mathbb{E}_{a \sim \mathcal{D}}[Q(s,a)] \right) \]

Cal-QL adds a calibrated scaling factor:

\[ L_{\text{cal}} = \max\!\left(Q(s,a) - Q_{\text{cal}},\; 0\right) \cdot \tau_{\text{cal}} \cdot \left(Q(s,a) - Q_{\text{data}}(s,a)\right) \]

where \(Q_{\text{cal}}\) is the \(\tau\)-quantile of observed episode returns.

The full critic loss combines Bellman error, CQL penalty, and calibrated penalty:

\[ L_{\text{critic}} = L_{\text{Bellman}} + \alpha \left(L_{\text{CQL}} + L_{\text{cal}}\right) \]

The actor follows SAC-style entropy-regularized policy optimization:

\[ L_{\text{actor}} = \mathbb{E}_{a \sim \pi} \left[ \alpha_{\text{ent}} \log \pi(a|s) - \min(Q_1(s,a), Q_2(s,a)) \right] \]

Pseudocode

algorithm Cal-QL:
    initialize SAC-style actor pi, twin critics Q1, Q2, target networks
    initialize calibration threshold Q_cal = 0
    initialize return buffer R

    for step = 1 to total_timesteps do
        collect transition (s, a, r, s', done)
        if episode ends: append episode_return to R
                         Q_cal = quantile(R, calibration_tau)

        sample minibatch from replay buffer

        # Critic update
        L_bellman = standard SAC Bellman loss with twin critics
        L_CQL = logsumexp(Q_random + Q_policy) - Q_data
        L_cal = max(Q - Q_cal, 0) * calibration_tau * (Q - Q_data)
        update critics with L_bellman + cql_alpha * (L_CQL + L_cal)

        # Actor update (SAC-style)
        update actor to maximize Q - alpha_ent * log pi

        # Soft target update
        polyak_update(Q_targets, tau)

Quick Start

from rlox import Trainer

trainer = Trainer("calql", env="HalfCheetah-v4", seed=42, config={
    "cql_alpha": 5.0,
    "calibration_tau": 0.5,
    "learning_rate": 3e-4,
})
metrics = trainer.train(total_timesteps=500_000)
print(f"Mean reward: {metrics['mean_reward']:.1f}")

Hyperparameters

Parameter Default Description
learning_rate 3e-4 Learning rate for actor, critic, and alpha optimizers
buffer_size 100_000 Replay buffer capacity
batch_size 256 Minibatch size
gamma 0.99 Discount factor
tau 0.005 Polyak averaging coefficient for target networks
cql_alpha 5.0 CQL penalty weight
calibration_tau 0.5 Quantile for calibration threshold from offline returns
auto_alpha False Auto-tune CQL alpha via dual gradient descent
hidden 256 Hidden layer width
n_random_actions 10 Random actions for CQL penalty estimation
warmup_steps 1000 Random exploration steps before training
seed 42 Random seed

When to Use

  • Use Cal-QL when: you want offline pre-training with online fine-tuning, or vanilla CQL is too pessimistic on your dataset.
  • Prefer Cal-QL over CQL when: your offline data quality varies and you need adaptive conservatism.
  • Do not use Cal-QL when: you have high-quality expert data (prefer BC or IQL for simplicity), or you are doing purely online training (prefer SAC).

References

  • Nakamoto, N., Zhai, S., Singh, A., Mark, M. S., Ma, Y., Finn, C., Kumar, A., & Levine, S. (2023). Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning. NeurIPS 2023. arXiv:2303.05479.