Cal-QL -- Calibrated Conservative Q-Learning¶
Intuition¶
Cal-QL addresses a key limitation of CQL: vanilla CQL applies the same conservative penalty regardless of how close an action is to the training distribution, leading to overly pessimistic Q-values for near-distribution actions. Cal-QL introduces a calibration mechanism that scales the conservative penalty based on the gap between current Q-values and an empirical threshold derived from offline returns. When Q-values are already well-calibrated (close to actual returns), the penalty is reduced. This makes Cal-QL particularly effective for offline pre-training followed by online fine-tuning.
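The calibration idea can be sketched in a few lines of NumPy. This is an illustration of the mechanism only, not the rlox implementation; the array values and the helper name `calibrated_penalty_weight` are hypothetical:

```python
import numpy as np

def calibrated_penalty_weight(q_policy, q_cal):
    """Extra conservatism only where Q-values exceed the threshold
    derived from observed returns; zero where Q is already calibrated."""
    return np.maximum(q_policy - q_cal, 0.0)

q_policy = np.array([50.0, 120.0, 80.0])  # critic estimates for policy actions (toy values)
q_cal = 100.0                             # quantile of observed episode returns (toy value)

weights = calibrated_penalty_weight(q_policy, q_cal)
# Only the overestimated Q-value (120 > 100) receives an extra penalty;
# the two calibrated ones get weight 0.
```

The `max(... , 0)` clamp is what makes the conservatism adaptive: well-calibrated Q-values pass through unpenalized.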
Key Equations¶
The standard CQL penalty pushes down Q-values for out-of-distribution actions while pushing up Q-values on dataset actions:

\[ \mathcal{L}_{\text{CQL}} = \mathbb{E}_{s \sim \mathcal{D}} \Big[ \log \sum_{a} \exp Q_\theta(s, a) \Big] - \mathbb{E}_{(s, a) \sim \mathcal{D}} \big[ Q_\theta(s, a) \big] \]

Cal-QL adds a calibrated scaling factor that activates only when Q-values exceed the calibration threshold:

\[ \mathcal{L}_{\text{cal}} = \max\!\big(Q_\theta(s, a) - Q_{\text{cal}},\, 0\big) \cdot \tau \cdot \big(Q_\theta(s, a) - Q_\theta(s, a_{\mathcal{D}})\big) \]

where \(Q_{\text{cal}}\) is the \(\tau\)-quantile of observed episode returns and \(a_{\mathcal{D}}\) denotes the dataset action.

The full critic loss combines the Bellman error, the CQL penalty, and the calibrated penalty:

\[ \mathcal{L}_{\text{critic}} = \mathcal{L}_{\text{Bellman}} + \alpha_{\text{CQL}} \big( \mathcal{L}_{\text{CQL}} + \mathcal{L}_{\text{cal}} \big) \]

The actor follows SAC-style entropy-regularized policy optimization:

\[ \mathcal{L}_{\pi} = \mathbb{E}_{a \sim \pi(\cdot \mid s)} \big[ \alpha_{\text{ent}} \log \pi(a \mid s) - \min_{i \in \{1, 2\}} Q_i(s, a) \big] \]
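The threshold \(Q_{\text{cal}}\) is just a quantile over a buffer of episode returns; a NumPy sketch (the return values are illustrative):

```python
import numpy as np

episode_returns = np.array([80.0, 95.0, 100.0, 110.0, 140.0])  # toy return buffer
calibration_tau = 0.5

# The tau-quantile of observed returns serves as the calibration threshold
q_cal = np.quantile(episode_returns, calibration_tau)
# For tau = 0.5 this is simply the median of the observed returns.
```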
Pseudocode¶
algorithm Cal-QL:
    initialize SAC-style actor pi, twin critics Q1, Q2, target networks
    initialize calibration threshold Q_cal = 0
    initialize return buffer R
    for step = 1 to total_timesteps do
        collect transition (s, a, r, s', done)
        if episode ends: append episode_return to R
        Q_cal = quantile(R, calibration_tau)
        sample minibatch from replay buffer
        # Critic update
        L_bellman = standard SAC Bellman loss with twin critics
        L_CQL = logsumexp(Q_random + Q_policy) - Q_data
        L_cal = max(Q - Q_cal, 0) * calibration_tau * (Q - Q_data)
        update critics with L_bellman + cql_alpha * (L_CQL + L_cal)
        # Actor update (SAC-style)
        update actor to maximize Q - alpha_ent * log pi
        # Soft target update
        polyak_update(Q_targets, tau)
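The critic-update lines above can be traced numerically with a toy, single-state NumPy sketch. All Q-values here are assumed numbers chosen for illustration, not outputs of the rlox critics:

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log-sum-exp over a 1-D array."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# Toy Q-values for a single state (assumed, for illustration)
q_random = np.array([1.0, 2.0])   # Q at uniformly sampled random actions
q_policy = np.array([3.0, 4.0])   # Q at current-policy actions
q_data = 2.5                      # Q at the dataset action
q_cal = 3.5                       # calibration threshold (return quantile)
calibration_tau = 0.5
cql_alpha = 5.0

# CQL penalty: push down random/policy actions, push up the dataset action
l_cql = logsumexp(np.concatenate([q_random, q_policy])) - q_data

# Calibrated term: active only where Q exceeds the threshold
q_pi = q_policy.mean()
l_cal = max(q_pi - q_cal, 0.0) * calibration_tau * (q_pi - q_data)

# Conservative part of the critic loss (Bellman error omitted in this toy)
penalty = cql_alpha * (l_cql + l_cal)
```

In this example the mean policy Q-value (3.5) sits exactly at the threshold, so the calibrated term vanishes and only the plain CQL penalty remains.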
Quick Start¶
from rlox import Trainer

trainer = Trainer("calql", env="HalfCheetah-v4", seed=42, config={
    "cql_alpha": 5.0,
    "calibration_tau": 0.5,
    "learning_rate": 3e-4,
})
metrics = trainer.train(total_timesteps=500_000)
print(f"Mean reward: {metrics['mean_reward']:.1f}")
Hyperparameters¶
| Parameter | Default | Description |
|---|---|---|
| learning_rate | 3e-4 | Learning rate for actor, critic, and alpha optimizers |
| buffer_size | 100_000 | Replay buffer capacity |
| batch_size | 256 | Minibatch size |
| gamma | 0.99 | Discount factor |
| tau | 0.005 | Polyak averaging coefficient for target networks |
| cql_alpha | 5.0 | CQL penalty weight |
| calibration_tau | 0.5 | Quantile for calibration threshold from offline returns |
| auto_alpha | False | Auto-tune CQL alpha via dual gradient descent |
| hidden | 256 | Hidden layer width |
| n_random_actions | 10 | Random actions for CQL penalty estimation |
| warmup_steps | 1000 | Random exploration steps before training |
| seed | 42 | Random seed |
When to Use¶
- Use Cal-QL when: you want offline pre-training with online fine-tuning, or vanilla CQL is too pessimistic on your dataset.
- Prefer Cal-QL over CQL when: your offline data quality varies and you need adaptive conservatism.
- Do not use Cal-QL when: you have high-quality expert data (prefer BC or IQL for simplicity), or you are doing purely online training (prefer SAC).
References¶
- Nakamoto, M., Zhai, Y., Singh, A., Mark, M. S., Ma, Y., Finn, C., Kumar, A., & Levine, S. (2023). Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning. NeurIPS 2023. arXiv:2303.05479.