# CQL -- Conservative Q-Learning
## Intuition
CQL tackles the fundamental challenge of offline RL: Q-value overestimation on out-of-distribution actions. Without online interaction to correct errors, standard Q-learning can assign arbitrarily high values to state-action pairs never seen in the dataset. CQL adds a regularizer that pushes down Q-values for actions sampled from the current policy (which may be out-of-distribution) while pushing up Q-values for actions in the dataset. The result is a conservative lower bound on the true Q-function that prevents the policy from exploiting overestimated values.
## Key Equations
The CQL regularizer augments the standard Bellman loss:

\[
\mathcal{L}_{\text{CQL}}(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \log \sum_{a} \exp Q_\theta(s, a) \;-\; \mathbb{E}_{a \sim \mathcal{D}} \left[ Q_\theta(s, a) \right] \right]
\]
This encourages \(Q(s,a)\) to be high for dataset actions and low for out-of-distribution actions.
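As a concrete sketch of the regularizer, the penalty is a numerically stable log-sum-exp over Q-values at sampled actions minus the Q-value at the dataset action. This is a minimal NumPy illustration; `cql_penalty` and its argument shapes are hypothetical, not the rlox API:

```python
import numpy as np

def cql_penalty(q_sampled, q_data):
    """Conservative penalty: push down the log-sum-exp of Q over sampled
    (possibly out-of-distribution) actions, push up Q on dataset actions.

    q_sampled: (batch, n_actions) Q-values at sampled actions
    q_data:    (batch,) Q-values at the dataset actions
    """
    # Numerically stable log-sum-exp over the action axis.
    m = q_sampled.max(axis=1, keepdims=True)
    logsumexp = m[:, 0] + np.log(np.exp(q_sampled - m).sum(axis=1))
    return (logsumexp - q_data).mean()

# Toy check: when sampled actions score higher than dataset actions,
# the penalty is positive, so minimizing it lowers OOD Q-values.
q_sampled = np.array([[2.0, 3.0], [1.0, 4.0]])
q_data = np.array([1.0, 2.0])
penalty = cql_penalty(q_sampled, q_data)
```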
The full critic loss:

\[
\mathcal{L}(\theta) = \mathcal{L}_{\text{Bellman}}(\theta) + \alpha \, \mathcal{L}_{\text{CQL}}(\theta)
\]
The penalty's log-sum-exp term is estimated with importance sampling, using \(M\) uniform-random actions and \(M\) actions sampled from the current policy:

\[
\log \sum_a \exp Q(s, a) \approx \log \frac{1}{2M} \sum_{i=1}^{M} \left[ \frac{\exp Q(s, a_i^{\text{rand}})}{\mathrm{Unif}(a)} + \frac{\exp Q(s, a_i^{\pi})}{\pi(a_i^{\pi} \mid s)} \right]
\]
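A NumPy sketch of this importance-sampled estimator (function name and shapes are illustrative; it assumes actions live in \([-1, 1]^{d}\), so the uniform proposal density is \(0.5^{d}\)):

```python
import numpy as np

def cql_logsumexp_estimate(q_rand, q_pi, log_pi, act_dim):
    """Importance-sampled estimate of log sum_a exp Q(s, a).

    q_rand: (batch, M) Q at uniform-random actions in [-1, 1]^act_dim
    q_pi:   (batch, M) Q at actions sampled from the current policy
    log_pi: (batch, M) log-probabilities of those policy actions
    """
    # Density of Uniform([-1, 1]^act_dim) is 0.5**act_dim.
    log_unif = act_dim * np.log(0.5)
    # Importance weights exp(Q) / proposal density, kept in log space.
    terms = np.concatenate([q_rand - log_unif, q_pi - log_pi], axis=1)
    # Stable log of the mean of exp(terms) over all 2M samples.
    m = terms.max(axis=1, keepdims=True)
    n = terms.shape[1]
    return m[:, 0] + np.log(np.exp(terms - m).sum(axis=1) / n)
```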
Optionally, \(\alpha\) can be auto-tuned via a Lagrangian against a target penalty value \(\tau\):

\[
\min_{Q} \max_{\alpha \ge 0} \; \alpha \left( \mathcal{L}_{\text{CQL}}(\theta) - \tau \right) + \mathcal{L}_{\text{Bellman}}(\theta)
\]
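A minimal sketch of the dual step, assuming \(\alpha\) is parameterized as \(\exp(\log \alpha)\) to stay positive (names and the learning rate are illustrative, not the rlox API):

```python
import numpy as np

def update_cql_alpha(log_alpha, cql_penalty, target_value, lr=1e-3):
    """One dual gradient step on the CQL Lagrange multiplier.

    alpha = exp(log_alpha) stays positive; descending the loss
    alpha * (target_value - cql_penalty) grows alpha when the penalty
    exceeds the target and shrinks it otherwise.
    """
    alpha = np.exp(log_alpha)
    # d/d(log_alpha) of alpha * (target - penalty) = alpha * (target - penalty)
    grad = alpha * (target_value - cql_penalty)
    return log_alpha - lr * grad

log_alpha = np.log(5.0)
# Penalty above target -> alpha increases on the next step.
new_log_alpha = update_cql_alpha(log_alpha, cql_penalty=2.0, target_value=-1.0)
```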
## Pseudocode
```
algorithm CQL:
    initialize SAC-style actor pi, twin critics Q1, Q2, target critics
    load offline dataset D
    for update = 1 to n_updates do
        sample minibatch from D

        # Critic update
        L_bellman = standard SAC Bellman loss
        L_CQL = logsumexp(concat(Q_random, Q_policy)) - Q_data
        update critics with L_bellman + alpha * L_CQL

        # (Optional) auto-tune alpha via the Lagrangian
        alpha_loss = alpha * (target_value - L_CQL)
        update alpha

        # Actor update (SAC-style)
        update pi to maximize Q - alpha_ent * log pi

        # Entropy alpha update (SAC-style)
        # Soft target update
```
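The critic-update lines above can be sketched as a single loss computation. This is a NumPy stand-in with hypothetical names; a real implementation would differentiate this through the critics, and the entropy term in the backup is omitted for brevity:

```python
import numpy as np

def critic_loss(q_data, q_target, q_rand, q_pi, reward, done,
                alpha=5.0, gamma=0.99):
    """Critic loss from the pseudocode: Bellman MSE plus the CQL term.

    q_data:       (batch,) Q(s, a) at dataset actions
    q_target:     (batch,) target-network Q(s', a') at next actions
    q_rand, q_pi: (batch, M) Q at random / policy-sampled actions
    """
    # One-step Bellman backup (entropy bonus omitted for brevity).
    backup = reward + gamma * (1.0 - done) * q_target
    l_bellman = np.mean((q_data - backup) ** 2)

    # CQL term: log-sum-exp over OOD actions minus dataset Q.
    cat = np.concatenate([q_rand, q_pi], axis=1)
    m = cat.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(cat - m).sum(axis=1))
    l_cql = np.mean(lse - q_data)

    return l_bellman + alpha * l_cql
```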
## Quick Start
CQL uses the offline algorithm interface with a pre-loaded dataset:
```python
from rlox.offline import OfflineDatasetBuffer
from rlox.algorithms.cql import CQL

dataset = OfflineDatasetBuffer.from_d4rl("halfcheetah-medium-v2")

agent = CQL(
    dataset=dataset,
    obs_dim=17,
    act_dim=6,
    cql_alpha=5.0,
    batch_size=256,
)

metrics = agent.train(n_updates=100_000)
```
## Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `cql_alpha` | 5.0 | CQL penalty weight |
| `n_random_actions` | 10 | Random actions for CQL penalty estimation |
| `auto_alpha` | False | Auto-tune CQL alpha via Lagrangian |
| `cql_target_value` | -1.0 | Target value for Lagrangian alpha tuning |
| `hidden` | 256 | Hidden layer width |
| `learning_rate` | 3e-4 | Learning rate |
| `tau` | 0.005 | Soft target update rate |
| `gamma` | 0.99 | Discount factor |
| `batch_size` | 256 | Minibatch size |
| `auto_entropy` | True | Auto-tune SAC entropy alpha |
| `target_entropy` | -act_dim | Target entropy for SAC |
## When to Use
- Use CQL when: you have a fixed offline dataset and need a principled conservative estimate of Q-values to avoid overestimation.
- Prefer CQL over BC when: the offline data is sub-optimal and you need to improve beyond the behavior policy.
- Do not use CQL when: the conservatism is too aggressive for your data distribution (try Cal-QL for adaptive conservatism), or you have expert-quality data (prefer BC or IQL).
## References
- Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS 2020. arXiv:2006.04779.