Decision Transformer / Offline RL¶
Chen et al., "Decision Transformer: Reinforcement Learning via Sequence Modeling," NeurIPS, 2021. Janner et al., "Offline Reinforcement Learning as One Big Sequence Modeling Problem," NeurIPS, 2021.
Key Idea¶
Decision Transformer reframes RL as a sequence-modeling problem. Instead of learning value functions or policy gradients, it uses a causal (GPT-style) Transformer to predict actions conditioned on desired returns, past states, and past actions. At test time, conditioning on a high return-to-go elicits high-performing behavior. This leverages the expressive power of Transformers and sidesteps the instabilities of TD learning, though it requires a pre-collected offline dataset.
Mathematical Formulation¶
Input sequence (context window of length K):
τ = (R̂_1, s_1, a_1, R̂_2, s_2, a_2, ..., R̂_T, s_T, a_T)
where R̂_t = Σ_{k=t}^{T} r_k (return-to-go)
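The return-to-go is a reverse cumulative sum over the reward sequence. A minimal sketch (the function name and the use of NumPy are illustrative, not from the paper):

```python
import numpy as np

def returns_to_go(rewards):
    """Compute R-hat_t = sum_{k=t}^{T} r_k for every timestep t."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Reverse cumulative sum: flip, cumsum, flip back.
    return np.flip(np.cumsum(np.flip(rewards)))
```

For example, rewards `[1, 0, 2]` give returns-to-go `[3, 2, 2]`: each entry is the total reward from that step to the end of the trajectory.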
Training objective (autoregressive on actions):
L(θ) = Σ_t ||a_t − π_θ(R̂_t, s_t, τ_{<t})||²   (MSE for continuous actions; cross-entropy for discrete actions)
At inference:
Condition on a target return R̂_1 = R_target; after each environment step, decrement the return-to-go by the observed reward: R̂_{t+1} = R̂_t − r_t.
The model applies a GPT-style Transformer with causal masking over interleaved (R̂, s, a) tokens.
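The inference procedure above can be sketched as a simple control loop. The `model(rtgs, states, actions)` and Gym-style `env` interfaces here are placeholders for illustration, not a real API:

```python
# Sketch of the Decision Transformer control loop (assumed interfaces:
# `model` maps the (R-hat, s, a) history to the next action; `env` is a
# Gym-style environment returning (state, reward, done)).
def rollout(model, env, target_return, max_steps=1000):
    rtgs = [float(target_return)]     # condition on the desired return
    states = [env.reset()]
    actions = []
    total_reward = 0.0
    for _ in range(max_steps):
        # Predict the next action from the interleaved history.
        action = model(rtgs, states, actions)
        state, reward, done = env.step(action)
        actions.append(action)
        total_reward += reward
        if done:
            break
        # Decrement the return-to-go by the reward just received.
        rtgs.append(rtgs[-1] - reward)
        states.append(state)
    return total_reward
```

In practice the history is truncated to the last K triples before being fed to the Transformer, matching the context length used at training time.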
Properties¶
- Offline (trains on pre-collected dataset)
- Neither value-based nor policy-gradient — sequence modeling
- Model-free (though the Transformer implicitly models dynamics)
- Return-conditioned: desired return acts as a "goal"
Key Hyperparameters¶
| Parameter | Typical Value | Notes |
|---|---|---|
| Context length K | 20 | (R̂,s,a) triples |
| Transformer layers | 3–6 | GPT-2 style |
| Attention heads | 4–8 | |
| Embedding dim | 128–256 | |
| Learning rate | 1e-4 (Adam) | |
| Batch size | 64 | Sequences |
| Dropout | 0.1 | |
| Return-to-go (test) | Max in dataset | Or higher for extrapolation |
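The typical values in the table can be gathered into a single config; a sketch (key names are illustrative, not from any specific codebase):

```python
# Typical Decision Transformer hyperparameters (from the table above).
DT_CONFIG = {
    "context_length": 20,   # K (R-hat, s, a) triples -> 60 tokens
    "n_layers": 3,          # GPT-2-style blocks (3-6 typical)
    "n_heads": 4,           # attention heads (4-8 typical)
    "embed_dim": 128,       # token embedding dim (128-256 typical)
    "learning_rate": 1e-4,  # Adam
    "batch_size": 64,       # sequences per batch
    "dropout": 0.1,
}
```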
Complexity¶
- Training: Standard Transformer cost, O((3K)² × d_model) attention per forward pass (K triples → 3K tokens)
- Inference: Autoregressive, but only predicts actions — fast per step
- Memory: Context window bounded by K — manageable
- Data: Requires large offline dataset of varied quality
Primary Use Cases¶
- Offline RL benchmarks (D4RL): MuJoCo locomotion, Antmaze
- Multi-task and multi-domain agents (Gato)
- Settings where offline data is abundant but online interaction is expensive
- Language-conditioned control
- Game playing from demonstrations
Known Limitations¶
- Cannot reliably exceed the best trajectories in the dataset
- No TD bootstrapping, so no trajectory "stitching" in the base form: sub-optimal trajectory segments cannot be recombined into a better policy
- Return conditioning assumes the agent can achieve the conditioned return
- Requires high-quality data with diverse returns
- Underperforms TD-learning methods (CQL, IQL) on many D4RL benchmarks
- Not designed for online interaction (though Online DT variants exist)
- More conceptually elegant than practically dominant
Major Variants¶
| Variant | Reference | Key Change |
|---|---|---|
| Trajectory Transformer | Janner et al., NeurIPS 2021 | Models states/rewards too, beam search |
| Online DT | Zheng et al., ICML 2022 | Fine-tunes with online interaction |
| Elastic DT | Wu et al., NeurIPS 2023 | Variable-length history |
| Gato | Reed et al., 2022 | Multi-modal, multi-task generalist |
| QDT | Yamagata et al., ICML 2023 | Combines TD-learning with seq modeling |
Relationship to Other Algorithms¶
- Fundamentally different paradigm from PPO/SAC/DQN — casts RL as supervised learning
- Competes with offline RL methods: CQL, IQL, BCQ
- Gato connects to the "foundation model for control" vision
- The paradigm informed LLM-as-agent approaches
- Dreamer also implicitly models dynamics but trains with imagination, not conditioning
Industry Deployment¶
- DeepMind: Gato (multi-modal agent)
- Research-stage at most companies; not widely deployed in production
- Influential conceptually — bridged NLP and RL communities
- Informed LLM-as-agent approaches (but those typically use prompting, not DT-style training)