BC -- Behavioral Cloning

Intuition

Behavioral Cloning is the simplest possible approach to learning from demonstrations: treat it as supervised learning. Given a dataset of expert state-action pairs, BC trains a neural network to predict the expert's action from the current state. For continuous actions this is a regression problem (MSE loss); for discrete actions it is classification (cross-entropy loss). BC serves as a strong baseline and is often the right choice when you have high-quality expert data and do not need to improve beyond the demonstrator.

Key Equations

Continuous actions (regression):

\[ L(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \left\| \pi_\theta(s) - a \right\|^2 \right] \]

Discrete actions (classification):

\[ L(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \log \pi_\theta(a | s) \right] \]
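Both losses are a few lines of NumPy. The helper names below (`bc_mse_loss`, `bc_nll_loss`) are illustrative, not part of the rlox API:

```python
import numpy as np

def bc_mse_loss(pred_actions, expert_actions):
    """Continuous BC loss: mean squared error between predicted and expert actions."""
    return np.mean(np.sum((pred_actions - expert_actions) ** 2, axis=-1))

def bc_nll_loss(logits, expert_actions):
    """Discrete BC loss: negative log-likelihood of the expert's actions."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(expert_actions)), expert_actions])
```

Note that `bc_nll_loss` takes raw logits and applies log-softmax internally, matching the convention of most deep-learning loss implementations.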

Pseudocode

algorithm Behavioral Cloning:
    initialize policy network pi_theta
    load offline dataset D = {(s_i, a_i)}

    for update = 1 to n_updates do
        sample minibatch {(s, a)} from D

        if continuous:
            pred = pi_theta(s)
            loss = MSE(pred, a)
        else:
            logits = pi_theta(s)
            loss = cross_entropy(logits, a)

        update theta with Adam
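The loop above can be made concrete. This sketch substitutes a linear policy and plain minibatch gradient descent for the neural network and Adam, so it stays dependency-free; the shapes, learning rate, and synthetic expert are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert" dataset: actions are an exact linear function of the state.
W_true = rng.normal(size=(6, 17))
states = rng.normal(size=(1000, 17))
actions = states @ W_true.T

# Linear policy pi_theta(s) = W s, trained by minibatch gradient descent on MSE.
W = np.zeros((6, 17))
lr, batch_size = 1e-2, 256
for update in range(2000):
    idx = rng.integers(0, len(states), size=batch_size)
    s, a = states[idx], actions[idx]
    pred = s @ W.T
    grad = 2.0 / batch_size * (pred - a).T @ s   # dL/dW for the MSE objective
    W -= lr * grad

print(np.abs(W - W_true).max())  # near zero: the expert has been cloned
```

Because the synthetic expert is itself linear and noise-free, the learned weights converge to the true ones; with real demonstrations the minimum of the MSE loss is the best state-conditional average of the expert's actions.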

Quick Start

BC uses the offline algorithm interface:

from rlox.offline import OfflineDatasetBuffer
from rlox.algorithms.bc import BC

dataset = OfflineDatasetBuffer.from_d4rl("halfcheetah-expert-v2")
agent = BC(
    dataset=dataset,
    obs_dim=17,
    act_dim=6,
    continuous=True,
    batch_size=256,
)
metrics = agent.train(n_updates=50_000)

For discrete actions:

agent = BC(
    dataset=dataset,
    obs_dim=4,
    act_dim=2,       # number of discrete actions
    continuous=False,
    batch_size=256,
)
metrics = agent.train(n_updates=50_000)
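The discrete case follows the same recipe. As a dependency-free illustration (not rlox code), here is a linear-softmax policy cloned from a synthetic expert that chooses its action from the sign of the first state component:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic expert for a 4-dim state, 2-action task.
states = rng.normal(size=(2000, 4))
actions = (states[:, 0] > 0).astype(int)

# Linear-softmax policy trained with the cross-entropy loss.
W = np.zeros((2, 4))
lr = 0.5
for update in range(500):
    idx = rng.integers(0, len(states), size=256)
    s, a = states[idx], actions[idx]
    logits = s @ W.T
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    probs[np.arange(len(a)), a] -= 1.0          # dL/dlogits = softmax - one_hot
    W -= lr * probs.T @ s / len(a)

pred = (states @ W.T).argmax(axis=1)
print((pred == actions).mean())  # training accuracy, close to 1.0
```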

Hyperparameters

Parameter      Default  Description
continuous     True     Whether the action space is continuous
hidden         256      Hidden layer width
learning_rate  3e-4     Adam learning rate
batch_size     256      Minibatch size

When to Use

  • Use BC when: you have high-quality expert demonstrations and want the simplest possible baseline, or as a pre-training step before fine-tuning with RL.
  • Prefer BC over offline RL methods when: the dataset is expert-quality and you do not need to stitch together sub-optimal trajectories.
  • Do not use BC when: the demonstration data is sub-optimal or multi-modal (prefer IQL, Diffusion Policy, or AWR), or when you need to improve beyond the demonstrator's performance.

References

  • Pomerleau, D. A. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. NeurIPS 1989.
  • Bain, M. & Sammut, C. (1995). A Framework for Behavioural Cloning. Machine Intelligence 15, pp. 103-129.