BC -- Behavioral Cloning¶

Intuition¶

Behavioral Cloning is the simplest possible approach to learning from demonstrations: treat it as supervised learning. Given a dataset of expert state-action pairs, BC trains a neural network to predict the expert's action from the current state. For continuous actions this is a regression problem (MSE loss); for discrete actions it is classification (cross-entropy loss). BC serves as a strong baseline and is often the right choice when you have high-quality expert data and do not need to improve beyond the demonstrator.

Key Equations¶

Continuous actions (regression):

\[ L(\theta) = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \left\| \pi_\theta(s) - a \right\|^2 \right] \]

Discrete actions (classification):

\[ L(\theta) = -\mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \log \pi_\theta(a | s) \right] \]

Pseudocode¶

algorithm Behavioral Cloning:
    initialize policy network pi_theta
    load offline dataset D = {(s_i, a_i)}

    for update = 1 to n_updates do
        sample minibatch {(s, a)} from D

        if continuous:
            pred = pi_theta(s)
            loss = MSE(pred, a)
        else:
            logits = pi_theta(s)
            loss = cross_entropy(logits, a)

        update theta with Adam

Quick Start¶

BC uses the offline algorithm interface:

from rlox.offline import OfflineDatasetBuffer
from rlox.algorithms.bc import BC

dataset = OfflineDatasetBuffer.from_d4rl("halfcheetah-expert-v2")
agent = BC(
    dataset=dataset,
    obs_dim=17,
    act_dim=6,
    continuous=True,
    batch_size=256,
)
metrics = agent.train(n_updates=50_000)

For discrete actions:

agent = BC(
    dataset=dataset,
    obs_dim=4,
    act_dim=2,       # number of discrete actions
    continuous=False,
    batch_size=256,
)
metrics = agent.train(n_updates=50_000)

Hyperparameters¶

Parameter	Default	Description
`continuous`	`True`	Whether the action space is continuous
`hidden`	`256`	Hidden layer width
`learning_rate`	`3e-4`	Adam learning rate
`batch_size`	`256`	Minibatch size

When to Use¶

Use BC when: you have high-quality expert demonstrations and want the simplest possible baseline, or as a pre-training step before fine-tuning with RL.
Prefer BC over offline RL methods when: the dataset is expert-quality and you do not need to stitch together sub-optimal trajectories.
Do not use BC when: the demonstration data is sub-optimal or multi-modal (prefer IQL, Diffusion Policy, or AWR), or when you need to improve beyond the demonstrator's performance.

References¶

Pomerleau, D. A. (1989). ALVINN: An Autonomous Land Vehicle in a Neural Network. NeurIPS 1989.
Bain, M. & Sammut, C. (1995). A Framework for Behavioural Cloning. Machine Intelligence 15, pp. 103-129.