Functions
- compute_batch_group_advantages - Batched GRPO group advantages: process all groups in a single call.
- compute_batch_token_kl - Batched token-level KL divergence: process all sequences in a single call.
- compute_batch_token_kl_schulman - Batched token-level KL divergence using the Schulman (2020) estimator.
- compute_group_advantages - GRPO group advantage: (reward - mean) / std. Returns zeros if std < 1e-8.
- compute_token_kl - Token-level KL divergence: sum(exp(log_p) * (log_p - log_q)).
- compute_token_kl_schulman - Token-level KL divergence using the Schulman (2020) estimator: sum(exp(log_p - log_q) - (log_p - log_q) - 1).
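
The three formulas listed above can be sketched in plain Rust as follows. This is a minimal illustration, not the crate's implementation: the function names ending in `_sketch`, the `&[f64]` slice signatures, and the use of the population standard deviation (dividing by n rather than n - 1) are all assumptions, since the listing does not show the real signatures or the std convention.

```rust
/// Hypothetical sketch of the group advantage: (reward - mean) / std.
/// Returns zeros if std < 1e-8, as the listing describes.
/// ASSUMPTION: population standard deviation (divide by n).
fn group_advantages_sketch(rewards: &[f64]) -> Vec<f64> {
    if rewards.is_empty() {
        return Vec::new();
    }
    let n = rewards.len() as f64;
    let mean = rewards.iter().sum::<f64>() / n;
    let var = rewards.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    if std < 1e-8 {
        // Degenerate group: every reward is (almost) identical.
        return vec![0.0; rewards.len()];
    }
    rewards.iter().map(|r| (r - mean) / std).collect()
}

/// Hypothetical sketch of sum(exp(log_p) * (log_p - log_q)).
/// ASSUMPTION: both slices hold per-token log-probabilities of equal length.
fn token_kl_sketch(log_p: &[f64], log_q: &[f64]) -> f64 {
    assert_eq!(log_p.len(), log_q.len(), "slices must have equal length");
    log_p
        .iter()
        .zip(log_q)
        .map(|(lp, lq)| lp.exp() * (lp - lq))
        .sum()
}

/// Hypothetical sketch of sum(exp(d) - d - 1) with d = log_p - log_q.
/// Each term is non-negative because exp(d) >= d + 1 for every real d.
fn token_kl_schulman_sketch(log_p: &[f64], log_q: &[f64]) -> f64 {
    assert_eq!(log_p.len(), log_q.len(), "slices must have equal length");
    log_p
        .iter()
        .zip(log_q)
        .map(|(lp, lq)| {
            let d = lp - lq;
            d.exp() - d - 1.0
        })
        .sum()
}
```

Under these assumptions, rewards of 1.0, 2.0, 3.0 have mean 2 and population std sqrt(2/3), so the sketch yields advantages of roughly -1.2247, 0, and 1.2247. Both KL sketches return 0 when log_p and log_q are identical.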