GRPO -- Group Relative Policy Optimization

Intuition

GRPO is a policy optimization algorithm designed for LLM post-training. Instead of training a separate value function (as in PPO), GRPO estimates advantages by generating a group of completions for each prompt and normalizing their rewards within the group: each reward has the group mean subtracted and is then divided by the group standard deviation. Completions that score above the group average receive a positive advantage; those below receive a negative one. This group-relative normalization eliminates the need for a critic network, which simplifies the training pipeline and reduces the memory footprint. A KL penalty against a frozen reference model keeps the policy from drifting too far from its starting point.

Key Equations

For each prompt \(x\), generate \(G\) completions \(\{y_1, \ldots, y_G\}\) with rewards \(\{r_1, \ldots, r_G\}\).

Group-relative advantages:

\[ A_i = \frac{r_i - \mu_G}{\sigma_G}, \quad \mu_G = \frac{1}{G} \sum_j r_j, \quad \sigma_G = \text{std}(\{r_j\}) \]
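As a worked illustration of the formula above, here is a minimal sketch in plain Python for a single group. The function name `group_advantages` and the `eps` guard are assumptions added for the example (the formula itself has no epsilon), and the sketch assumes the population standard deviation (dividing by G); an implementation built on `torch.std` would use the sample standard deviation by default and give slightly different values.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Normalize one prompt's rewards to zero mean and unit standard deviation."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Population standard deviation (divide by G); an assumption of this sketch.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# One correct completion out of four, scored with a 0/1 reward.
advantages = group_advantages([1.0, 0.0, 0.0, 0.0])
print([round(a, 3) for a in advantages])  # [1.732, -0.577, -0.577, -0.577]
```

When all completions in a group receive the same reward, every advantage is zero and that prompt contributes no policy-gradient signal.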

GRPO loss:

\[ L(\theta) = -\mathbb{E}_{x, \{y_i\}} \left[ \sum_{i=1}^{G} A_i \sum_{t} \log \pi_\theta(y_{i,t} | x, y_{i,<t}) \right] + \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{ref}}) \]
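The policy-gradient part of this loss can be sketched for a single group in plain Python, so the arithmetic stays visible. The function name `policy_term` and the toy numbers are illustrative assumptions, not part of rlox. The sketch averages over the group, as the Pseudocode section does with `mean`, rather than summing as in the formula; the two differ only by the constant factor G.

```python
def policy_term(advantages, token_logprobs):
    """Return -(1/G) * sum_i A_i * sum_t log pi_theta(y_{i,t} | x, y_{i,<t})."""
    # Sequence log-probability = sum of the per-token log-probabilities.
    seq_logprobs = [sum(lp) for lp in token_logprobs]
    # Weight each sequence log-probability by its group-relative advantage.
    weighted = [a * s for a, s in zip(advantages, seq_logprobs)]
    # Negate so that minimizing the loss raises the probability of
    # completions with positive advantage.
    return -sum(weighted) / len(weighted)

# Two completions of two tokens each (toy values).
advantages = [1.0, -1.0]
token_logprobs = [[-0.5, -0.5], [-1.0, -2.0]]
print(policy_term(advantages, token_logprobs))  # -1.0
```

In training, the advantages are typically treated as constants, so gradients flow only through the log-probabilities.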

Per-token KL penalty:

\[ \text{KL} = \mathbb{E}_t \left[ \log \pi_\theta(y_t | \cdot) - \log \pi_{\text{ref}}(y_t | \cdot) \right] \]
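A minimal sketch of this estimate in plain Python, assuming the per-token log-probabilities of the sampled tokens are already available under both models. The function name `kl_estimate` and the numbers are hypothetical.

```python
def kl_estimate(logprobs_policy, logprobs_ref):
    """Mean of log pi_theta(y_t) - log pi_ref(y_t) over the sampled tokens."""
    diffs = [p - r for p, r in zip(logprobs_policy, logprobs_ref)]
    return sum(diffs) / len(diffs)

# Toy per-token log-probabilities for one three-token completion.
policy = [-0.2, -1.0, -0.5]
ref = [-0.4, -1.0, -0.9]
kl = kl_estimate(policy, ref)  # roughly 0.2
beta = 0.1
penalty = beta * kl            # the term added to the loss
```

Because the expectation is estimated from sampled tokens only, this quantity can come out negative on an individual batch even though the true KL divergence is never negative.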

Pseudocode

algorithm GRPO:
    initialize policy model pi_theta
    freeze reference model pi_ref = copy(pi_theta)
    set group_size G, KL coefficient beta

    for each batch of prompts:
        for each prompt x:
            generate G completions {y_1, ..., y_G} ~ pi_theta
            compute rewards {r_1, ..., r_G} via reward_fn

        # Group-relative advantages (batched via Rust)
        A = compute_batch_group_advantages(rewards, G)

        # Per-token log probs
        logprobs_policy = get_per_token_logprobs(pi_theta, completions)
        logprobs_ref = get_per_token_logprobs(pi_ref, completions)

        # Loss
        seq_logprobs = sum(logprobs_policy, dim=tokens)
        loss = -mean(A * seq_logprobs)
        kl = mean(logprobs_policy - logprobs_ref)
        loss = loss + beta * kl

        update theta with gradient clipping
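The `get_per_token_logprobs` step in the pseudocode can be illustrated with a small stand-in that works on precomputed logits. This is a sketch in plain Python, not the rlox implementation: the name `per_token_logprobs`, its arguments, and the assumption that each logits row is already aligned with the token it predicts are all hypothetical.

```python
import math

def per_token_logprobs(logits, token_ids):
    """Log-probability of each generated token under a softmax over its logits row.

    logits: list of T rows, each a list of V raw scores (one row per position).
    token_ids: list of T indices of the tokens that were actually generated.
    """
    out = []
    for row, tok in zip(logits, token_ids):
        # Numerically stable log-sum-exp: subtract the row maximum first.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        # log softmax(row)[tok] = row[tok] - log(sum(exp(row)))
        out.append(row[tok] - log_z)
    return out

# Two positions, vocabulary of three tokens (toy values).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
logprobs = per_token_logprobs(logits, [0, 2])
seq_logprob = sum(logprobs)  # the quantity called seq_logprobs in the pseudocode
```

Running the same function on logits from the reference model would give the `logprobs_ref` values used in the KL term.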

Quick Start

import copy

import torch.nn as nn
from rlox.algorithms.grpo import GRPO

model = MyLanguageModel()        # your own nn.Module; forward(input_ids) -> logits
ref_model = copy.deepcopy(model) # reference copy for the KL penalty
ref_model.eval()

def reward_fn(completions, prompts):
    # Return a list with one float reward per completion.
    # `score` is a placeholder for your own scoring function.
    return [score(c) for c in completions]

trainer = GRPO(
    model=model,
    ref_model=ref_model,
    reward_fn=reward_fn,
    group_size=4,
    kl_coef=0.1,
    learning_rate=1e-4,
    max_new_tokens=8,
)
# `prompt_tensor` is a placeholder for your own batch of prompts.
metrics = trainer.train(prompts=prompt_tensor, n_epochs=3)
print(f"Mean reward: {metrics['mean_reward']:.3f}, KL: {metrics['kl']:.4f}")
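To make the `reward_fn` contract concrete, here is a hypothetical toy reward with the same signature as in the Quick Start. The source does not say whether completions arrive as token ids or as decoded text; this sketch assumes each completion is a sequence of token ids, and `TARGET_TOKEN_ID` is an invented constant for illustration.

```python
TARGET_TOKEN_ID = 7  # hypothetical token id that the toy task rewards

def reward_fn(completions, prompts):
    """Return one float per completion: the fraction of tokens equal to TARGET_TOKEN_ID."""
    rewards = []
    for completion in completions:
        tokens = [int(t) for t in completion]  # assumes an iterable of token ids
        if not tokens:
            rewards.append(0.0)  # an empty completion earns nothing
            continue
        hits = sum(1 for t in tokens if t == TARGET_TOKEN_ID)
        rewards.append(hits / len(tokens))
    # `prompts` is unused in this toy example but kept to match the signature.
    return rewards

rewards = reward_fn([[7, 7, 1, 2], [1, 2, 3, 4]], prompts=[None, None])
print(rewards)  # [0.5, 0.0]
```

Whatever the scoring rule, the rewards within a group need to differ for the group-relative advantages to carry any signal.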

Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `group_size` | `4` | Number of completions generated per prompt |
| `kl_coef` | `0.1` | KL penalty coefficient against the reference model |
| `learning_rate` | `1e-4` | Adam learning rate |
| `max_new_tokens` | `8` | Maximum number of tokens to generate per completion |
| `max_grad_norm` | `1.0` | Maximum gradient norm used for gradient clipping |
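As an illustration of what `max_grad_norm` controls, the sketch below clips a flat list of gradient values by their global L2 norm in plain Python. The source does not say how rlox implements the clip; in PyTorch this is commonly done with `torch.nn.utils.clip_grad_norm_`, and the helper `clip_by_global_norm` here is a hypothetical stand-in that follows the same idea.

```python
import math

def clip_by_global_norm(grads, max_grad_norm=1.0):
    """Rescale gradient values so their joint L2 norm is at most max_grad_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever shrink: gradients already inside the norm ball are left untouched.
    # The small constant avoids division by zero for an all-zero gradient.
    scale = min(1.0, max_grad_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

# A gradient of norm 5 is scaled down to norm (just under) 1.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_grad_norm=1.0)
```

The direction of the update is preserved; only its length is capped, which limits the effect of occasional very large gradients.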

When to Use

Use GRPO when: you are doing LLM post-training (for example RLHF) and want to avoid the complexity of training a separate value network.
Prefer GRPO over PPO for LLMs when: memory is constrained (no critic network is needed), or the reward scores whole completions, so comparing completions within a group is informative.
Do not use GRPO when: you need per-token advantage estimates (PPO with a critic may be more precise), or you are doing standard RL outside LLM training.

References

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.