QMIX -- Monotonic Value Function Factorisation¶

Intuition¶

QMIX solves cooperative multi-agent tasks by decomposing the joint team Q-value into per-agent utility functions combined through a monotonic mixing network. Each agent learns its own Q-network from local observations, while a hypernetwork conditioned on the global state generates the mixing weights. The key insight is enforcing monotonicity: the joint Q-value is non-decreasing in each agent's utility. This guarantees that the joint action maximizing the joint Q-value is the same one obtained when every agent greedily maximizes its own utility, so training can be centralized while execution stays decentralized.
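The argmax-consistency property can be checked by brute force on a toy case. The sketch below is illustrative only: `monotonic_mix` is a hand-written stand-in for the mixing network (any function that is non-decreasing in each input would do), and the utility values are made up.

```python
from itertools import product

def monotonic_mix(q1: float, q2: float) -> float:
    """Toy mixer that is non-decreasing in both inputs (not the QMIX network)."""
    return 2.0 * q1 + 0.5 * q2 + min(q1, q2)

# Hypothetical per-agent utilities Q_i(o_i, a) for three actions each.
agent_utilities = [
    [0.1, 0.9, 0.4],  # agent 1
    [0.7, 0.2, 0.3],  # agent 2
]

# Decentralized choice: each agent maximizes its own utility.
decentralized = tuple(
    max(range(len(q)), key=q.__getitem__) for q in agent_utilities
)

# Centralized choice: exhaustive search over all joint actions.
centralized = max(
    product(range(3), repeat=2),
    key=lambda joint: monotonic_mix(
        agent_utilities[0][joint[0]], agent_utilities[1][joint[1]]
    ),
)

print(decentralized, centralized)
```

Both searches select the same joint action here, while the decentralized one never enumerates the joint action space.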

Key Equations¶

Per-agent Q-values are combined via the mixing network:

\[ Q_{\text{tot}}(s, \mathbf{a}) = f_{\text{mix}}\!\left(Q_1(o_1, a_1), \ldots, Q_n(o_n, a_n); s\right) \]

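Monotonicity requires that the joint Q-value never decreases when any agent's utility increases:

\[ \frac{\partial Q_{\text{tot}}}{\partial Q_i} \geq 0, \quad \forall i \]

Non-negative mixing weights combined with a monotonic activation are sufficient for this. The weights are made non-negative by applying \(|\cdot|\) (absolute value) to the hypernetwork outputs that serve as weights; mixing biases need no such constraint: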

\[ W = |f_{\text{hyper}}(s)| \]
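A minimal sketch of such a mixer in plain Python follows. It makes several simplifying assumptions that are not taken from this page: each hypernetwork is a single bias-free linear map of the state, the hidden activation is ELU, and the weights are random rather than trained. `ToyMixer` and `linear` are hypothetical names.

```python
import math
import random

def linear(weights, x):
    """Bias-free dense layer: one row of weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def elu(x):
    """Monotonic activation, so it preserves the ordering of its inputs."""
    return x if x > 0 else math.exp(x) - 1.0

class ToyMixer:
    """Two-layer mixer whose weights and biases are generated from the state."""

    def __init__(self, n_agents, state_dim, embed_dim, seed=0):
        rng = random.Random(seed)

        def rand(rows):
            return [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(rows)]

        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = rand(n_agents * embed_dim)
        self.hyper_b1 = rand(embed_dim)
        self.hyper_w2 = rand(embed_dim)
        self.hyper_b2 = rand(1)

    def __call__(self, agent_qs, state):
        # |.| makes every mixing weight non-negative; biases stay unconstrained.
        w1 = [abs(v) for v in linear(self.hyper_w1, state)]
        b1 = linear(self.hyper_b1, state)
        w2 = [abs(v) for v in linear(self.hyper_w2, state)]
        b2 = linear(self.hyper_b2, state)[0]
        hidden = [
            elu(
                sum(agent_qs[i] * w1[i * self.embed_dim + k] for i in range(self.n_agents))
                + b1[k]
            )
            for k in range(self.embed_dim)
        ]
        return sum(h * w for h, w in zip(hidden, w2)) + b2

mixer = ToyMixer(n_agents=3, state_dim=4, embed_dim=8)
state = [0.5, -1.0, 0.3, 2.0]
base_qs = [0.2, -0.4, 1.0]
base = mixer(base_qs, state)
# Raising any single agent's utility must never lower Q_tot.
bumped = [
    mixer([q + 0.5 if j == i else q for j, q in enumerate(base_qs)], state)
    for i in range(3)
]
print(base, bumped)
```

Because only the weights pass through the absolute value, the state can still shift Q_tot up or down through the biases without affecting monotonicity.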

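The loss is the standard squared TD error on the joint Q-value, where \(Q_{\text{tot}}^-\) is computed with the target networks and \(\gamma\) is the discount factor: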

\[ L = \mathbb{E}\!\left[\left(r + \gamma \max_{\mathbf{a}'} Q_{\text{tot}}^-(s', \mathbf{a}') - Q_{\text{tot}}(s, \mathbf{a})\right)^2\right] \]
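As a worked example, the loss can be evaluated by hand on a two-transition minibatch. The numbers below are made up, and the `(1 - done)` mask on the bootstrap term follows the pseudocode on this page rather than the equation above; `td_targets` and `td_loss` are hypothetical helper names.

```python
def td_targets(rewards, next_q_tot, dones, gamma=0.99):
    """r + gamma * (1 - done) * max_a' Q_tot^-(s', a') for each transition."""
    return [r + gamma * (1.0 - d) * q for r, q, d in zip(rewards, next_q_tot, dones)]

def td_loss(q_tot, targets):
    """Mean squared TD error over the minibatch."""
    return sum((t - q) ** 2 for q, t in zip(q_tot, targets)) / len(q_tot)

rewards = [1.0, 0.0]
next_q_tot = [2.0, 5.0]  # target-network Q_tot at the greedy next joint action
dones = [0.0, 1.0]       # second transition ends the episode
q_tot = [2.5, 0.5]       # online Q_tot for the actions actually taken

targets = td_targets(rewards, next_q_tot, dones)
loss = td_loss(q_tot, targets)
print(targets, loss)
```

The first target is 1 + 0.99 × 2 = 2.98 and the second is 0 because the episode terminated, giving a loss of about 0.24.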

Pseudocode¶

algorithm QMIX:
    initialize per-agent Q-networks {Q_i} and target networks {Q_i^-}
    initialize mixing network M and target M^-
    initialize replay buffer D

    for step = 1 to total_timesteps do
        for each agent i:
            select a_i using epsilon-greedy on Q_i(o_i, .)

        execute joint action, observe r, s', done
        store (s, {o_i}, {a_i}, r, done, s', {o_i'}) in D

        sample minibatch from D
        for each agent i:
            q_i = Q_i(o_i, a_i)       # chosen action Q-value
            q_i_target = max_a Q_i^-(o_i', a)  # greedy target

        q_tot = M({q_i}, global_state)
        q_tot_target = M^-({q_i_target}, global_state')  # treated as a constant: no gradient
        L = MSE(q_tot, r + gamma * (1 - done) * q_tot_target)

        update all Q_i and M parameters by gradient descent on L
        periodically hard-copy to target networks
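The action-selection step of the pseudocode can be sketched in a few lines of Python. This is an illustration, not the library's implementation: `select_actions` is a hypothetical helper, and it assumes every agent explores independently with the same epsilon.

```python
import random

def select_actions(per_agent_q, epsilon, rng):
    """Each agent explores with probability epsilon, otherwise acts greedily on its own Q_i."""
    actions = []
    for q_values in per_agent_q:
        if rng.random() < epsilon:
            actions.append(rng.randrange(len(q_values)))  # uniform random action
        else:
            actions.append(max(range(len(q_values)), key=q_values.__getitem__))
    return actions

# Hypothetical Q_i(o_i, .) rows for three agents.
example_q = [[0.1, 0.9], [0.5, 0.2], [0.0, 0.3, 0.8]]

greedy = select_actions(example_q, epsilon=0.0, rng=random.Random(0))
explore = select_actions(example_q, epsilon=1.0, rng=random.Random(0))
print(greedy, explore)
```

Only local Q-values enter the decision, which is why the same routine works at execution time without the mixing network or the global state.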

Quick Start¶

from rlox import Trainer

trainer = Trainer("qmix", env="CartPole-v1", seed=42, config={
    "n_agents": 3,
    "hidden_dim": 64,
    "mixing_embed_dim": 32,
})
metrics = trainer.train(total_timesteps=50_000)
print(f"Mean reward: {metrics['mean_reward']:.1f}")

Hyperparameters¶

Parameter	Default	Description
`n_agents`	`3`	Number of cooperative agents
`hidden_dim`	`64`	Hidden dimension for per-agent Q-networks
`mixing_embed_dim`	`32`	Hidden dimension of the mixing network
`learning_rate`	`5e-4`	Adam learning rate
`buffer_size`	`50_000`	Replay buffer capacity
`batch_size`	`32`	Minibatch size
`gamma`	`0.99`	Discount factor
`target_update_freq`	`200`	Steps between hard target network updates
`epsilon_start`	`1.0`	Initial exploration epsilon
`epsilon_end`	`0.05`	Final exploration epsilon
`epsilon_decay_steps`	`5000`	Linear epsilon decay duration
`seed`	`42`	Random seed
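To make the schedule-related parameters concrete, here is one plausible reading of `epsilon_start`, `epsilon_end`, `epsilon_decay_steps`, and `target_update_freq`. The function names are hypothetical and the exact behavior inside the library is an assumption (for instance, whether the first target copy happens at step 0 or step 200).

```python
def linear_epsilon(step, start=1.0, end=0.05, decay_steps=5000):
    """Linearly interpolate from start to end, then hold at end."""
    fraction = min(max(step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)

def should_update_target(step, target_update_freq=200):
    """True on every target_update_freq-th step after the first."""
    return step > 0 and step % target_update_freq == 0

for step in (0, 2500, 5000, 10_000):
    print(step, round(linear_epsilon(step), 3), should_update_target(step))
```

With the defaults, exploration reaches its floor after 5,000 steps, which is 10% of the 50,000-step budget used in the Quick Start.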

When to Use¶

Use QMIX when: you have a cooperative multi-agent task with discrete actions and shared rewards, and agents can only observe local information at execution time.
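Prefer QMIX over MAPPO when: you want off-policy value decomposition that reuses experience from a replay buffer, or the task has a clear cooperative structure where monotonicity holds.
Do not use QMIX when: the joint value is not monotonic in the per-agent utilities, for example when an agent's preference ordering over its own actions depends on what the other agents do at the same step (the mixing network cannot represent such payoffs), or agents need continuous action spaces (prefer MAPPO).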
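For small two-agent matrix games, the monotonicity caveat can be tested directly. If Q_tot is non-decreasing in each agent's utility, then whenever one action of an agent beats another for some choice of the partner, it can never be strictly worse for a different choice. The sketch below checks this necessary condition; the helper names and both payoff matrices are illustrative.

```python
def consistent_ordering(rows):
    """False if any pair of rows swaps rank between two columns."""
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            diffs = [a - b for a, b in zip(rows[i], rows[j])]
            if any(d > 0 for d in diffs) and any(d < 0 for d in diffs):
                return False
    return True

def passes_monotonic_check(payoff):
    """Necessary condition for payoff[a1][a2] to fit a monotonic factorization."""
    columns = [list(col) for col in zip(*payoff)]
    # Rows test agent 1's action ordering, columns test agent 2's.
    return consistent_ordering(payoff) and consistent_ordering(columns)

additive_game = [[0, 1], [1, 2]]          # each agent's better action is always the same
coordination_game = [[8, -12], [-12, 0]]  # the best action depends on the partner

print(passes_monotonic_check(additive_game))
print(passes_monotonic_check(coordination_game))
```

A game that fails this check cannot be represented exactly by a monotonic mixer, so QMIX can only approximate its values.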

References¶

Rashid, T., Samvelyan, M., Schroeder de Witt, C., Farquhar, G., Foerster, J., & Whiteson, S. (2018). QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML 2018. arXiv:1803.11485.