Policies API

rlox.policies.DiscretePolicy

Bases: Module

MLP actor-critic for discrete action spaces (e.g. CartPole).

Separate actor and critic networks sharing no parameters. The actor outputs logits for a Categorical distribution; the critic outputs a scalar value estimate.

Parameters

obs_dim : int
    Observation space dimensionality.
n_actions : int
    Number of discrete actions.
hidden : int
    Hidden layer width (default 64, matching CleanRL PPO).

Required interface methods (called by PPOLoss / RolloutCollector):

- get_action_and_logprob(obs) → (actions, log_probs)
- get_value(obs) → values
- get_logprob_and_entropy(obs, actions) → (log_probs, entropy)

get_action_value(obs: torch.Tensor)

Combined action + value in a single call (saves one Python dispatch).
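The required interface can be sketched as a minimal PyTorch module. Everything below (the class name TinyDiscretePolicy, the single hidden layer, the Tanh activations) is an illustrative assumption, not rlox's actual implementation:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class TinyDiscretePolicy(nn.Module):
    """Minimal actor-critic exposing the interface PPOLoss expects."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def get_action_and_logprob(self, obs):
        # Actor logits parameterise a Categorical distribution.
        dist = Categorical(logits=self.actor(obs))
        actions = dist.sample()
        return actions, dist.log_prob(actions)

    def get_value(self, obs):
        return self.critic(obs).squeeze(-1)

    def get_logprob_and_entropy(self, obs, actions):
        # Re-evaluate stored actions under the current policy (PPO update step).
        dist = Categorical(logits=self.actor(obs))
        return dist.log_prob(actions), dist.entropy()

policy = TinyDiscretePolicy(obs_dim=4, n_actions=2)
obs = torch.randn(8, 4)
actions, logp = policy.get_action_and_logprob(obs)
```

Any module with these three methods can be dropped into the PPO machinery in place of DiscretePolicy.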

rlox.policies.ContinuousPolicy

Bases: Module

MLP actor-critic for continuous action spaces (e.g. Pendulum).

The actor outputs a mean vector; a separate learnable log_std parameter (state-independent) parameterises the diagonal Gaussian. The critic is a separate MLP producing a scalar value estimate.

Both networks use orthogonal initialisation, following the same convention as DiscretePolicy.

Parameters

obs_dim : int
    Observation space dimensionality.
act_dim : int
    Action space dimensionality.
hidden : int
    Hidden layer width (default 64).

get_action_value(obs: torch.Tensor)

Combined action + value in a single call.
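The state-independent log_std construction can be sketched as follows. TinyContinuousActor and its layer sizes are assumptions for illustration; only the idea (one learnable log-std per action dimension, shared across all states) mirrors the description above:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class TinyContinuousActor(nn.Module):
    """Mean network plus a state-independent log_std parameter."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, act_dim)
        )
        # One learnable log-std per action dimension, independent of the state.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        mean = self.mean_net(obs)
        std = self.log_std.exp().expand_as(mean)
        return Normal(mean, std)  # diagonal Gaussian

actor = TinyContinuousActor(obs_dim=3, act_dim=1)
obs = torch.randn(5, 3)
d = actor.dist(obs)
action = d.sample()
logp = d.log_prob(action).sum(-1)  # diagonal Gaussian: sum per-dim log-probs
```

Because log_std is a free parameter rather than a network output, exploration width is learned globally and cannot collapse differently per state.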

Off-Policy Networks

rlox.networks.SquashedGaussianPolicy

Bases: Module

Gaussian policy with tanh squashing for SAC.

Outputs actions in [-1, 1] with corrected log-probabilities.

sample(obs: torch.Tensor)

Sample action and compute log-prob with tanh correction.

deterministic(obs: torch.Tensor) -> torch.Tensor

Return deterministic action (mean through tanh).
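The "corrected log-probabilities" refer to the change-of-variables term for a = tanh(u): log π(a) = log N(u) − log(1 − tanh(u)²) per dimension. A plain-Python walkthrough of the scalar case (normal_logpdf and squashed_logprob are hypothetical helpers, not rlox functions):

```python
import math

def normal_logpdf(u, mean=0.0, std=1.0):
    """Log-density of a univariate Gaussian."""
    return -0.5 * ((u - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

def squashed_logprob(u, mean=0.0, std=1.0):
    """Log-prob of a = tanh(u): subtract log of the Jacobian |da/du| = 1 - tanh(u)^2."""
    return normal_logpdf(u, mean, std) - math.log(1.0 - math.tanh(u) ** 2)

u = 0.5
a = math.tanh(u)          # squashed action in [-1, 1]
lp = squashed_logprob(u)  # log-prob of a under the squashed distribution
```

Since 1 − tanh(u)² < 1, the correction is positive: squashing concentrates probability mass, so the squashed log-prob exceeds the pre-squash Gaussian log-prob.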

rlox.networks.DeterministicPolicy

Bases: Module

Deterministic policy for TD3.

rlox.networks.QNetwork

Bases: Module

Twin Q-value network for SAC / TD3.

Takes (obs, action) concatenated as input, outputs scalar Q-value.

rlox.networks.SimpleQNetwork

Bases: Module

Standard DQN Q-network.

rlox.networks.DuelingQNetwork

Bases: Module

Dueling DQN architecture: separate value and advantage streams.
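The standard dueling aggregation combines the two streams as Q(s, a) = V(s) + A(s, a) − mean_a A(s, a); the mean subtraction makes the decomposition identifiable. A minimal sketch (dueling_q is a hypothetical helper showing only the aggregation step, not the network):

```python
def dueling_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), per action."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]

# One state with V(s) = 1.0 and three per-action advantages.
q = dueling_q(1.0, [0.5, -0.5, 0.0])
```

After subtracting the mean, the average of the Q-values recovers V(s) exactly, which is what pins down the otherwise unidentifiable V/A split.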

Exploration Strategies

rlox.exploration.GaussianNoise

Additive Gaussian noise (used by TD3).

Parameters

sigma : float
    Standard deviation of the noise.
clip : float or None
    If set, clip noise to [-clip, clip].
seed : int or None
    Random seed for reproducibility.

rlox.exploration.EpsilonGreedy

Epsilon-greedy exploration (used by DQN).

Linearly decays epsilon from eps_start to eps_end over decay_fraction of total training steps.

Parameters

n_actions : int
    Number of discrete actions.
eps_start : float
    Initial epsilon (default 1.0).
eps_end : float
    Final epsilon (default 0.05).
decay_fraction : float
    Fraction of total steps over which to decay (default 0.1).
seed : int or None
    Random seed.
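The linear decay schedule can be written out in a few lines (epsilon is a hypothetical helper illustrating the schedule, using the defaults listed above):

```python
def epsilon(step, total_steps, eps_start=1.0, eps_end=0.05, decay_fraction=0.1):
    """Linearly decay epsilon over the first decay_fraction of training,
    then hold it at eps_end for the remaining steps."""
    decay_steps = decay_fraction * total_steps
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

# With the defaults, epsilon falls from 1.0 to 0.05 over the first 10% of steps.
start, mid, end = epsilon(0, 1000), epsilon(50, 1000), epsilon(500, 1000)
```

With probability epsilon the agent takes a uniformly random action from the n_actions choices; otherwise it takes the greedy argmax of the Q-network.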

rlox.exploration.OUNoise

Ornstein-Uhlenbeck noise for temporally correlated exploration.

Produces smooth, mean-reverting noise that is useful for physical control tasks where abrupt action changes are undesirable.

Parameters

action_dim : int
    Dimensionality of the action space.
mu : float
    Mean of the noise process (default 0.0).
theta : float
    Rate of mean reversion (default 0.15).
sigma : float
    Volatility (default 0.2).
seed : int or None
    Random seed.
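The process follows the discrete-time update x ← x + theta · (mu − x) + sigma · N(0, 1). A stand-in sketch (SimpleOUNoise is hypothetical; setting sigma = 0 makes the mean reversion visible deterministically):

```python
import random

class SimpleOUNoise:
    """Ornstein-Uhlenbeck process: x += theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, seed=None):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = random.Random(seed)
        self.state = [mu] * action_dim

    def sample(self):
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)

# With sigma = 0 the noise decays geometrically toward mu, showing mean reversion.
ou = SimpleOUNoise(action_dim=1, mu=0.0, theta=0.15, sigma=0.0)
ou.state = [1.0]
trace = [ou.sample()[0] for _ in range(5)]
```

Because each sample depends on the previous state, consecutive noise values are correlated, which is what produces the smooth action perturbations mentioned above.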

Builders

rlox.builders.PPOBuilder

Fluent builder for PPO algorithm.

All methods return self for chaining. Call .build() to create the PPO instance.

config(**kwargs) -> PPOBuilder

Set arbitrary config parameters.

build()

Create and return the PPO instance.
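The fluent pattern is just "every setter returns self". A minimal sketch of the mechanism (TinyBuilder is hypothetical and returns a plain dict in place of a real PPO instance):

```python
class TinyBuilder:
    """Fluent-builder sketch: config() mutates state and returns self."""

    def __init__(self):
        self._config = {}

    def config(self, **kwargs):
        self._config.update(kwargs)
        return self  # returning self is what enables chaining

    def build(self):
        # Stand-in for constructing the algorithm from the accumulated config.
        return dict(self._config)

cfg = TinyBuilder().config(lr=3e-4).config(gamma=0.99, clip_coef=0.2).build()
```

Later calls override earlier ones for the same key, so defaults can be set first and tweaked afterwards.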

rlox.builders.SACBuilder

Fluent builder for SAC algorithm.

Supports custom actor, critic, exploration strategy, and all hyperparameters.

critic(c: nn.Module) -> SACBuilder

Set both critics to the same architecture (separate instances).

critics(c1: nn.Module, c2: nn.Module) -> SACBuilder

Set critics independently.
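One plausible way to implement the critic()/critics() distinction is to deep-copy the supplied module so the twin critics share an architecture but no parameters; rlox may do this differently, and CriticPair below is purely illustrative (dicts stand in for nn.Module instances):

```python
import copy

class CriticPair:
    """Sketch: critic() clones one architecture into two independent critics."""

    def critic(self, c):
        # Same architecture, separate instances: mutating q1 must not touch q2.
        self.q1, self.q2 = c, copy.deepcopy(c)
        return self

    def critics(self, c1, c2):
        # Fully independent critics supplied by the caller.
        self.q1, self.q2 = c1, c2
        return self

pair = CriticPair().critic({"hidden": 256})
pair.q1["hidden"] = 128  # only the first critic changes
```

Independent twin critics are required for the clipped double-Q trick in SAC and TD3; sharing parameters between them would defeat it.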

build()

Create and return the SAC instance.

rlox.builders.DQNBuilder

Fluent builder for DQN algorithm.