Policies API¶
rlox.policies.DiscretePolicy
¶
Bases: Module
MLP actor-critic for discrete action spaces (e.g. CartPole).
Separate actor and critic networks sharing no parameters. The actor outputs logits for a Categorical distribution; the critic outputs a scalar value estimate.
Parameters¶
- obs_dim (int): Observation space dimensionality.
- n_actions (int): Number of discrete actions.
- hidden (int): Hidden layer width (default 64, matching CleanRL PPO).
Required interface methods (called by PPOLoss / RolloutCollector):
- get_action_and_logprob(obs) → (actions, log_probs)
- get_value(obs) → values
- get_logprob_and_entropy(obs, actions) → (log_probs, entropy)
get_action_value(obs: torch.Tensor)
¶
Combined action + value in a single call (saves one Python dispatch).
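To make the actor's contract concrete, here is a self-contained sketch (plain Python, not the library's code) of what "logits for a Categorical distribution" means: softmax the logits into probabilities, draw an action index, and return its log-probability.

```python
import math
import random

def categorical_sample(logits, rng=None):
    """Sample an action index from unnormalised logits and return
    (action, log_prob), mirroring what a discrete actor head produces."""
    rng = rng or random.Random(0)
    # Softmax with max-subtraction for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Inverse-CDF sampling over the categorical distribution.
    u, cum = rng.random(), 0.0
    for a, p in enumerate(probs):
        cum += p
        if u <= cum:
            return a, math.log(p)
    return len(probs) - 1, math.log(probs[-1])
```

In the real policy this is handled by a `Categorical` distribution over the actor's output tensor; the sketch only shows the underlying arithmetic.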
rlox.policies.ContinuousPolicy
¶
Bases: Module
MLP actor-critic for continuous action spaces (e.g. Pendulum).
The actor outputs a mean vector; a separate learnable log_std
parameter (state-independent) parameterises the diagonal Gaussian.
The critic is a separate MLP producing a scalar value estimate.
Both networks use orthogonal initialisation following the same convention as DiscretePolicy.
Parameters¶
- obs_dim (int): Observation space dimensionality.
- act_dim (int): Action space dimensionality.
- hidden (int): Hidden layer width (default 64).
get_action_value(obs: torch.Tensor)
¶
Combined action + value in a single call.
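The "state-independent log_std" design above can be illustrated with a minimal pure-Python sketch: every action dimension is sampled from an independent Gaussian whose mean comes from the network but whose log standard deviation is a shared learnable parameter.

```python
import math
import random

def gaussian_action(mean, log_std, rng=None):
    """Sample from a diagonal Gaussian N(mean, exp(log_std)^2) and return
    (action, total_log_prob). log_std is shared across all states."""
    rng = rng or random.Random(0)
    action, logp = [], 0.0
    for m, ls in zip(mean, log_std):
        std = math.exp(ls)
        a = rng.gauss(m, std)
        # log N(a | m, std) for one independent dimension.
        logp += -0.5 * ((a - m) / std) ** 2 - ls - 0.5 * math.log(2 * math.pi)
        action.append(a)
    return action, logp
```

In the actual policy this is a `Normal` distribution over tensors; the sketch just spells out the per-dimension log-density that gets summed for a diagonal covariance.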
Off-Policy Networks¶
rlox.networks.SquashedGaussianPolicy
¶
Bases: Module
Gaussian policy with tanh squashing for SAC.
Outputs actions in [-1, 1] with corrected log-probabilities.
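The "corrected log-probabilities" refer to the change-of-variables term for the tanh squashing: log π(a) = log μ(u) − Σ log(1 − tanh(u)²), applied per action dimension. A minimal sketch of that correction (with a small epsilon for numerical safety, as implementations commonly add):

```python
import math

def squashed_logprob(u, logp_u, eps=1e-6):
    """Squash a pre-tanh sample u into a = tanh(u) and apply the
    change-of-variables correction to its Gaussian log-prob logp_u."""
    a = [math.tanh(x) for x in u]
    correction = sum(math.log(1.0 - math.tanh(x) ** 2 + eps) for x in u)
    return a, logp_u - correction
```

Without this correction the entropy term in the SAC objective would be computed against the wrong density.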
rlox.networks.DeterministicPolicy
¶
Bases: Module
Deterministic policy for TD3.
rlox.networks.QNetwork
¶
Bases: Module
Twin Q-value network for SAC / TD3.
Takes (obs, action) concatenated as input, outputs scalar Q-value.
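The point of twin Q-networks is clipped double-Q learning: the TD target uses the minimum of the two estimates to curb overestimation bias. A sketch of that target computation (scalar form, hypothetical helper name):

```python
def clipped_double_q_target(q1, q2, reward, gamma, done):
    """TD target using the minimum of twin Q estimates, as in TD3/SAC."""
    q_min = min(q1, q2)
    return reward + gamma * (1.0 - done) * q_min
```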
rlox.networks.SimpleQNetwork
¶
Bases: Module
Standard DQN Q-network.
rlox.networks.DuelingQNetwork
¶
Bases: Module
Dueling DQN architecture: separate value and advantage streams.
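The two streams are recombined as Q(s, a) = V(s) + A(s, a) − mean_a A(s, a); subtracting the mean advantage is the standard identifiability fix. A sketch of the aggregation step only (the streams themselves are MLPs in the real network):

```python
def dueling_q(value, advantages):
    """Combine dueling streams: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```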
Exploration Strategies¶
rlox.exploration.GaussianNoise
¶
Additive Gaussian noise (used by TD3).
Parameters¶
- sigma (float): Standard deviation of the noise.
- clip (float or None): If set, clip noise to [-clip, clip].
- seed (int or None): Random seed for reproducibility.
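A pure-Python sketch of the mechanism (not the class itself): sample per-dimension Gaussian noise and, when clip is set, clamp the noise before adding it to the action, as TD3's target policy smoothing does.

```python
import random

def add_gaussian_noise(action, sigma, clip=None, rng=None):
    """Add N(0, sigma^2) noise to each action dimension, optionally
    clamping the noise (not the resulting action) to [-clip, clip]."""
    rng = rng or random.Random(0)
    noisy = []
    for a in action:
        n = rng.gauss(0.0, sigma)
        if clip is not None:
            n = max(-clip, min(clip, n))
        noisy.append(a + n)
    return noisy
```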
rlox.exploration.EpsilonGreedy
¶
Epsilon-greedy exploration (used by DQN).
Linearly decays epsilon from eps_start to eps_end over
decay_fraction of total training steps.
Parameters¶
- n_actions (int): Number of discrete actions.
- eps_start (float): Initial epsilon (default 1.0).
- eps_end (float): Final epsilon (default 0.05).
- decay_fraction (float): Fraction of total steps over which to decay (default 0.1).
- seed (int or None): Random seed.
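The linear schedule described above can be written as a small pure function (a sketch with a hypothetical name, using the documented defaults):

```python
def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.05,
               decay_fraction=0.1):
    """Linear schedule: epsilon falls from eps_start to eps_end over the
    first decay_fraction of training, then stays at eps_end."""
    decay_steps = decay_fraction * total_steps
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

For example, over 1000 total steps the default schedule reaches eps_end at step 100 and holds it thereafter.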
rlox.exploration.OUNoise
¶
Ornstein-Uhlenbeck noise for temporally correlated exploration.
Produces smooth, mean-reverting noise that is useful for physical control tasks where abrupt action changes are undesirable.
Parameters¶
- action_dim (int): Dimensionality of the action space.
- mu (float): Mean of the noise process (default 0.0).
- theta (float): Rate of mean reversion (default 0.15).
- sigma (float): Volatility (default 0.2).
- seed (int or None): Random seed.
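The discrete-time OU update is x ← x + theta · (mu − x) + sigma · N(0, 1) per dimension; theta pulls the state back toward mu while sigma injects noise, which is what makes successive samples correlated. A minimal sketch (not the library class):

```python
import random

class OUProcess:
    """Discrete-time Ornstein-Uhlenbeck process, one state per action dim."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = random.Random(seed)
        self.state = [mu] * action_dim

    def sample(self):
        # Mean-reverting drift toward mu plus Gaussian diffusion.
        self.state = [
            x + self.theta * (self.mu - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)
```

With sigma = 0 the process decays geometrically toward mu, which makes the mean-reversion easy to verify in isolation.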
Builders¶
rlox.builders.PPOBuilder
¶
rlox.builders.SACBuilder
¶
Fluent builder for the SAC algorithm.
Supports a custom actor, critic, exploration strategy, and all hyperparameters.
rlox.builders.DQNBuilder
¶
Fluent builder for the DQN algorithm.
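To illustrate the fluent pattern these builders follow (this is a hypothetical stand-in, not the actual rlox builder API): each setter returns `self`, so configuration calls chain and terminate in a single `build()`.

```python
class FluentBuilder:
    """Hypothetical illustration of a fluent builder: setters return
    self so calls chain, and build() produces the configured object."""

    def __init__(self):
        self.config = {}

    def set(self, key, value):
        self.config[key] = value
        return self  # returning self is what enables chaining

    def build(self):
        return dict(self.config)
```

Usage would look like `FluentBuilder().set("lr", 3e-4).set("gamma", 0.99).build()`; the real builders expose named setters rather than a generic `set`.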