PPOAgent class documentation

This class encapsulates a reinforcement learning agent for graph theory applications using the PyTorch-based Proximal Policy Optimization (PPO) method. The agent operates on a configurable environment given as a GraphEnvironment object.

In each iteration of the learning process, the agent generates a predetermined number of graphs by playing the graph building game defined by the environment, and computes the graph invariant values and the discounted returns for each episode run in parallel. While computing a discounted return, the reward at each step is taken to be the increase between two consecutive graph invariant values. The agent uses an actor-critic architecture: a torch.nn.Module model (the policy network) computes the probability of selecting each action in each step of every episode, and another torch.nn.Module model (the value network) estimates the value that quantifies the desirability of each state. Afterwards, the log probabilities and discounted returns of a subset of top-performing episodes are used to train both models according to the PPO algorithm, over a configurable number of epochs. This completes one iteration of the learning process.

The user provides both models, configures the optimizer, sets the discount factor, selects the number of epochs to be executed, configures the clamping epsilon coefficient and the value loss coefficient, and optionally provides a random action mechanism. When a random action occurs, it is selected uniformly among all actions available in the current state.
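
The following is a minimal sketch of one learning iteration as described above. It is illustrative only; the helper names _run_episodes, _discounted_returns, _select_elite and _ppo_update are hypothetical and are not part of this class's documented API.

def step(self):
    # Play the graph building game for _candidates_count episodes in parallel.
    states, actions, invariants = self._run_episodes()
    # Rewards are the increases between consecutive graph invariant values;
    # they are folded into discounted returns using _discount_factor.
    returns = self._discounted_returns(invariants)
    # Keep only the _elite_count top-performing episodes (or all of them when
    # _elite_count is None).
    elite = self._select_elite(returns)
    # Run _epochs_count PPO epochs over the selected trajectories, updating
    # both the policy network and the value network through _optimizer.
    for _ in range(self._epochs_count):
        self._ppo_update(states[:, elite], actions[:, elite], returns[:, elite])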

Method __init__ This constructor initializes an instance of the PPOAgent class.
Method reset This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.
Method step This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment. This iteration should update the agent's internal state and improve its policy or decision-making strategy based on the observed outcomes.
Property best_graph This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object. Otherwise, if no iterations have been executed or the agent has not been initialized, the value None must be returned.
Property best_score This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float. If the agent has been initialized but no iterations have yet been executed, the value −∞ must be returned. If the agent has not been initialized, the value None must be returned.
Property step_count This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int. If the agent has not yet been initialized, the value None must be returned.
Instance Variable _best_graph A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.
Instance Variable _best_score A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.
Instance Variable _candidates_count A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.
Instance Variable _clamp_epsilon A float from the interval [0, 1] representing the clamping epsilon coefficient used while computing the policy loss in each epoch of a learning iteration.
Instance Variable _device A torch.device object indicating the device where the models reside.
Instance Variable _discount_factor A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.
Instance Variable _elite_count A positive int specifying the number of top-performing episodes used to train both models in each iteration, or None if all the episodes should be used.
Instance Variable _environment A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.
Instance Variable _epochs_count A positive int specifying the number of epochs executed in each learning iteration of the PPO method.
Instance Variable _optimizer A torch.optim.Optimizer object that updates the parameters of both models.
Instance Variable _policy_network A torch.nn.Module object predicting the action probabilities for each step in each episode.
Instance Variable _population_actions Either None if uninitialized, or a numpy.ndarray of type numpy.int32 storing all actions during each episode trajectory. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the action trajectory within an episode and the second to the executed episodes. The episode order matches _population_states.
Instance Variable _population_returns Either None if uninitialized, or a numpy.ndarray of type numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the timestamps (actions) within an episode and the second to the executed episodes. The episode order matches _population_states.
Instance Variable _population_states Either None if uninitialized, or a numpy.ndarray storing all states during each episode trajectory. Its shape is (episode_length + 1, candidates_count, state_length), where episode_length is the episode length of the RL environment, state_length is the length of the state vectors, and candidates_count is the number of episodes executed in parallel. The first dimension corresponds to the state trajectory within an episode, the second to the executed episodes, and the third to the state vector entries.
Instance Variable _random_action_mechanism A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
Instance Variable _random_generator A numpy.random.Generator object used for all probabilistic decisions.
Instance Variable _step_count A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.
Instance Variable _value_loss_coef A positive float representing the coefficient that scales the value loss while computing the total loss.
Instance Variable _value_loss_function A function implementing the MSE loss used for training the value network.
Instance Variable _value_network A torch.nn.Module object predicting the state values for each step in each episode.
def __init__(self, environment: GraphEnvironment, policy_network: nn.Module, value_network: nn.Module, optimizer: torch.optim.Optimizer, candidates_count: int = 200, elite_count: int | None = None, discount_factor: float = 0.99, epochs_count: int = 4, clamp_epsilon: float = 0.2, value_loss_coef: float = 0.5, random_action_mechanism: RandomActionMechanism = NoRandomActionMechanism(), random_generator: np.random.Generator | None = None):

This constructor initializes an instance of the PPOAgent class.

Parameters
environment: GraphEnvironment
    The RL environment defining the extremal problem and providing the graph building game, given as a GraphEnvironment object.
policy_network: nn.Module
    The policy network used to compute the probability of each action in each episode and step, given as a torch.nn.Module object.
value_network: nn.Module
    The value network used to compute the state value of each episode and step, given as a torch.nn.Module object. The value and policy networks must reside on the same device.
optimizer: torch.optim.Optimizer
    The optimizer responsible for updating the parameters of both models, given as a torch.optim.Optimizer object. The parameters of both policy_network and value_network must be passed to it.
candidates_count: int
    A positive int specifying how many graphs are generated in each iteration by running the corresponding number of episodes in parallel. The default value is 200.
elite_count: int | None
    A positive int specifying how many episodes with the greatest graph invariant value are used to train the policy network and value network in each iteration of the learning process, or None to indicate that all executed episodes should be used. The default value is None.
discount_factor: float
    A float from the interval [0, 1] representing the discount factor to be used while computing the returns. The default value is 0.99.
epochs_count: int
    A positive int specifying the number of epochs executed in each learning iteration of the PPO method. The default value is 4.
clamp_epsilon: float
    A float from the interval [0, 1] representing the clamping epsilon coefficient used while computing the policy loss in each epoch of a learning iteration. The default value is 0.2.
value_loss_coef: float
    A positive float representing the coefficient that scales the value loss while computing the total loss. The default value is 0.5.
random_action_mechanism: RandomActionMechanism
    A RandomActionMechanism object that governs the probability of executing a random action in each step of the graph building game. When a random action is triggered, the agent ignores the action predicted by the policy network and instead selects an action uniformly at random among all available actions. By default, this is NoRandomActionMechanism(), meaning that no random actions are ever executed.
random_generator: np.random.Generator | None
    Either None, or a numpy.random.Generator used for probabilistic decisions. If None, a default generator will be used. The default value is None.
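
A minimal usage sketch follows. It assumes that env (a GraphEnvironment), policy_net and value_net (torch.nn.Module instances residing on the same device) are constructed elsewhere; the learning rate, elite_count value, number of iterations and random seed are illustrative choices, not recommendations.

import numpy as np
import torch

# env, policy_net and value_net are assumed to be built elsewhere.
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=1e-3)

agent = PPOAgent(
    environment=env,
    policy_network=policy_net,
    value_network=value_net,
    optimizer=optimizer,
    candidates_count=200,
    elite_count=50,
    discount_factor=0.99,
    epochs_count=4,
    clamp_epsilon=0.2,
    value_loss_coef=0.5,
    random_generator=np.random.default_rng(42),
)

agent.reset()
for _ in range(100):
    agent.step()

print(agent.step_count, agent.best_score)
best = agent.best_graph  # a Graph object attaining the best invariant value
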
def reset(self):

This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.

def step(self):

This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment. This iteration should update the agent's internal state and improve its policy or decision-making strategy based on the observed outcomes.
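
The sketch below illustrates what a single PPO epoch inside such an iteration could look like for this class. It assumes the policy network outputs per-step action logits and the value network outputs one value per state; the method name ppo_epoch and the old_log_probs argument (log probabilities recorded when the episodes were played) are hypothetical, and the actual implementation may differ.

def ppo_epoch(self, states, actions, returns, old_log_probs):
    # states: (T, N, state_length) float tensor, actions: (T, N) long tensor,
    # returns and old_log_probs: (T, N) float tensors.
    logits = self._policy_network(states)              # assumed (T, N, num_actions)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                 # (T, N)
    values = self._value_network(states).squeeze(-1)   # assumed (T, N)

    # Clipped surrogate objective with _clamp_epsilon.
    advantages = returns - values.detach()
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - self._clamp_epsilon, 1.0 + self._clamp_epsilon)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Value loss scaled by _value_loss_coef.
    value_loss = self._value_loss_function(values, returns)
    loss = policy_loss + self._value_loss_coef * value_loss

    self._optimizer.zero_grad()
    loss.backward()
    self._optimizer.step()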

best_graph: Graph | None =

This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object. Otherwise, if no iterations have been executed or the agent has not been initialized, the value None must be returned.

best_score: float | None =

This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float. If the agent has been initialized but no iterations have yet been executed, the value −∞ must be returned. If the agent has not been initialized, the value None must be returned.
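
As an illustration, a concrete subclass might satisfy this contract along the following lines, using the _step_count and _best_score attributes documented below (a sketch, not the prescribed implementation).

@property
def best_score(self) -> float | None:
    if self._step_count is None:      # agent has not been initialized
        return None
    if self._step_count == 0:         # initialized, but no iterations executed
        return float("-inf")
    return self._best_score           # best graph invariant value found so far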

step_count: int | None =

This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int. If the agent has not yet been initialized, the value None must be returned.

_best_graph: Graph | None =

A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.

_best_score: float | None =

A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.

_candidates_count: int =

A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.

_clamp_epsilon: float =

A float from the interval [0, 1] representing the clamping epsilon coefficient used while computing the policy loss in each epoch of a learning iteration.

_device: torch.device =

A torch.device object indicating the device where the models reside.

_discount_factor: float =

A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.

_elite_count: int | None =

A positive int specifying the number of top-performing episodes used to train both models in each iteration, or None if all the episodes should be used.

_environment: GraphEnvironment =

A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.

_epochs_count: int =

A positive int specifying the number of epochs executed in each learning iteration of the PPO method.

_optimizer: torch.optim.Optimizer =

A torch.optim.Optimizer object that updates the parameters of both models.

_policy_network: nn.Module =

A torch.nn.Module object predicting the action probabilities for each step in each episode.

_population_actions: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray of type numpy.int32 storing all actions during each episode trajectory. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the action trajectory within an episode and the second to the executed episodes. The episode order matches _population_states.

_population_returns: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray of type numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the timestamps (actions) within an episode and the second to the executed episodes. The episode order matches _population_states.
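
A sketch of how such returns could be computed is given below, assuming a hypothetical array invariants of shape (episode_length + 1, candidates_count) holding the graph invariant value after every state of each trajectory; each reward is the increase between two consecutive invariant values, accumulated backwards with the discount factor.

import numpy as np

def discounted_returns(invariants: np.ndarray, discount_factor: float) -> np.ndarray:
    # Per-step rewards: increases between consecutive invariant values.
    rewards = np.diff(invariants, axis=0)            # (episode_length, candidates_count)
    returns = np.zeros_like(rewards, dtype=np.float32)
    running = np.zeros(rewards.shape[1], dtype=np.float32)
    # Walk each trajectory backwards: G_t = r_t + gamma * G_{t+1}.
    for t in range(rewards.shape[0] - 1, -1, -1):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns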

_population_states: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray storing all states during each episode trajectory. Its shape is (episode_length + 1, candidates_count, state_length), where episode_length is the episode length of the RL environment, state_length is the length of the state vectors, and candidates_count is the number of episodes executed in parallel. The first dimension corresponds to the state trajectory within an episode, the second to the executed episodes, and the third to the state vector entries.

_random_action_mechanism: RandomActionMechanism =

A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
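
For illustration, the uniform fallback could look like the sketch below; the method name should_act_randomly on RandomActionMechanism, the available_actions mask and the state variable are hypothetical names used only for this example.

# Inside a single step of the graph building game (illustrative only).
if self._random_action_mechanism.should_act_randomly():           # hypothetical API
    legal = np.flatnonzero(available_actions)                      # indices of legal actions
    action = int(self._random_generator.choice(legal))             # uniform choice
else:
    probs = torch.softmax(self._policy_network(state), dim=-1)     # policy prediction
    action = int(torch.multinomial(probs, num_samples=1))          # sample from the policy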

_random_generator: np.random.Generator =

A numpy.random.Generator object used for all probabilistic decisions.

_step_count: int | None =

A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.

_value_loss_coef: float =

A positive float representing the coefficient that scales the value loss while computing the total loss.

_value_loss_function: Callable =

A function implementing the MSE loss used for training the value network.
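
For illustration, a standard PyTorch choice consistent with this description would be the following; whether the implementation uses exactly this object is an assumption.

import torch

value_loss_function = torch.nn.MSELoss()                        # mean squared error
# predicted_values and discounted_returns are float tensors of the same shape.
loss = value_loss_function(predicted_values, discounted_returns)  # scalar tensor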

_value_network: nn.Module =

A torch.nn.Module object predicting the state values for each step in each episode.