PPOAgent class documentation

This class encapsulates a reinforcement learning agent for graph theory applications using the PyTorch-based Proximal Policy Optimization (PPO) method. The agent operates on a configurable environment given as a GraphEnvironment object.

In each iteration of the learning process, the agent generates a predetermined number of graphs by playing the graph building game defined by the environment, and computes the graph invariant values and the discounted returns for each episode run in parallel. While computing a discounted return, the reward at each step is taken to be the increase between two consecutive graph invariant values. The agent uses an actor-critic architecture: a torch.nn.Module model (the policy network) computes the probability of selecting each action in each step of every episode, and another torch.nn.Module model (the value network) estimates the value that quantifies the desirability of each state. Afterwards, the log probabilities and discounted returns of a subset of top-performing episodes are used to train both models according to the PPO algorithm, over a configurable number of epochs. This completes one iteration of the learning process.

The user provides both models, configures the optimizer, sets the discount factor, selects the number of epochs to be executed, configures the clamping epsilon coefficient and the value loss coefficient, and optionally provides a random action mechanism. When a random action occurs, it is selected uniformly among all actions available in the current state.
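
The following is a minimal sketch of one learning iteration as described above. It is illustrative only; the helper names _run_episodes, _discounted_returns, _select_elite and _ppo_update are hypothetical and are not part of this class's documented API.

def step(self):
    # Play the graph building game for _candidates_count episodes in parallel.
    states, actions, invariants = self._run_episodes()
    # Rewards are the increases between consecutive graph invariant values;
    # they are folded into discounted returns using _discount_factor.
    returns = self._discounted_returns(invariants)
    # Keep only the _elite_count top-performing episodes (or all of them when
    # _elite_count is None).
    elite = self._select_elite(returns)
    # Run _epochs_count PPO epochs over the selected trajectories, updating
    # both the policy network and the value network through _optimizer.
    for _ in range(self._epochs_count):
        self._ppo_update(states[:, elite], actions[:, elite], returns[:, elite])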

Method __init__ This constructor initializes an instance of the PPOAgent class.
Method reset This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.
Method step This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment. This iteration should update the agent's internal state and improve its policy or decision-making strategy based on the observed outcomes.
Property best_graph This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object. Otherwise, if no iterations have been executed or the agent has not been initialized, the value None must be returned.
Property best_score This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float. If the agent has been initialized but no iterations have yet been executed, the value −∞ must be returned. If the agent has not been initialized, the value None must be returned.
Property step_count This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int. If the agent has not yet been initialized, the value None must be returned.
Instance Variable _best_graph A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.
Instance Variable _best_score A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.
Instance Variable _candidates_count A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.
Instance Variable _clamp_epsilon A float from the interval [0, 1] representing the clamping epsilon coefficient used while computing the policy loss in each epoch of a learning iteration.
Instance Variable _device A torch.device object indicating the device where the models reside.
Instance Variable _discount_factor A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.
Instance Variable _elite_count A positive int specifying the number of top-performing episodes used to train both models in each iteration, or None if all the episodes should be used.
Instance Variable _environment A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.
Instance Variable _epochs_count A positive int specifying the number of epochs executed in each learning iteration of the PPO method.
Instance Variable _optimizer A torch.optim.Optimizer object that updates the parameters of both models.
Instance Variable _policy_network A torch.nn.Module object predicting the action probabilities for each step in each episode.
Instance Variable _population_actions Either None if uninitialized, or a numpy.ndarray of type numpy.int32 storing all actions during each episode trajectory. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the action trajectory within an episode and the second to the executed episodes. The episode order matches _population_states.
Instance Variable _population_returns Either None if uninitialized, or a numpy.ndarray of type numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the timestamps (actions) within an episode and the second to the executed episodes. The episode order matches _population_states.
Instance Variable _population_states Either None if uninitialized, or a numpy.ndarray storing all states during each episode trajectory. Its shape is (episode_length + 1, candidates_count, state_length), where episode_length is the episode length of the RL environment, state_length is the length of the state vectors, and candidates_count is the number of episodes executed in parallel. The first dimension corresponds to the state trajectory within an episode, the second to the executed episodes, and the third to the state vector entries.
Instance Variable _random_action_mechanism A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
Instance Variable _random_generator A numpy.random.Generator object used for all probabilistic decisions.
Instance Variable _step_count A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.
Instance Variable _value_loss_coef A positive float representing the coefficient that scales the value loss while computing the total loss.
Instance Variable _value_loss_function A function implementing the MSE loss used for training the value network.
Instance Variable _value_network A torch.nn.Module object predicting the state values for each step in each episode.
def __init__(self, environment: GraphEnvironment, policy_network: nn.Module, value_network: nn.Module, optimizer: torch.optim.Optimizer, candidates_count: int = 200, elite_count: int | None = None, discount_factor: float = 0.99, epochs_count: int = 4, clamp_epsilon: float = 0.2, value_loss_coef: float = 0.5, random_action_mechanism: RandomActionMechanism = NoRandomActionMechanism(), random_generator: np.random.Generator | None = None):

This constructor initializes an instance of the PPOAgent class.

Parameters
environment: GraphEnvironment
    The RL environment defining the extremal problem and providing the graph building game, given as a GraphEnvironment object.
policy_network: nn.Module
    The policy network used to compute the probability of each action in each episode and step, given as a torch.nn.Module object.
value_network: nn.Module
    The value network used to compute the state value of each episode and step, given as a torch.nn.Module object. The value and policy networks must reside on the same device.
optimizer: torch.optim.Optimizer
    The optimizer responsible for updating the parameters of both models, given as a torch.optim.Optimizer object. The parameters of both policy_network and value_network must be passed to it.
candidates_count: int
    A positive int specifying how many graphs are generated in each iteration by running the corresponding number of episodes in parallel. The default value is 200.
elite_count: int | None
    A positive int specifying how many episodes with the greatest graph invariant value are used to train the policy network and value network in each iteration of the learning process, or None to indicate that all executed episodes should be used. The default value is None.
discount_factor: float
    A float from the interval [0, 1] representing the discount factor to be used while computing the returns. The default value is 0.99.
epochs_count: int
    A positive int specifying the number of epochs executed in each learning iteration of the PPO method. The default value is 4.
clamp_epsilon: float
    A float from the interval [0, 1] representing the clamping epsilon coefficient used while computing the policy loss in each epoch of a learning iteration. The default value is 0.2.
value_loss_coef: float
    A positive float representing the coefficient that scales the value loss while computing the total loss. The default value is 0.5.
random_action_mechanism: RandomActionMechanism
    A RandomActionMechanism object that governs the probability of executing a random action in each step of the graph building game. When a random action is triggered, the agent ignores the action predicted by the policy network and instead selects an action uniformly at random among all available actions. By default, this is NoRandomActionMechanism(), meaning that no random actions are ever executed.
random_generator: np.random.Generator | None
    Either None, or a numpy.random.Generator used for probabilistic decisions. If None, a default generator will be used. The default value is None.
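
A minimal usage sketch follows. It assumes that env (a GraphEnvironment), policy_net and value_net (torch.nn.Module instances residing on the same device) are constructed elsewhere; the learning rate, elite_count value, number of iterations and random seed are illustrative choices, not recommendations.

import numpy as np
import torch

# env, policy_net and value_net are assumed to be built elsewhere.
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=1e-3)

agent = PPOAgent(
    environment=env,
    policy_network=policy_net,
    value_network=value_net,
    optimizer=optimizer,
    candidates_count=200,
    elite_count=50,
    discount_factor=0.99,
    epochs_count=4,
    clamp_epsilon=0.2,
    value_loss_coef=0.5,
    random_generator=np.random.default_rng(42),
)

agent.reset()
for _ in range(100):
    agent.step()

print(agent.step_count, agent.best_score)
best = agent.best_graph  # a Graph object attaining the best invariant value
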
def reset(self):

This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.

def step(self):

This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment. This iteration should update the agent's internal state and improve its policy or decision-making strategy based on the observed outcomes.
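
The sketch below illustrates what a single PPO epoch inside such an iteration could look like for this class. It assumes the policy network outputs per-step action logits and the value network outputs one value per state; the method name ppo_epoch and the old_log_probs argument (log probabilities recorded when the episodes were played) are hypothetical, and the actual implementation may differ.

def ppo_epoch(self, states, actions, returns, old_log_probs):
    # states: (T, N, state_length) float tensor, actions: (T, N) long tensor,
    # returns and old_log_probs: (T, N) float tensors.
    logits = self._policy_network(states)              # assumed (T, N, num_actions)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                 # (T, N)
    values = self._value_network(states).squeeze(-1)   # assumed (T, N)

    # Clipped surrogate objective with _clamp_epsilon.
    advantages = returns - values.detach()
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - self._clamp_epsilon, 1.0 + self._clamp_epsilon)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Value loss scaled by _value_loss_coef.
    value_loss = self._value_loss_function(values, returns)
    loss = policy_loss + self._value_loss_coef * value_loss

    self._optimizer.zero_grad()
    loss.backward()
    self._optimizer.step()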

best_graph: Graph | None =

This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object. Otherwise, if no iterations have been executed or the agent has not been initialized, the value None must be returned.

best_score: float | None =

This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float. If the agent has been initialized but no iterations have yet been executed, the value −∞ must be returned. If the agent has not been initialized, the value None must be returned.
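
As an illustration, a concrete subclass might satisfy this contract along the following lines, using the _step_count and _best_score attributes documented below (a sketch, not the prescribed implementation).

@property
def best_score(self) -> float | None:
    if self._step_count is None:      # agent has not been initialized
        return None
    if self._step_count == 0:         # initialized, but no iterations executed
        return float("-inf")
    return self._best_score           # best graph invariant value found so far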

step_count: int | None =

This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int. If the agent has not yet been initialized, the value None must be returned.

_best_graph: Graph | None =

A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.

_best_score: float | None =

A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.

_candidates_count: int =

A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.

_clamp_epsilon: float =

A float from the interval [0, 1] representing the clamping epsilon coefficient used while computing the policy loss in each epoch of a learning iteration.

_device: torch.device =

A torch.device object indicating the device where the models reside.

_discount_factor: float =

A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.

_elite_count: int | None =

A positive int specifying the number of top-performing episodes used to train both models in each iteration, or None if all the episodes should be used.

_environment: GraphEnvironment =

A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.

_epochs_count: int =

A positive int specifying the number of epochs executed in each learning iteration of the PPO method.

_optimizer: torch.optim.Optimizer =

A torch.optim.Optimizer object that updates the parameters of both models.

_policy_network: nn.Module =

A torch.nn.Module object predicting the action probabilities for each step in each episode.

_population_actions: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray of type numpy.int32 storing all actions during each episode trajectory. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the action trajectory within an episode and the second to the executed episodes. The episode order matches _population_states.

_population_returns: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray of type numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the timestamps (actions) within an episode and the second to the executed episodes. The episode order matches _population_states.
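
A sketch of how such returns could be computed is given below, assuming a hypothetical array invariants of shape (episode_length + 1, candidates_count) holding the graph invariant value after every state of each trajectory; each reward is the increase between two consecutive invariant values, accumulated backwards with the discount factor.

import numpy as np

def discounted_returns(invariants: np.ndarray, discount_factor: float) -> np.ndarray:
    # Per-step rewards: increases between consecutive invariant values.
    rewards = np.diff(invariants, axis=0)            # (episode_length, candidates_count)
    returns = np.zeros_like(rewards, dtype=np.float32)
    running = np.zeros(rewards.shape[1], dtype=np.float32)
    # Walk each trajectory backwards: G_t = r_t + gamma * G_{t+1}.
    for t in range(rewards.shape[0] - 1, -1, -1):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns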

_population_states: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray storing all states during each episode trajectory. Its shape is (episode_length + 1, candidates_count, state_length), where episode_length is the episode length of the RL environment, state_length is the length of the state vectors, and candidates_count is the number of episodes executed in parallel. The first dimension corresponds to the state trajectory within an episode, the second to the executed episodes, and the third to the state vector entries.

_random_action_mechanism: RandomActionMechanism =

A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
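
For illustration, the uniform fallback could look like the sketch below; the method name should_act_randomly on RandomActionMechanism, the available_actions mask and the state variable are hypothetical names used only for this example.

# Inside a single step of the graph building game (illustrative only).
if self._random_action_mechanism.should_act_randomly():           # hypothetical API
    legal = np.flatnonzero(available_actions)                      # indices of legal actions
    action = int(self._random_generator.choice(legal))             # uniform choice
else:
    probs = torch.softmax(self._policy_network(state), dim=-1)     # policy prediction
    action = int(torch.multinomial(probs, num_samples=1))          # sample from the policy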

_random_generator: np.random.Generator =

A numpy.random.Generator object used for all probabilistic decisions.

_step_count: int | None =

A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.

_value_loss_coef: float =

A positive float representing the coefficient that scales the value loss while computing the total loss.

_value_loss_function: Callable =

A function implementing the MSE loss used for training the value network.
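
For illustration, a standard PyTorch choice consistent with this description would be the following; whether the implementation uses exactly this object is an assumption.

import torch

value_loss_function = torch.nn.MSELoss()                        # mean squared error
# predicted_values and discounted_returns are float tensors of the same shape.
loss = value_loss_function(predicted_values, discounted_returns)  # scalar tensor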

_value_network: nn.Module =

A torch.nn.Module object predicting the state values for each step in each episode.