ReinforceAgent class documentation

This class encapsulates a reinforcement learning agent for graph theory applications that uses the PyTorch-based REINFORCE method. The agent operates on a configurable environment given as a GraphEnvironment object. In each iteration of the learning process, the agent generates a predetermined number of graphs by playing the graph building game defined by the environment, with all episodes run in parallel, and computes the graph invariant value and the discounted returns for each episode. When computing a discounted return, the reward at each step is taken to be the increase between two consecutive graph invariant values. The agent uses a torch.nn.Module model to compute the probability of selecting each action at each step of every episode. Afterwards, the log probabilities and discounted returns of a subset of top-performing episodes are used to train the model according to the REINFORCE algorithm, which completes one iteration of the learning process. The user provides the model, configures the optimizer, sets the discount factor, decides whether to apply a baseline to reduce variance, and optionally provides a random action mechanism. When a random action occurs, it is selected uniformly among all actions available in the current state.
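
To make the reward and return convention concrete, the following sketch computes the discounted returns of a single episode from its sequence of graph invariant values; the function and variable names are illustrative and not part of the class's API.

import numpy as np

def discounted_returns(invariant_values, discount_factor):
    # Rewards are the increases between consecutive graph invariant values.
    rewards = np.diff(np.asarray(invariant_values, dtype=np.float32))
    # The return at step t is the discounted sum of rewards from step t onwards.
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns

# Invariant values observed after each action of one illustrative episode.
print(discounted_returns([0.0, 1.0, 1.5, 3.0], discount_factor=0.99))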

Method __init__ This constructor initializes an instance of the ReinforceAgent class.
Method reset This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.
Method step This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment.
Property best_graph This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object.
Property best_score This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float.
Property step_count This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int.
Instance Variable _apply_baseline A bool indicating whether a baseline should be applied to reduce variance. If True, the baseline is the mean return over all episodes, computed independently for each step.
Instance Variable _best_graph A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.
Instance Variable _best_score A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.
Instance Variable _candidates_count A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.
Instance Variable _device A torch.device object indicating the device where the model resides.
Instance Variable _discount_factor A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.
Instance Variable _elite_count A positive int specifying the number of top-performing episodes used to train the model in each iteration, or None if all the episodes should be used.
Instance Variable _environment A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.
Instance Variable _optimizer A torch.optim.Optimizer object that updates the model parameters.
Instance Variable _policy_network A torch.nn.Module object predicting the action probabilities for each step in each episode.
Instance Variable _population_returns Either None if uninitialized, or a numpy.ndarray of dtype numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where episode_length is the episode length of the RL environment and candidates_count is the number of episodes executed in parallel.
Instance Variable _random_action_mechanism A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
Instance Variable _random_generator A numpy.random.Generator object used for all probabilistic decisions.
Instance Variable _step_count A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.
def __init__(self, environment: GraphEnvironment, policy_network: nn.Module, optimizer: torch.optim.Optimizer, candidates_count: int = 200, elite_count: int | None = None, discount_factor: float = 0.99, apply_baseline: bool = True, random_action_mechanism: RandomActionMechanism = NoRandomActionMechanism(), random_generator: np.random.Generator | None = None):

This constructor initializes an instance of the ReinforceAgent class.

Parameters
environment (GraphEnvironment): The RL environment defining the extremal problem and providing the graph building game, given as a GraphEnvironment object.
policy_network (nn.Module): The policy network used to compute the probability of each action in each episode and step, given as a torch.nn.Module object.
optimizer (torch.optim.Optimizer): The optimizer responsible for updating the model parameters, given as a torch.optim.Optimizer object. The parameters of policy_network must be passed to it.
candidates_count (int): A positive int specifying how many graphs are generated in each iteration by running the corresponding number of episodes in parallel. The default value is 200.
elite_count (int | None): A positive int specifying how many episodes with the greatest graph invariant value are used to train the policy network in each iteration of the learning process, or None to indicate that all executed episodes should be used. The default value is None.
discount_factor (float): A float from the interval [0, 1] representing the discount factor to be used while computing the returns. The default value is 0.99.
apply_baseline (bool): A bool indicating whether a baseline should be applied to reduce variance. If True, the baseline is the mean return over all elite episodes, computed independently for each step. The default value is True.
random_action_mechanism (RandomActionMechanism): A RandomActionMechanism object that governs the probability of executing a random action in each step of the graph building game. When a random action is triggered, the agent ignores the action predicted by the policy network and instead selects an action uniformly at random among all available actions. By default, this is NoRandomActionMechanism(), meaning that no random actions are ever executed.
random_generator (np.random.Generator | None): Either None, or a numpy.random.Generator used for probabilistic decisions. If None, a default generator will be used. The default value is None.
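
A usage sketch of the constructor might look as follows, assuming a hypothetical concrete subclass named MyReinforceAgent and placeholder network dimensions; the GraphEnvironment construction depends on the concrete extremal problem and is only indicated.

import numpy as np
import torch
import torch.nn as nn

observation_size, action_count = 32, 2        # placeholder dimensions for the sketch

policy_network = nn.Sequential(               # any torch.nn.Module producing per-action scores
    nn.Linear(observation_size, 128),
    nn.ReLU(),
    nn.Linear(128, action_count),
)
optimizer = torch.optim.Adam(policy_network.parameters(), lr=1e-3)

environment = GraphEnvironment(...)           # constructed according to the concrete problem
agent = MyReinforceAgent(                     # hypothetical concrete subclass of ReinforceAgent
    environment=environment,
    policy_network=policy_network,
    optimizer=optimizer,
    candidates_count=200,
    elite_count=20,
    discount_factor=0.99,
    apply_baseline=True,
    random_generator=np.random.default_rng(42),
)
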
def reset(self):

This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.

def step(self):

This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment. This iteration should update the agent's internal state and improve its policy or decision-making strategy based on the observed outcomes.
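
For orientation only, one learning iteration in a concrete subclass could follow the outline below; the helpers _run_episodes and _select_elite are hypothetical, and log_probs and returns are assumed to be torch tensors of shape (episode_length, candidates_count) as described for _population_returns.

def step(self):
    # 1. Play the graph building game for all candidate episodes in parallel
    #    (hypothetical helper returning per-step log probabilities and discounted
    #    returns, plus the final invariant value of every episode).
    log_probs, returns, scores = self._run_episodes()

    # 2. Optional per-step baseline to reduce variance.
    if self._apply_baseline:
        returns = returns - returns.mean(dim=1, keepdim=True)

    # 3. Keep only the top-performing episodes when an elite count is configured.
    elite = self._select_elite(scores)                        # hypothetical helper
    log_probs, returns = log_probs[:, elite], returns[:, elite]

    # 4. REINFORCE policy-gradient update.
    loss = -(log_probs * returns).sum()
    self._optimizer.zero_grad()
    loss.backward()
    self._optimizer.step()

    # 5. Bookkeeping: update _best_graph, _best_score and the iteration counter.
    self._step_count += 1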

best_graph: Graph | None =

This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object. Otherwise, if no iterations have been executed or the agent has not been initialized, the value None must be returned.

best_score: float | None =

This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float. If the agent has been initialized but no iterations have yet been executed, the value −∞ must be returned. If the agent has not been initialized, the value None must be returned.

step_count: int | None =

This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int. If the agent has not yet been initialized, the value None must be returned.
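
Taken together, these members suggest a training loop of the following shape; agent stands for an instance of a hypothetical concrete subclass, and the stopping criteria are illustrative.

TARGET_SCORE = 10.0                             # illustrative target value of the invariant
MAX_ITERATIONS = 500

agent.reset()                                   # start (or restart) learning from scratch
while agent.step_count < MAX_ITERATIONS and agent.best_score < TARGET_SCORE:
    agent.step()                                # one full REINFORCE iteration

print("iterations executed:", agent.step_count)
print("best invariant value:", agent.best_score)
best = agent.best_graph                         # Graph object, or None if no iteration ran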

_apply_baseline: bool =

A bool indicating whether a baseline should be applied to reduce variance. If True, the baseline is the mean return over all episodes, computed independently for each step.

_best_graph: Graph | None =

A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.

_best_score: float | None =

A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.

_candidates_count: int =

A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.

_device: torch.device =

A torch.device object indicating the device where the model resides.

_discount_factor: float =

A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.

_elite_count: int | None =

A positive int specifying the number of top-performing episodes used to train the model in each iteration, or None if all the episodes should be used.
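
One plausible way to pick the top-performing episodes, assuming their final graph invariant values are collected in a numpy array, is sketched below; this is an illustration rather than the documented implementation.

import numpy as np

def select_elite_indices(scores, elite_count):
    # scores: one final graph invariant value per executed episode.
    scores = np.asarray(scores)
    if elite_count is None:                      # None means: use all episodes
        return np.arange(scores.shape[0])
    return np.argsort(scores)[-elite_count:]     # indices of the `elite_count` best episodes

print(select_elite_indices([1.5, 3.0, 2.0, 0.5], elite_count=2))   # -> [2 1]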

_environment: GraphEnvironment =

A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.

_optimizer: torch.optim.Optimizer =

A torch.optim.Optimizer object that updates the model parameters.

_policy_network: nn.Module =

A torch.nn.Module object predicting the action probabilities for each step in each episode.

_population_returns: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray of dtype numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where episode_length is the episode length of the RL environment and candidates_count is the number of episodes executed in parallel. The first dimension corresponds to the time steps (actions) within an episode, and the second corresponds to the executed episodes.
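
Under the layout described above, the returns array could be filled as in the following sketch, which also shows the per-step baseline mentioned for _apply_baseline; the reward matrix and the function name are illustrative assumptions, not part of the documented API.

import numpy as np

def population_returns(rewards, discount_factor, apply_baseline=False):
    # rewards has shape (episode_length, candidates_count): one reward per step and episode.
    rewards = np.asarray(rewards, dtype=np.float32)
    returns = np.zeros_like(rewards)
    running = np.zeros(rewards.shape[1], dtype=np.float32)
    for t in reversed(range(rewards.shape[0])):       # accumulate backwards in time
        running = rewards[t] + discount_factor * running
        returns[t] = running
    if apply_baseline:                                # subtract the per-step mean over episodes
        returns -= returns.mean(axis=1, keepdims=True)
    return returns

rewards = np.random.default_rng(0).normal(size=(10, 4)).astype(np.float32)
print(population_returns(rewards, 0.99).shape)        # (10, 4) = (episode_length, candidates_count)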

_random_action_mechanism: RandomActionMechanism =

A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
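
The way a random action might replace the policy's prediction can be sketched as follows; the probability value is assumed to be supplied by the RandomActionMechanism, whose exact interface is not specified in this documentation.

import numpy as np

def choose_action(predicted_action, available_actions, random_probability, rng):
    # With the probability supplied by the random action mechanism, ignore the
    # policy's prediction and pick uniformly among all currently available actions.
    if rng.random() < random_probability:
        return int(rng.choice(np.asarray(available_actions)))
    return predicted_action

rng = np.random.default_rng(7)
print(choose_action(predicted_action=1, available_actions=[0, 1, 2],
                    random_probability=0.1, rng=rng))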

_random_generator: np.random.Generator =

A numpy.random.Generator object used for all probabilistic decisions.

_step_count: int | None =

A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.