ReinforceAgent class documentation

This class encapsulates a reinforcement learning agent for graph theory applications that uses the PyTorch-based REINFORCE method. The agent operates on a configurable environment given as a GraphEnvironment object. In each iteration of the learning process, the agent generates a predetermined number of graphs by playing the graph building game defined by the environment, with all episodes run in parallel, and computes the graph invariant value and the discounted returns for each episode. When computing a discounted return, the reward at each step is taken to be the increase between two consecutive graph invariant values. The agent uses a torch.nn.Module model to compute the probability of selecting each action at each step of every episode. Afterwards, the log probabilities and discounted returns of a subset of top-performing episodes are used to train the model according to the REINFORCE algorithm, which completes one iteration of the learning process. The user provides the model, configures the optimizer, sets the discount factor, decides whether to apply a baseline to reduce variance, and optionally provides a random action mechanism. When a random action occurs, it is selected uniformly among all actions available in the current state.
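
To make the reward and return convention concrete, the following sketch computes the discounted returns of a single episode from its sequence of graph invariant values; the function and variable names are illustrative and not part of the class's API.

import numpy as np

def discounted_returns(invariant_values, discount_factor):
    # Rewards are the increases between consecutive graph invariant values.
    rewards = np.diff(np.asarray(invariant_values, dtype=np.float32))
    # The return at step t is the discounted sum of rewards from step t onwards.
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns

# Invariant values observed after each action of one illustrative episode.
print(discounted_returns([0.0, 1.0, 1.5, 3.0], discount_factor=0.99))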

Method __init__ This constructor initializes an instance of the ReinforceAgent class.
Method reset This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.
Method step This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment.
Property best_graph This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object.
Property best_score This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float.
Property step_count This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int.
Instance Variable _apply_baseline A bool indicating whether a baseline should be applied to reduce variance. If True, the baseline is the mean return over all episodes, computed independently for each step.
Instance Variable _best_graph A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.
Instance Variable _best_score A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.
Instance Variable _candidates_count A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.
Instance Variable _device A torch.device object indicating the device where the model resides.
Instance Variable _discount_factor A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.
Instance Variable _elite_count A positive int specifying the number of top-performing episodes used to train the model in each iteration, or None if all the episodes should be used.
Instance Variable _environment A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.
Instance Variable _optimizer A torch.optim.Optimizer object that updates the model parameters.
Instance Variable _policy_network A torch.nn.Module object predicting the action probabilities for each step in each episode.
Instance Variable _population_returns Either None if uninitialized, or a numpy.ndarray of dtype numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where episode_length is the episode length of the RL environment and candidates_count is the number of episodes executed in parallel.
Instance Variable _random_action_mechanism A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
Instance Variable _random_generator A numpy.random.Generator object used for all probabilistic decisions.
Instance Variable _step_count A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.
def __init__(self, environment: GraphEnvironment, policy_network: nn.Module, optimizer: torch.optim.Optimizer, candidates_count: int = 200, elite_count: int | None = None, discount_factor: float = 0.99, apply_baseline: bool = True, random_action_mechanism: RandomActionMechanism = NoRandomActionMechanism(), random_generator: np.random.Generator | None = None):

This constructor initializes an instance of the ReinforceAgent class.

Parameters
environment (GraphEnvironment): The RL environment defining the extremal problem and providing the graph building game, given as a GraphEnvironment object.
policy_network (nn.Module): The policy network used to compute the probability of each action in each episode and step, given as a torch.nn.Module object.
optimizer (torch.optim.Optimizer): The optimizer responsible for updating the model parameters, given as a torch.optim.Optimizer object. The parameters of policy_network must be passed to it.
candidates_count (int): A positive int specifying how many graphs are generated in each iteration by running the corresponding number of episodes in parallel. The default value is 200.
elite_count (int | None): A positive int specifying how many episodes with the greatest graph invariant value are used to train the policy network in each iteration of the learning process, or None to indicate that all executed episodes should be used. The default value is None.
discount_factor (float): A float from the interval [0, 1] representing the discount factor to be used while computing the returns. The default value is 0.99.
apply_baseline (bool): A bool indicating whether a baseline should be applied to reduce variance. If True, the baseline is the mean return over all elite episodes, computed independently for each step. The default value is True.
random_action_mechanism (RandomActionMechanism): A RandomActionMechanism object that governs the probability of executing a random action in each step of the graph building game. When a random action is triggered, the agent ignores the action predicted by the policy network and instead selects an action uniformly at random among all available actions. By default, this is NoRandomActionMechanism(), meaning that no random actions are ever executed.
random_generator (np.random.Generator | None): Either None, or a numpy.random.Generator used for probabilistic decisions. If None, a default generator will be used. The default value is None.
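
A usage sketch of the constructor might look as follows, assuming a hypothetical concrete subclass named MyReinforceAgent and placeholder network dimensions; the GraphEnvironment construction depends on the concrete extremal problem and is only indicated.

import numpy as np
import torch
import torch.nn as nn

observation_size, action_count = 32, 2        # placeholder dimensions for the sketch

policy_network = nn.Sequential(               # any torch.nn.Module producing per-action scores
    nn.Linear(observation_size, 128),
    nn.ReLU(),
    nn.Linear(128, action_count),
)
optimizer = torch.optim.Adam(policy_network.parameters(), lr=1e-3)

environment = GraphEnvironment(...)           # constructed according to the concrete problem
agent = MyReinforceAgent(                     # hypothetical concrete subclass of ReinforceAgent
    environment=environment,
    policy_network=policy_network,
    optimizer=optimizer,
    candidates_count=200,
    elite_count=20,
    discount_factor=0.99,
    apply_baseline=True,
    random_generator=np.random.default_rng(42),
)
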
def reset(self):

This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch.

def step(self):

This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment. This iteration should update the agent's internal state and improve its policy or decision-making strategy based on the observed outcomes.
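
For orientation only, one learning iteration in a concrete subclass could follow the outline below; the helpers _run_episodes and _select_elite are hypothetical, and log_probs and returns are assumed to be torch tensors of shape (episode_length, candidates_count) as described for _population_returns.

def step(self):
    # 1. Play the graph building game for all candidate episodes in parallel
    #    (hypothetical helper returning per-step log probabilities and discounted
    #    returns, plus the final invariant value of every episode).
    log_probs, returns, scores = self._run_episodes()

    # 2. Optional per-step baseline to reduce variance.
    if self._apply_baseline:
        returns = returns - returns.mean(dim=1, keepdim=True)

    # 3. Keep only the top-performing episodes when an elite count is configured.
    elite = self._select_elite(scores)                        # hypothetical helper
    log_probs, returns = log_probs[:, elite], returns[:, elite]

    # 4. REINFORCE policy-gradient update.
    loss = -(log_probs * returns).sum()
    self._optimizer.zero_grad()
    loss.backward()
    self._optimizer.step()

    # 5. Bookkeeping: update _best_graph, _best_score and the iteration counter.
    self._step_count += 1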

best_graph: Graph | None =

This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object. Otherwise, if no iterations have been executed or the agent has not been initialized, the value None must be returned.

best_score: float | None =

This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float. If the agent has been initialized but no iterations have yet been executed, the value −∞ must be returned. If the agent has not been initialized, the value None must be returned.

step_count: int | None =

This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int. If the agent has not yet been initialized, the value None must be returned.
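
Taken together, these members suggest a training loop of the following shape; agent stands for an instance of a hypothetical concrete subclass, and the stopping criteria are illustrative.

TARGET_SCORE = 10.0                             # illustrative target value of the invariant
MAX_ITERATIONS = 500

agent.reset()                                   # start (or restart) learning from scratch
while agent.step_count < MAX_ITERATIONS and agent.best_score < TARGET_SCORE:
    agent.step()                                # one full REINFORCE iteration

print("iterations executed:", agent.step_count)
print("best invariant value:", agent.best_score)
best = agent.best_graph                         # Graph object, or None if no iteration ran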

_apply_baseline: bool =

A bool indicating whether a baseline should be applied to reduce variance. If True, the baseline is the mean return over all episodes, computed independently for each step.

_best_graph: Graph | None =

A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed.

_best_score: float | None =

A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized.

_candidates_count: int =

A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel.

_device: torch.device =

A torch.device object indicating the device where the model resides.

_discount_factor: float =

A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns.

_elite_count: int | None =

A positive int specifying the number of top-performing episodes used to train the model in each iteration, or None if all the episodes should be used.
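
One plausible way to pick the top-performing episodes, assuming their final graph invariant values are collected in a numpy array, is sketched below; this is an illustration rather than the documented implementation.

import numpy as np

def select_elite_indices(scores, elite_count):
    # scores: one final graph invariant value per executed episode.
    scores = np.asarray(scores)
    if elite_count is None:                      # None means: use all episodes
        return np.arange(scores.shape[0])
    return np.argsort(scores)[-elite_count:]     # indices of the `elite_count` best episodes

print(select_elite_indices([1.5, 3.0, 2.0, 0.5], elite_count=2))   # -> [2 1]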

_environment: GraphEnvironment =

A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs.

_optimizer: torch.optim.Optimizer =

A torch.optim.Optimizer object that updates the model parameters.

_policy_network: nn.Module =

A torch.nn.Module object predicting the action probabilities for each step in each episode.

_population_returns: np.ndarray | None =

Either None if uninitialized, or a numpy.ndarray of dtype numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where episode_length is the episode length of the RL environment and candidates_count is the number of episodes executed in parallel. The first dimension corresponds to the time steps (actions) within an episode, and the second corresponds to the executed episodes.
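
Under the layout described above, the returns array could be filled as in the following sketch, which also shows the per-step baseline mentioned for _apply_baseline; the reward matrix and the function name are illustrative assumptions, not part of the documented API.

import numpy as np

def population_returns(rewards, discount_factor, apply_baseline=False):
    # rewards has shape (episode_length, candidates_count): one reward per step and episode.
    rewards = np.asarray(rewards, dtype=np.float32)
    returns = np.zeros_like(rewards)
    running = np.zeros(rewards.shape[1], dtype=np.float32)
    for t in reversed(range(rewards.shape[0])):       # accumulate backwards in time
        running = rewards[t] + discount_factor * running
        returns[t] = running
    if apply_baseline:                                # subtract the per-step mean over episodes
        returns -= returns.mean(axis=1, keepdims=True)
    return returns

rewards = np.random.default_rng(0).normal(size=(10, 4)).astype(np.float32)
print(population_returns(rewards, 0.99).shape)        # (10, 4) = (episode_length, candidates_count)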

_random_action_mechanism: RandomActionMechanism =

A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state.
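
The way a random action might replace the policy's prediction can be sketched as follows; the probability value is assumed to be supplied by the RandomActionMechanism, whose exact interface is not specified in this documentation.

import numpy as np

def choose_action(predicted_action, available_actions, random_probability, rng):
    # With the probability supplied by the random action mechanism, ignore the
    # policy's prediction and pick uniformly among all currently available actions.
    if rng.random() < random_probability:
        return int(rng.choice(np.asarray(available_actions)))
    return predicted_action

rng = np.random.default_rng(7)
print(choose_action(predicted_action=1, available_actions=[0, 1, 2],
                    random_probability=0.1, rng=rng))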

_random_generator: np.random.Generator =

A numpy.random.Generator object used for all probabilistic decisions.

_step_count: int | None =

A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized.