class PPOAgent(GraphAgent):
Constructor: PPOAgent(environment, policy_network, value_network, optimizer, ...)
This class encapsulates a reinforcement learning agent for graph theory applications using the
PyTorch-based Proximal Policy Optimization (PPO) method. The agent operates on a
configurable environment given as a GraphEnvironment object. In each iteration of the
learning process, the agent generates a predetermined number of graphs by playing the graph
building game defined by the environment and computes the graph invariant values and all
discounted returns for each episode run in parallel. When computing a discounted return, each
reward is taken to be the increase between two consecutive graph invariant values (see the
sketch following this description). The agent uses an actor-critic architecture, with one
torch.nn.Module model (the policy network) used to compute the probability of selecting each
action in each step of every episode, and another torch.nn.Module model (the value network)
used to estimate the value that quantifies the desirability of each state. Afterwards, the log
probabilities and discounted returns of a
subset of top-performing episodes are used to train both models according to the PPO algorithm.
The training is performed over a configurable number of epochs. This completes one iteration of
the learning process. The user provides both models, configures the optimizer, sets the
discount factor, selects the number of epochs to be executed, configures the clamping epsilon
coefficient and the value loss coefficient, and optionally provides a random action mechanism.
When a random action occurs, it is selected uniformly among all actions available in the
current state.
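To make the reward convention concrete, the following is a minimal sketch of how the rewards and discounted returns of a single episode can be derived from its sequence of graph invariant values. It is purely illustrative: the function and argument names below are not part of the PPOAgent API, and the agent computes these quantities internally for all episodes in parallel.

```python
import numpy as np

def discounted_returns(invariant_values: np.ndarray, discount_factor: float) -> np.ndarray:
    """Illustrative only: invariant_values holds the graph invariant value of every
    state visited in one episode, so it has shape (episode_length + 1,)."""
    # Each reward is the increase between two consecutive graph invariant values.
    rewards = np.diff(invariant_values).astype(np.float32)

    # Discounted return at step t: G_t = r_t + discount_factor * G_{t+1}.
    returns = np.empty_like(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    return returns

# Example: invariant values 0, 1, 1, 3 observed along a four-state episode.
print(discounted_returns(np.array([0.0, 1.0, 1.0, 3.0]), discount_factor=0.99))
```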
| Method | __init__ |
This constructor initializes an instance of the PPOAgent class. |
| Method | reset |
This abstract method must be implemented by any concrete subclass. It must initialize the agent and prepare it to begin the learning process. If the agent has been used previously, invoking this method must reset all internal state so that the learning restarts from scratch. |
| Method | step |
This abstract method must be implemented by any concrete subclass. It must perform a single iteration of the learning process, which may involve one or more interactions between the agent and the environment. This iteration should update the agent's internal state and improve its policy or decision-making strategy based on the observed outcomes. |
| Property | best |
This abstract property must be implemented by any concrete subclass. It must return a graph attaining the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the result must be returned as a Graph object. Otherwise, if no iterations have been executed or the agent has not been initialized, the value None must be returned. |
| Property | best |
This abstract property must be implemented by any concrete subclass. It must return the best value of the target graph invariant achieved so far. If at least one learning iteration has been executed, the value is returned as a float. If the agent has been initialized but no iterations have yet been executed, the value −∞ must be returned. If the agent has not been initialized, the value None must be returned. |
| Property | step |
This abstract property must be implemented by any concrete subclass. It must return the number of learning iterations executed so far. If the agent has been initialized, the returned value must be a nonnegative int. If the agent has not been initialized, the value None must be returned. |
| Instance Variable | _best |
A Graph object representing a graph attaining the best achieved value for the graph invariant, or None if the agent has not been initialized or no iterations have been executed. |
| Instance Variable | _best |
A float representing the best achieved value for the graph invariant, or None if the agent has not been initialized. |
| Instance Variable | _candidates |
A positive int specifying the number of graphs constructed per iteration, i.e., the number of episodes run in parallel. |
| Instance Variable | _clamp |
A float from the interval [0, 1] representing the clamping epsilon coefficient used while computing the policy loss in each epoch of a learning iteration (see the loss sketch after this table).
| Instance Variable | _device |
A torch.device object indicating the device where the models reside. |
| Instance Variable | _discount |
A float from the interval [0, 1] representing the discount factor to be used while computing the discounted returns. |
| Instance Variable | _elite |
A positive int specifying the number of top-performing episodes used to train both models in each iteration, or None if all the episodes should be used.
| Instance Variable | _environment |
A GraphEnvironment object defining the extremal problem and providing the graph building game used to construct all the graphs. |
| Instance Variable | _epochs |
A positive int specifying the number of epochs executed in each learning iteration of the PPO method. |
| Instance Variable | _optimizer |
A torch.optim.Optimizer object that updates the parameters of both models. |
| Instance Variable | _policy |
A torch.nn.Module object predicting the action probabilities for each step in each episode. |
| Instance Variable | _population |
Either None if uninitialized, or a numpy.ndarray of type numpy.int32 storing all actions during each episode trajectory. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the action trajectory within an episode and the second to the executed episodes. The episode order matches _population_states.
| Instance Variable | _population |
Either None if uninitialized, or a numpy.ndarray of type numpy.float32 storing the discounted returns for all executed episodes. Its shape is (episode_length, candidates_count), where the first dimension corresponds to the timestamps (actions) within an episode and the second to the executed episodes. The episode order matches _population_states.
| Instance Variable | _population_states |
Either None if uninitialized, or a numpy.ndarray storing all states during each episode trajectory. Its shape is (episode_length + 1, candidates_count, state_length), where episode_length is the episode length of the RL environment, state_length is the length of the state vectors, and candidates_count is the number of episodes executed in parallel. The first dimension corresponds to the state trajectory within an episode, the second to the executed episodes, and the third to the state vector entries.
| Instance Variable | _random |
A RandomActionMechanism object that determines the probability of executing a random action. When a random action is selected, it is sampled uniformly among all available actions in the current state. |
| Instance Variable | _random |
A numpy.random.Generator object used for all probabilistic decisions. |
| Instance Variable | _step |
A nonnegative int representing the number of executed iterations, or None if the agent has not been initialized. |
| Instance Variable | _value |
A positive float representing the coefficient that scales the value loss while computing the total loss. |
| Instance Variable | _value |
A function implementing the MSE loss used for training the value network. |
| Instance Variable | _value |
A torch.nn.Module object predicting the state values for each step in each episode. |
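The clamping epsilon coefficient and the value loss coefficient listed above enter the per-epoch training loss in the standard PPO fashion. The function below is a minimal sketch of that loss under common PPO conventions, not the class's internal implementation; all tensor arguments are hypothetical stand-ins for the log probabilities, discounted returns, and value estimates gathered during an iteration.

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs: torch.Tensor,
             old_log_probs: torch.Tensor,
             returns: torch.Tensor,
             values: torch.Tensor,
             clamp_epsilon: float = 0.2,
             value_loss_coef: float = 0.5) -> torch.Tensor:
    # Advantage estimate: discounted return minus the value network's prediction.
    advantages = returns - values.detach()

    # Probability ratio between the current policy and the policy that generated the episodes.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())

    # Clipped surrogate objective (negated, since the optimizer minimizes).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clamp_epsilon, 1.0 + clamp_epsilon) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # MSE value loss, scaled by the value loss coefficient in the total loss.
    value_loss = F.mse_loss(values, returns)
    return policy_loss + value_loss_coef * value_loss
```

Clamping the probability ratio to [1 − clamp_epsilon, 1 + clamp_epsilon] keeps each epoch's update close to the policy that generated the episodes, while value_loss_coef balances the critic's regression loss against the clipped policy loss.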
def __init__(self, environment: GraphEnvironment, policy_network: nn.Module, value_network: nn.Module, optimizer: torch.optim.Optimizer, candidates_count: int = 200, elite_count: int | None = None, discount_factor: float = 0.99, epochs_count: int = 4, clamp_epsilon: float = 0.2, value_loss_coef: float = 0.5, random_action_mechanism: RandomActionMechanism = NoRandomActionMechanism(), random_generator: np.random.Generator | None = None):
This constructor initializes an instance of the PPOAgent class. An illustrative construction sketch is given at the end of this page, after the parameter descriptions.
| Parameters | |
environment:GraphEnvironment | The RL environment defining the extremal problem and providing the
graph building game, given as a GraphEnvironment object. |
policy_network:nn.Module | The policy network used to compute the probability of each action in
each episode and step, given as a torch.nn.Module object. |
value_network:nn.Module | The value network used to compute the state value of each episode and
step, given as a torch.nn.Module object. The value and policy networks must reside
on the same device. |
optimizer:torch.optim.Optimizer | The optimizer responsible for updating the parameters of both models,
given as a torch.optim.Optimizer object. The parameters of both policy_network
and value_network must be passed to it. |
candidates_count:int | A positive int specifying how many graphs are generated in each
iteration by running the corresponding number of episodes in parallel. The default
value is 200. |
elite_count:int | None | A positive int specifying how many episodes with the greatest graph
invariant value are used to train the policy network and value network in each
iteration of the learning process, or None to indicate that all executed episodes
should be used. The default value is None. |
discount_factor:float | A float from the interval [0, 1] representing the discount factor
to be used while computing the returns. The default value is 0.99. |
epochs_count:int | A positive int specifying the number of epochs executed in each
learning iteration of the PPO method. The default value is 4. |
clamp_epsilon:float | A positive float from the interval [0, 1] representing the clamping
epsilon coefficient used while computing the policy loss in each epoch of a learning
iteration. The default value is 0.2. |
value_loss_coef:float | A positive float representing the coefficient that scales the
value loss while computing the total loss. The default value is 0.5. |
random_action_mechanism:RandomActionMechanism | A RandomActionMechanism object that governs the
probability of executing a random action in each step of the graph building game. When
a random action is triggered, the agent ignores the action predicted by the policy
network and instead selects an action uniformly at random among all available actions.
By default, this is NoRandomActionMechanism(), meaning that no random actions are
ever executed. |
random_generator:np.random.Generator | None | Either None, or a numpy.random.Generator used for
probabilistic decisions. If None, a default generator will be used. The default value
is None. |
The reset and step contracts are defined by the abstract methods rlgt.agents.graph_agent.GraphAgent.reset and rlgt.agents.graph_agent.GraphAgent.step, whose descriptions are given in the member table above.
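A construction sketch tying the constructor parameters together is given below. Everything outside the documented constructor call is an assumption made for illustration: the import paths, the concrete GraphEnvironment subclass, the state and action sizes, and the two small networks are placeholders, not part of the documented API.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical import paths; adjust to wherever PPOAgent and your environment live.
from rlgt.agents.ppo_agent import PPOAgent
from my_project.environments import MyGraphEnvironment  # any concrete GraphEnvironment

environment = MyGraphEnvironment()
state_length = 36    # must match the length of the environment's state vectors (example value)
actions_count = 2    # must match the number of actions offered per step (example value)

# Small illustrative actor and critic; any pair of torch.nn.Module models on the same device works.
policy_network = nn.Sequential(nn.Linear(state_length, 64), nn.ReLU(), nn.Linear(64, actions_count))
value_network = nn.Sequential(nn.Linear(state_length, 64), nn.ReLU(), nn.Linear(64, 1))

# A single optimizer must receive the parameters of both models.
optimizer = torch.optim.Adam(
    list(policy_network.parameters()) + list(value_network.parameters()), lr=1e-3)

agent = PPOAgent(environment, policy_network, value_network, optimizer,
                 candidates_count=200, elite_count=50, discount_factor=0.99,
                 epochs_count=4, clamp_epsilon=0.2, value_loss_coef=0.5,
                 random_generator=np.random.default_rng(42))

agent.reset()
for _ in range(100):   # each call to step() performs one full PPO learning iteration
    agent.step()
# The best graph found so far and its invariant value are then available through the
# best-graph and best-value properties listed in the member table above.
```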