
This wiki is merely the author's notebook rather than the "ultimate truth". Suggestions, remarks, and corrections raised as an Issue are more than welcome!

Features of the model

The model demonstrates, among other things, the following (a minimal sketch of the core mechanics follows the list):

  • The environment is a bounded grid world and the agents move within a Von Neumann neighborhood.
  • At most one agent of the same type per cell
  • Three agent types: Predator, Prey and Grass
  • Two learning agent types: Predator and Prey, learning to move in a Von Neumann neighborhood
  • Learning agents have partial observations of the model state; in the baseline configuration, Prey can see farther than Predators
  • Predators and Prey learn behavior that helps them avoid being eaten or starving to death
  • Predators and Prey lose energy due to movement and homeostasis
  • Grass gains energy due to photosynthesis
  • Dynamically removing agents from the grid when eaten (Prey and Grass) or starving to death (Predator and Prey)
  • Dynamically adding agents to the grid when Predators or Prey gain sufficient energy to reproduce asexually
  • Optionally, a rectangular spawning area per agent type can be specified and narrowed down within the grid world
  • Part of the energy of the parent is transferred to the child if reproduction occurs
  • Grass is removed from the grid world after being eaten by Prey, but regrows at the same spot after a certain number of steps
  • Episode ends when either all Predators or all Prey are dead
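
To make these mechanics concrete, the sketch below shows a single Von Neumann move with the associated energy bookkeeping. It is a simplified illustration, not the environment's actual code; all names and constants (`Agent`, `MOVE_ENERGY_COST`, `REPRODUCTION_THRESHOLD`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Von Neumann neighborhood: stay, up, down, left, right.
VON_NEUMANN_MOVES = [(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0)]

# Hypothetical constants; the real values are configuration parameters of the environment.
MOVE_ENERGY_COST = 0.1         # energy lost per step through movement and homeostasis
PHOTOSYNTHESIS_GAIN = 0.2      # energy gained by Grass per step
REPRODUCTION_THRESHOLD = 10.0  # energy needed for asexual reproduction


@dataclass
class Agent:
    x: int
    y: int
    energy: float
    alive: bool = True


def move_learning_agent(agent: Agent, move_index: int, grid_size: int) -> None:
    """Move a Predator or Prey inside the bounded grid and pay the energy cost."""
    dx, dy = VON_NEUMANN_MOVES[move_index]
    # Clamp to the grid boundary: the world is bounded, not toroidal.
    agent.x = min(max(agent.x + dx, 0), grid_size - 1)
    agent.y = min(max(agent.y + dy, 0), grid_size - 1)
    agent.energy -= MOVE_ENERGY_COST
    if agent.energy <= 0:
        agent.alive = False  # starved; the environment removes it from the grid


def grow_grass(grass: Agent) -> None:
    """Grass gains energy through photosynthesis while it is on the grid."""
    if grass.alive:
        grass.energy += PHOTOSYNTHESIS_GAIN


def maybe_reproduce(parent: Agent) -> Optional[Agent]:
    """Asexual reproduction: part of the parent's energy is transferred to the child."""
    if parent.alive and parent.energy >= REPRODUCTION_THRESHOLD:
        child_energy = parent.energy / 2
        parent.energy -= child_energy
        return Agent(parent.x, parent.y, child_energy)
    return None
```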

PettingZoo's last() function and the handling of rewards by the PPO algorithm

In the case of environments following the Farama Gymnasium interface, which is a common standard, the step method is used to advance the environment's state. The step method takes an action as input and returns five values: the new observation (state), the reward, a boolean indicating whether the agent terminated, a boolean indicating whether the episode was truncated, and an info dictionary with additional information. The order of these return values is fixed, so the agent knows that the second value is the reward.

Here's a simplified example:

```python
observation, reward, terminated, truncated, info = env.step(action)
```

In this line of code, reward is the reward for the action taken. The variable name doesn't matter; what matters is the position of the returned value in the tuple.

So, the PPO algorithm (or any reinforcement learning algorithm) doesn't need to know the variable name in the environment that represents the reward. It just needs to know the structure of the data returned by the environment's step method.
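
Since this section's title refers to PettingZoo's last() function: in an AEC environment the same positional convention applies, except that the tuple is obtained from env.last() rather than returned by step(). A minimal interaction loop, in which `policy` is a placeholder for whatever maps observations to actions (for example a trained PPO policy), could look like this:

```python
# Minimal PettingZoo AEC interaction loop; `policy` is a hypothetical callable.
env.reset(seed=42)
for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # PettingZoo expects None for agents that are done
    else:
        action = policy(observation, agent)
    env.step(action)
env.close()
```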

PPO in MARL

Proximal Policy Optimization (PPO) is the algorithm used to train the learning agents (Predators and Prey).

Design restrictions and workarounds for the environment and the PPO algorithm

The AEC Environment Architecture

Due to unexpected behavior when agents terminate during a simulation in a PettingZoo AEC environment (https://github.com/Farama-Foundation/PettingZoo/issues/713), we modified the architecture. The `AECEnv.agents` array remains unchanged after agent death or creation. The removal of agents is instead managed by `PredPreyGrass.predator_instance_list` and `PredPreyGrass.prey_instance_list`. The active status of an agent is additionally tracked by its boolean attribute `alive`. Optionally, a number of agents have `alive` set to False at reset, which leaves room for the creation of agents at run time. When such agents are (re)activated, they are added back to `PredPreyGrass.predator_instance_list` or `PredPreyGrass.prey_instance_list`.
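
A simplified sketch of this bookkeeping is shown below. It is not the actual implementation; every name other than `agents`, the two instance lists, and the `alive` attribute mentioned above is illustrative.

```python
class PredPreyGrassSketch:
    """Simplified sketch of the fixed-roster bookkeeping; not the actual implementation."""

    def __init__(self, n_possible_predators: int, n_possible_prey: int):
        self.n_possible_predators = n_possible_predators
        self.n_possible_prey = n_possible_prey
        self.agents: list[str] = []
        self.predator_instance_list: list = []
        self.prey_instance_list: list = []

    def reset(self):
        # The PettingZoo `agents` roster is created once and never shrinks or grows.
        self.agents = (
            [f"predator_{i}" for i in range(self.n_possible_predators)]
            + [f"prey_{i}" for i in range(self.n_possible_prey)]
        )
        # Which agents are actually on the grid is tracked only by the instance
        # lists; agents that start with alive=False leave room for later births.
        self.predator_instance_list = []
        self.prey_instance_list = []

    def deactivate(self, instance):
        """Agent is eaten or starves: flag it inactive instead of removing it from `agents`."""
        instance.alive = False
        if instance in self.predator_instance_list:
            self.predator_instance_list.remove(instance)
        elif instance in self.prey_instance_list:
            self.prey_instance_list.remove(instance)

    def activate(self, instance):
        """A new agent is born: re-use an inactive roster slot."""
        instance.alive = True
        target = (
            self.predator_instance_list
            if instance.agent_type == "predator"
            else self.prey_instance_list
        )
        target.append(instance)
```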

This architecture works around the unexpected behavior of individual agents terminating mid-simulation in the standard PettingZoo API and circumvents the PPO algorithm's requirement that the number of agents remain unchanged during training. In that sense it is comparable to SuperSuit's "Black Death" wrapper.
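
For comparison, a typical SuperSuit-based route is sketched below. It assumes a parallel version of the environment plus SuperSuit and Stable-Baselines3, which is not necessarily how this repository trains its agents; it only illustrates how the "Black Death" wrapper keeps the agent count fixed so that PPO can be applied.

```python
import supersuit as ss
from stable_baselines3 import PPO

# `parallel_env()` is a placeholder for a ParallelEnv constructor of the environment.
env = parallel_env()

# black_death keeps terminated agents in the observation/reward dictionaries with
# zeroed-out observations and rewards, so the number of agents stays constant.
env = ss.black_death_v3(env)

# Flatten the multi-agent environment into a stack of single-agent vector
# environments so that one shared PPO policy is trained on all agents' experience.
env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 8, base_class="stable_baselines3")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
```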
