**Actions**: Every piston can be acted on at each time step. In discrete mode, the action space is 0 to move down by 4 pixels, 1 to stay still, and 2 to move up by 4 pixels. In continuous mode, the value in the range [-1, 1] is proportional to the amount that the pistons are lowered or raised by. Continuous actions are scaled by a factor of 4, so that in both the discrete and continuous action spaces, the action 1 will move pistons 4 pixels up, and -1 will move pistons 4 pixels down.
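To make the mapping concrete, here is a minimal sketch of the action-to-pixel conversion described above. The helper name `action_to_pixels` is hypothetical and not part of Pistonball's API.

```python
# Hypothetical helper illustrating the action mapping described above;
# not part of the Pistonball API.
def action_to_pixels(action, continuous=False):
    if continuous:
        # Continuous actions lie in [-1, 1] and are scaled by a factor of 4,
        # so 1.0 moves a piston 4 pixels up and -1.0 moves it 4 pixels down.
        return 4 * float(action)
    # Discrete actions: 0 -> down 4 pixels, 1 -> stay still, 2 -> up 4 pixels.
    return 4 * (action - 1)
```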
- **Rewards**: The same reward is provided to each agent based on how much the ball moved left in the last time-step plus a constant time-penalty. Specifically, there are three components to the distance reward. First, the x-distance in pixels travelled by the ball towards the left-wall in the last time-step (moving right would provide a negative reward). Second, a scaling factor of 100. Third, a division by the distance in pixels between the ball at the start of the time-step and the left-wall. That final division component means moving one unit left when close to the wall is far more valuable than moving one unit left when far from the wall. There is also a configurable time-penalty (default: -0.1) added to the distance-based reward at each time-step. For example, if the ball does not move in a time-step, the reward will be -0.1, not 0. This is to incentivize solving the game faster.
+ **Rewards**: The same reward is provided to each agent based on how much the ball moved left in the last time-step (moving right results in a negative reward) plus a constant time-penalty. The distance component is the percentage of the initial total distance (i.e. at game-start) to the left-wall travelled in the past time-step. For example, if the ball began the game 300 pixels away from the wall, began the time-step 180 pixels away and finished the time-step 175 pixels away, the distance reward would be 100 * 5/300 ≈ 1.7. There is also a configurable time-penalty (default: -0.1) added to the distance-based reward at each time-step. For example, if the ball does not move in a time-step, the reward will be -0.1, not 0. This is to incentivize solving the game faster.
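As a sanity check, the worked example above can be reproduced in a few lines of Python; the variable names below are illustrative only and do not appear in the environment's code.

```python
# Worked example of the distance reward plus time penalty, using the
# numbers from the paragraph above (illustrative variable names only).
dist_at_game_start = 300  # pixels from the ball to the left wall at game start
dist_prev = 180           # distance at the start of the time-step
dist_curr = 175           # distance at the end of the time-step
time_penalty = -0.1       # default configurable time penalty

distance_reward = 100 * (dist_prev - dist_curr) / dist_at_game_start  # 100 * 5/300 ≈ 1.67
reward = distance_reward + time_penalty                               # ≈ 1.57
print(round(distance_reward, 2), round(reward, 2))                    # 1.67 1.57
```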

Pistonball uses the Chipmunk physics engine, and thus the physics are about as realistic as in the game Angry Birds.

@@ -632,15 +631,15 @@ def step(self, action):
             # The negative one is included since the x-axis increases from left-to-right. And, if the x
             # position decreases we want the reward to be positive, since the ball would have gotten closer
             # to the left-wall.
-            global_reward = (
+            reward = (
                 -1
                 * (ball_curr_pos - self.ball_prev_pos)
                 * (100 / self.distance_to_wall_at_game_start)
             )
             if not self.terminate:
-                global_reward += self.time_penalty
+                reward += self.time_penalty

-            self.rewards = {agent: global_reward for agent in self.agents}
+            self.rewards = {agent: reward for agent in self.agents}
             self.ball_prev_pos = ball_curr_pos
             self.frames += 1
         else: