Hi,
I tried to find a more suitable channel to ask this question, but couldn't find one.
I have been experimenting with this repo a lot, implementing several distributed training variations and playing with the observation and action spaces for the multizone office simple air test case. There is, however, one aspect of the examples that I still can't wrap my head around: the RL_train.py example uses a custom reward based on an objective integrand:
```python
def get_reward(self):
    '''Custom reward function
    '''
    # Compute BOPTEST core kpis
    kpis = requests.get('{0}/kpi/{1}'.format(self.url, self.testid)).json()['payload']
    # Calculate objective integrand function at this point
    objective_integrand = kpis['cost_tot']*12.*16. + 100*kpis['tdis_tot']
    # Compute reward
    reward = -(objective_integrand - self.objective_integrand)
    self.objective_integrand = objective_integrand
    return reward
```
I don't completely understand the reasoning behind this choice. Why is it better to compute the difference between the current value of the objective integrand and the previous one? Doesn't this change the naturally increasing nature of the reward signal during training, making it harder to track whether it improves over training steps? What is the advantage of using an integrand here?
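For context, here is a toy sketch (my own made-up numbers, not BOPTEST output) of how I currently read this reward: since the KPI endpoint returns a cumulative objective integrand, the per-step differences seem to telescope, so the undiscounted episode return ends up being minus the final objective value.

```python
# Toy sketch of my reading of the difference-based reward.
# J_k is the cumulative objective integrand reported after step k,
# and the per-step reward is r_k = -(J_k - J_{k-1}).

cumulative_objective = [0.0, 1.5, 2.25, 4.0, 4.25]  # J_0 .. J_4, monotonically increasing

rewards = [
    -(cumulative_objective[k] - cumulative_objective[k - 1])
    for k in range(1, len(cumulative_objective))
]

# The per-step differences telescope, so the undiscounted episode return
# is just minus the final cumulative objective:
episode_return = sum(rewards)
assert episode_return == -cumulative_objective[-1]

print(rewards)         # [-1.5, -0.75, -1.75, -0.25]
print(episode_return)  # -4.25
```

If that reading is right, the per-step reward stays roughly stationary while the episode return still reflects the full-episode objective, but I'd like to confirm whether that is the actual intent behind the design.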
Thanks!