Hi,
I tried to find a more suitable channel to ask this question, but couldn't find one.
I have been experimenting with this repo a lot, implementing several distributed training variations and playing with the observation and action spaces for the multizone office simple air test case. There is, however, one aspect of the examples that I still can't wrap my head around: the RL_train.py example uses a custom reward based on an objective integrand:
```python
def get_reward(self):
    '''Custom reward function
    '''
    # Compute BOPTEST core kpis
    kpis = requests.get('{0}/kpi/{1}'.format(self.url, self.testid)).json()['payload']
    # Calculate objective integrand function at this point
    objective_integrand = kpis['cost_tot']*12.*16. + 100*kpis['tdis_tot']
    # Compute reward
    reward = -(objective_integrand - self.objective_integrand)
    self.objective_integrand = objective_integrand
    return reward
```
I don't completely understand the reasoning behind this choice. Why is it better to compute the difference between the current value of the objective integrand and the previous one? Doesn't this change the naturally increasing nature of the reward signal during training, making it harder to track whether it improves over training steps? What is the advantage of using an integrand here?
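For context, here is a toy sketch (my own made-up numbers, not BOPTEST output) of how I currently read this reward: since the KPI endpoint returns a cumulative objective integrand, the per-step differences seem to telescope, so the undiscounted episode return ends up being minus the final objective value.

```python
# Toy sketch of my reading of the difference-based reward.
# J_k is the cumulative objective integrand reported after step k,
# and the per-step reward is r_k = -(J_k - J_{k-1}).

cumulative_objective = [0.0, 1.5, 2.25, 4.0, 4.25]  # J_0 .. J_4, monotonically increasing

rewards = [
    -(cumulative_objective[k] - cumulative_objective[k - 1])
    for k in range(1, len(cumulative_objective))
]

# The per-step differences telescope, so the undiscounted episode return
# is just minus the final cumulative objective:
episode_return = sum(rewards)
assert episode_return == -cumulative_objective[-1]

print(rewards)         # [-1.5, -0.75, -1.75, -0.25]
print(episode_return)  # -4.25
```

If that reading is right, the per-step reward stays roughly stationary while the episode return still reflects the full-episode objective, but I'd like to confirm whether that is the actual intent behind the design.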
Thanks!