
Why is the reward an integrand in the RL example? #163

@SebsCubs

Description

Hi,

I looked for a better-fitting channel to ask this question but couldn't find one.
I have been experimenting with this repo a lot, implementing several distributed training variations and playing with the observation and action spaces for the multizone office simple air test case. There is, however, one aspect of the examples that I still can't wrap my head around: the RL_train.py example uses a custom reward based on an objective integrand:

def get_reward(self):
    '''Custom reward function
    
    '''
    
    # Compute BOPTEST core kpis
    kpis = requests.get('{0}/kpi/{1}'.format(self.url, self.testid)).json()['payload']
    
    # Calculate objective integrand function at this point
    objective_integrand = kpis['cost_tot']*12.*16. + 100*kpis['tdis_tot']
    
    # Compute reward
    reward = -(objective_integrand - self.objective_integrand)
    
    self.objective_integrand = objective_integrand
    
    return reward

I don't completely understand the reasoning behind this choice. Why is it better to calculate the difference between the current objective value and the previous one? Doesn't this change the naturally accumulating nature of the reward signal during training, making it harder to track whether it improves over training steps? What is the advantage of using an integrand here?
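To illustrate what I mean, here is a minimal sketch (my own toy numbers, not taken from the repo) of how I understand the reward to behave: if each step's reward is the negative increment of the cumulative objective integrand, the rewards telescope, and the episode return is just the negative of the final objective value.

# Hypothetical cumulative objective integrand values, as if computed from the
# KPI endpoint after each control step (monotonically increasing over the episode).
objective_history = [0.0, 1.5, 2.25, 4.0, 4.5]

previous = objective_history[0]
rewards = []
for current in objective_history[1:]:
    # Same form as in the example: reward = -(objective_integrand - self.objective_integrand)
    rewards.append(-(current - previous))
    previous = current

print(rewards)       # [-1.5, -0.75, -1.75, -0.5]
print(sum(rewards))  # -4.5, i.e. -(final objective - initial objective)

So summing the rewards over a whole episode just recovers the negative of the final objective integrand, which is part of why I don't see what the incremental formulation adds compared to using the cumulative value directly.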

Thanks!
