Hello, Lucas
First, thanks for your code. I'm currently using your PPO implementation for my project, but I've run into a possible numerical-precision issue that may cause training to diverge.
Here is the problem: my original per-step internal reward is small (e.g., 0.02 or 0.001), and the maximum return for the environment is 0.08. Strangely, training does not converge at all, and the problem is solved by multiplying the reward by a factor of ten.
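For reference, this is roughly how I apply the factor of ten, as a minimal gym-style reward wrapper; the wrapper name and the `scale` parameter are just placeholders for my setup, not anything from your code:

```python
import gym

class RewardScaleWrapper(gym.RewardWrapper):
    """Scale every per-step reward by a constant factor (placeholder sketch)."""

    def __init__(self, env, scale=10.0):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # Multiply the raw reward so that value targets and advantages
        # are not vanishingly small during training.
        return reward * self.scale

# Usage (hypothetical environment):
# env = RewardScaleWrapper(MyEnv(), scale=10.0)
```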
I used OpenAI's baselines when I was working with TensorFlow; precision was not a problem there, and the small rewards worked fine. Now that I have ported the code to PyTorch, the policy and the agent are basically the same in both versions (torch and TF), and I do not know how to handle this possible issue.
Here are the logs for the original reward and for the reward magnified by a factor of ten; you can compare the mean return in each of them.
With the original reward, the mean return stays in the range [0.0100, 0.012] and just fluctuates up and down.
With the magnified reward, the return increases over time.
I wonder why such small rewards can make training fail, so I am opening this issue to discuss it.
Thanks in advance.