Hello, Lucas
First, thanks for your code. I'm currently using your PPO implementation for my project, but I've run into a possible numerical-precision issue that may cause training to diverge.
Here is the problem: my original per-step internal reward is small (e.g., 0.02 or 0.001), and the maximum return for the environment is 0.08. Strangely, training does not converge at all, and the problem is solved by multiplying the reward by a factor of ten.
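For reference, this is roughly how I apply the factor of ten, as a minimal gym-style reward wrapper; the wrapper name and the `scale` parameter are just placeholders for my setup, not anything from your code:

```python
import gym

class RewardScaleWrapper(gym.RewardWrapper):
    """Scale every per-step reward by a constant factor (placeholder sketch)."""

    def __init__(self, env, scale=10.0):
        super().__init__(env)
        self.scale = scale

    def reward(self, reward):
        # Multiply the raw reward so that value targets and advantages
        # are not vanishingly small during training.
        return reward * self.scale

# Usage (hypothetical environment):
# env = RewardScaleWrapper(MyEnv(), scale=10.0)
```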
I used OpenAI's baselines when I was working with TensorFlow; precision was not a problem there, and the small rewards worked fine. Now that I have ported the code to PyTorch, the policy and the agent are basically the same in both versions (torch and TF), and I do not know how to handle this possible issue.
Here are the logs for the original reward and for the reward magnified by a factor of ten; you can compare the mean return in each of them.
With the original reward, the mean return stays in the range [0.0100, 0.012] and just fluctuates up and down.
With the magnified reward, the return increases over time.
I wonder why such small rewards can make training fail, so I am opening this issue to discuss it.
Thanks in advance.