❓ Question

I tried to train the agent with the tuned hyperparameters from the best trial, but strangely, the training comes nowhere near the score reported during tuning (it is actually much worse). Has anyone run into this problem?

My guess is that a one-time evaluation is very stochastic (question: is sb3-rl-zoo currently using a one-time evaluation score as the tuning target?). I don't think it is a good measure of how good the agent is.

Perhaps we should use the mean over multiple evaluation episodes as the tuning target, or the mean reward from the rollouts; see the sketch below for what I have in mind. What do you think?

(This is for MountainCar.)

Thanks for any ideas!
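To make the proposal concrete, here is a minimal sketch of an Optuna objective that scores a trial by the mean return over many evaluation episodes. It is only an illustration: the algorithm, environment id, search space, and timestep budget are all made up, and this is not the zoo's actual tuning code.

```python
import gym
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial: optuna.Trial) -> float:
    # Hypothetical search space: tune only the learning rate.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    env = gym.make("MountainCar-v0")
    model = PPO("MlpPolicy", env, learning_rate=lr, verbose=0)
    model.learn(total_timesteps=50_000)
    # Average over many episodes so a single lucky or unlucky rollout
    # does not decide the trial's score.
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
    return mean_reward
```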
Checklist
I have checked that there is no similar issue in the repo
In the meantime, you can also increase the number of evaluation episodes to reduce the noise (`--eval-episodes`).
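For reference, when optimizing with the zoo's `train.py`, that would look something like `python train.py --algo ppo --env MountainCar-v0 -optimize --n-trials 100 --eval-episodes 20` (the environment id and trial budget here are just placeholders; check `python train.py --help` for the exact flags).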
> My guess is that a one-time evaluation is very stochastic (question: is sb3-rl-zoo currently using a one-time evaluation score as the tuning target?).
Not really: if you are using a pruner, it evaluates the agent periodically on several test episodes. But yes, we currently test with only one seed.
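For anyone wondering what periodic evaluation with a pruner looks like, the pattern is roughly the following. This is a simplified sketch, not the zoo's actual callback; the evaluation schedule and episode counts are arbitrary.

```python
import gym
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    env = gym.make("MountainCar-v0")
    model = PPO("MlpPolicy", env, learning_rate=lr, verbose=0)
    mean_reward = float("-inf")
    for step in range(5):  # evaluate the agent periodically during training
        model.learn(total_timesteps=10_000, reset_num_timesteps=False)
        # Several test episodes per intermediate evaluation.
        mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=5)
        trial.report(mean_reward, step)
        if trial.should_prune():  # let the pruner stop unpromising trials early
            raise optuna.TrialPruned()
    return mean_reward

study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)
```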
To obtain hyperparameters that work across many random seeds, you can add a post-processing step (see #151) or run multiple trainings per trial (#204); a sketch of the latter follows.
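As a rough illustration of the multiple-trainings idea (a sketch only, not what #151 or #204 implement; the seed count and budget are arbitrary):

```python
import gym
import numpy as np
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    scores = []
    for seed in (0, 1, 2):  # retrain the same candidate on several seeds
        env = gym.make("MountainCar-v0")
        model = PPO("MlpPolicy", env, learning_rate=lr, seed=seed, verbose=0)
        model.learn(total_timesteps=50_000)
        mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=10)
        scores.append(mean_reward)
    # Averaging (or taking the worst case) across seeds keeps the tuner
    # from picking hyperparameters that only worked for one lucky seed.
    return float(np.mean(scores))
```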