
[Question] One-time evaluation score is not a good indicator, and therefore probably should not be the tuning target? #314

CyclonicDyna opened this issue Nov 16, 2022 · 3 comments

CyclonicDyna commented Nov 16, 2022

❓ Question

I tried to train the agent with the tuned hyperparameters from the best trial, but strangely, the resulting training never reaches the score reported during tuning (in fact, it is much worse). Has anyone run into this problem?

My guess is that the one-time evaluation is very stochastic (question: sb3-rl-zoo currently uses the one-time evaluation score as the target, right?). I don't think it is a good measurement of how good the agent is.
Perhaps we should use the mean of multiple evaluation scores as the tuning target, or the mean reward from the rollouts (a rough sketch of what I mean is below). What do you think?
[Screenshot: evaluation is not a good indicator (MountainCar)]
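To make the suggestion concrete, here is a minimal sketch (assuming a recent Stable-Baselines3 / Gymnasium / Optuna setup; the PPO + MountainCar choice, the search space, and the timestep budget are only placeholders, not what the zoo actually does): the objective averages the return over several evaluation episodes via `evaluate_policy` instead of relying on a single one.

```python
import gymnasium as gym
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def objective(trial: optuna.Trial) -> float:
    # Hypothetical one-parameter search space, just for illustration.
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    env = gym.make("MountainCar-v0")
    model = PPO("MlpPolicy", env, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=50_000)

    # Average the return over several evaluation episodes instead of one,
    # so the tuning target is a less noisy estimate of the agent's quality.
    mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
    return float(mean_reward)


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
```

Averaging over, say, 20 episodes makes the objective a much less noisy estimate of the policy's true performance than a single rollout.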

Thanks for any ideas!


CyclonicDyna added the question label on Nov 16, 2022
qgallouedec (Collaborator) commented

Duplicate #204
Contributions are welcome 🙂

qgallouedec added the duplicate label on Nov 16, 2022
araffin (Member) commented Nov 16, 2022

Related to #286 too.

In the meantime, you can also increase the number of evaluation episodes to reduce the noise (`--eval-episodes`).

> My guess is that the one-time evaluation is very stochastic (question: sb3-rl-zoo currently uses the one-time evaluation score as the target, right?).

Not really: if you are using a pruner, it evaluates the agent periodically on several test episodes, but yes, we currently test only one seed.
To obtain hyperparameters that work across many random seeds, you can do a post-processing step (see #151) or run multiple trainings per trial (#204), as sketched below.
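As a rough illustration of the multi-training idea (this is not the zoo's actual code; the search space, seed count, and timestep budget below are arbitrary placeholders), each trial can train several seeds with the same sampled hyperparameters and be scored by the average evaluation return:

```python
import gymnasium as gym
import numpy as np
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def multi_seed_objective(trial: optuna.Trial, n_seeds: int = 3) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    returns = []
    for seed in range(n_seeds):
        # Train one agent per seed with the same sampled hyperparameters.
        env = gym.make("MountainCar-v0")
        model = PPO("MlpPolicy", env, learning_rate=learning_rate, seed=seed, verbose=0)
        model.learn(total_timesteps=50_000)
        mean_reward, _ = evaluate_policy(model, env, n_eval_episodes=20)
        returns.append(mean_reward)

    # Score the trial by its average across seeds, so the selected
    # hyperparameters are less tied to one lucky run.
    return float(np.mean(returns))
```

The trade-off is cost: every trial becomes `n_seeds` times more expensive, but the selected hyperparameters are less likely to owe their score to one lucky seed.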

CyclonicDyna (Author) commented

Thank you very much @araffin @qgallouedec. I learned a lot from your answers and from the related Q&A :)
