Implemented CrossQ #243
Conversation
@araffin in my initial PR it seems like one code style check was failing, sorry about that. I fixed it and it passes on my machine now. I hope it will go through now :)
.. autosummary::
    :nosignatures:

    MlpPolicy
Could you add at least the multi-input policy (so we can try it in combination with HER)? Normally, only the feature extractor should need to change. And what do you think about adding CnnPolicy?
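For reference, SB3 algorithms declare their supported policies through a `policy_aliases` mapping, so wiring the extra policies up could look roughly like this sketch (the SAC policy classes stand in purely for illustration; the actual CrossQ policies would carry the BN layers):

```python
from typing import ClassVar, Dict, Type

from stable_baselines3.common.off_policy_algorithm import OffPolicyAlgorithm
from stable_baselines3.common.policies import BasePolicy
from stable_baselines3.sac.policies import CnnPolicy, MlpPolicy, MultiInputPolicy


class CrossQSketch(OffPolicyAlgorithm):
    # Name -> class mapping used across SB3 algorithms; the SAC policies
    # are placeholders here, not the classes from this PR.
    policy_aliases: ClassVar[Dict[str, Type[BasePolicy]]] = {
        "MlpPolicy": MlpPolicy,
        "CnnPolicy": CnnPolicy,
        "MultiInputPolicy": MultiInputPolicy,
    }
```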
This is a good point. I looked into it and have not added it yet. If I am not mistaken, this would also require some changes to the CrossQ train() function, since concatenating and splitting the batches would then need some control flow depending on the policy in use (see the sketch below). For simplicity's sake (for now), and since I did not have time to evaluate the multi-input policy, I left it out.
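To make the extra control flow concrete, here is a minimal sketch of my own (not code from this PR): flat observations can be concatenated directly along the batch dimension, while dict observations have to be handled key by key:

```python
import torch as th


def cat_obs(obs, next_obs):
    """Concatenate current and next observations along the batch dimension.

    Flat tensors (MlpPolicy/CnnPolicy) concatenate directly; dict
    observations (MultiInputPolicy) must be concatenated per key.
    """
    if isinstance(obs, dict):
        return {key: th.cat([obs[key], next_obs[key]], dim=0) for key in obs}
    return th.cat([obs, next_obs], dim=0)


# Flat case: the joint batch is split back in two after the critic pass.
obs, next_obs = th.randn(256, 17), th.randn(256, 17)
joint = cat_obs(obs, next_obs)             # shape (512, 17)
first_half, second_half = joint.chunk(2, dim=0)
```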
sb3_contrib/crossq/crossq.py (outdated)

with th.no_grad():
    # Select action according to policy
    self.actor.set_training_mode(False)
Is that needed? self.actor.set_training_mode(False) is already set above? Or did you mean self.actor.set_training_mode(True)?
I added more mode calls than strictly needed. The reason is that I wanted to be very explicit about which mode is needed where. I think using the wrong BN mode is one of the big gotchas and sources of error when implementing CrossQ. Since this should serve as a PyTorch reference to aid others when they implement it themselves, I think it is helpful to make the mode explicit at each step to clear up possible confusion.
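To illustrate the pattern, here is a condensed sketch of the idea (not the exact code from this PR):

```python
import torch as th


def actor_passes(actor, replay_data):
    """Every forward pass is preceded by an explicit mode call, so it is
    unambiguous which BN statistics are used and when they are updated."""
    # Policy-loss forward pass: train mode, BN batch statistics are updated.
    actor.set_training_mode(True)
    actions_pi, log_prob = actor.action_log_prob(replay_data.observations)

    with th.no_grad():
        # Next-action selection for the critic target: eval mode, the BN
        # running statistics are used and left untouched.
        actor.set_training_mode(False)
        next_actions, next_log_prob = actor.action_log_prob(replay_data.next_observations)

    return actions_pi, log_prob, next_actions, next_log_prob
```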
Thanks a lot for the implementation =) I'll try it later in the week, but how is it in terms of runtime? (SAC vs. CrossQ in PyTorch)
No worries :) I just pushed most of the things you requested. I'll add some more specific responses directly to the questions above.
It seems to be quite a bit slower than the SAC baseline (and the JAX implementation as well).
I suspect something is wrong with the current implementation (I'm currently investigating whether it is due to my changes or not). Hyperparameters for BipedalWalker-v3:

```yaml
BipedalWalker-v3:
  n_timesteps: !!float 2e5
  policy: 'MlpPolicy'
  buffer_size: 300000
  gamma: 0.98
  learning_starts: 10000
  policy_kwargs: "dict(net_arch=dict(pi=[256, 256], qf=[1024, 1024]))"
```

With the RL Zoo CLI for both SBX and SB3 (see the SBX readme for how to enable support), I'm getting much better results with SBX...
Did you figure out what the issue is? I was at ICRA until last week, so I didn't have time, but if you haven't found it yet I can also have a look. Before I pushed my last commit I benchmarked it, and the results looked as expected.
sb3_contrib/crossq/crossq.py (outdated)

# which behave differently in train and eval modes.

self.actor.set_training_mode(True)
actions_pi, log_prob = self.actor.action_log_prob(replay_data.observations)
Minor, but still a difference from the original implementation: the batch norm stats for the actor are updated at every training step? (In other words, the delay parameter is not taken into account?)
Good catch, that's true. I guess the easiest fix for this would be to set self.actor.set_training_mode(update_actor_and_temperature). What do you think?
I reorganized things, and now it should behave as in the original implementation.
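The gist of the fix, as a rough sketch (variable names assumed, not the PR's exact code): the actor is only put in train mode, and its BN statistics only updated, on the delayed steps where the actor and temperature are actually optimized:

```python
def actor_forward(actor, replay_data, gradient_step: int, policy_delay: int):
    """Gate the actor's BN statistics updates on the policy delay (sketch)."""
    update_actor = gradient_step % policy_delay == 0
    # Train mode updates the BN running statistics; eval mode only reads
    # them, matching the original JAX implementation on non-actor steps.
    actor.set_training_mode(update_actor)
    actions_pi, log_prob = actor.action_log_prob(replay_data.observations)
    return actions_pi, log_prob, update_actor
```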
I simplified the network creation (needs DLR-RM/stable-baselines3#1975 to be merged into master), added the updated betas for Adam (it had an impact on my small experiments with Pendulum), and fixed a wrong default value for BN momentum that I introduced.
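For concreteness, a sketch of the two hyperparameters mentioned above, with the values as I recall them from the CrossQ paper (treat them as assumptions; plain BatchNorm1d stands in here for the normalization layer actually used):

```python
import torch as th
import torch.nn as nn

# Q-network sketch for an env with 17-dim observations and 6-dim actions.
qf = nn.Sequential(
    nn.Linear(17 + 6, 1024),
    # PyTorch's momentum=0.01 corresponds to a 0.99 running-average decay
    # in the Flax/JAX convention used by the original implementation.
    nn.BatchNorm1d(1024, momentum=0.01),
    nn.ReLU(),
    nn.Linear(1024, 1),
)
# The paper lowers Adam's beta1 from the default 0.9 to 0.5.
optimizer = th.optim.Adam(qf.parameters(), lr=1e-3, betas=(0.5, 0.999))
```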
Sorry for the delay, I finally had some time to run more experiments and it looks good =): https://wandb.ai/openrlbenchmark/sb3-contrib/reports/SB3-Contrib-CrossQ--Vmlldzo4NTE2MTEx Now we need to update the docs and it should be ready to merge!
LGTM, thanks again =)
This PR implements CrossQ (https://openreview.net/pdf?id=PczQtTsTIX), a novel off-policy deep RL algorithm that carefully uses batch normalisation and removes target networks to achieve state-of-the-art sample efficiency at much lower computational cost, since it does not require large update-to-data ratios.
Description
This is a PyTorch implementation based on the original JAX implementation (https://github.com/adityab/CrossQ).
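The key mechanism, as a simplified sketch based on my reading of the paper (single critic, assumed `critic(obs, actions)` signature; not a verbatim excerpt of this PR): there is no target network, and instead current and next observation-action pairs go through the critic in one joint batch so that the BN layers see both distributions, after which the output is split:

```python
import torch as th
import torch.nn.functional as F


def crossq_critic_loss(critic, actor, replay_data, gamma: float, ent_coef: float):
    """Simplified CrossQ critic update: joint BN pass, no target network."""
    with th.no_grad():
        actor.set_training_mode(False)
        next_actions, next_log_prob = actor.action_log_prob(replay_data.next_observations)

    critic.set_training_mode(True)  # BN computes statistics over the joint batch
    all_obs = th.cat([replay_data.observations, replay_data.next_observations], dim=0)
    all_actions = th.cat([replay_data.actions, next_actions], dim=0)
    all_q = critic(all_obs, all_actions)

    q_values, next_q_values = all_q.chunk(2, dim=0)
    with th.no_grad():  # the target half contributes no gradient
        target_q = next_q_values - ent_coef * next_log_prob.reshape(-1, 1)
        target_q = replay_data.rewards + (1 - replay_data.dones) * gamma * target_q
    return F.mse_loss(q_values, target_q)
```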
The following plot shows that the performance matches what is reported in the original paper, as well as that of the open-source SBX implementation provided by the authors (evaluated on 10 seeds).
Open RL benchmark report: https://wandb.ai/openrlbenchmark/sb3-contrib/reports/SB3-Contrib-CrossQ--Vmlldzo4NTE2MTEx
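Once merged, usage should mirror the other sb3-contrib algorithms; a minimal sketch, assuming the final import path:

```python
from sb3_contrib import CrossQ

model = CrossQ("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=10_000)
```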
Context
Closes #238 ([Feature Request] Implement CrossQ)
Types of changes
Checklist:
- `make format` (required)
- `make check-codestyle` and `make lint` (required)
- `make pytest` and `make type` both pass (required)

Note: we are using a maximum length of 127 characters per line.