Phasic Policy Gradient #106
Comments
Hi @cvnad1. That is not implemented in Pearl. If you implemented it, that would be much appreciated. Thank you.
Hi, I have been trying to go through Pearl's functions and classes for the past few days. I was especially looking at the PPO implementation, since PPG is heavily based on it except for a new auxiliary training phase. I noticed that PPO uses separate networks for the policy and the value function, whereas both should share a common base with different heads. Am I wrong about this? Did I miss something?
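For concreteness, here is a minimal PyTorch sketch of the kind of shared-base actor-critic described above. This is not Pearl code; the class, method, and dimension names (`SharedBaseActorCritic`, `obs_dim`, `action_dim`, `hidden_dim`) are chosen purely for illustration.

```python
import torch
import torch.nn as nn


class SharedBaseActorCritic(nn.Module):
    """Illustrative sketch: one common torso feeding a policy head and a value head."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64) -> None:
        super().__init__()
        # Shared base: both heads read the same learned representation.
        self.base = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)  # state-value estimate

    def forward(self, obs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        features = self.base(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)
```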
@cvnad1 I think it is fine to keep the policy and value networks separate. Do you have any specific concern about separating them?
@yiwan-rl Correct me if I am wrong, but I believe that in the official PPO implementations of most libraries, the network has a common base, a value head, and a policy head, as this gives better results than training them separately. Of course, it is not wrong to train them separately, but it can result in poorer performance. PPG addresses exactly this: the authors noticed that separating the policy and value networks results in poor performance, while keeping a common base introduces noise and reduces sample efficiency. To get the best of both worlds, they proposed the PPG algorithm detailed in the paper linked above.

In PPG, we have two neural networks in total:
- Policy network -> common base + policy head + auxiliary value head
- Value network -> a separate network trained to predict values

There are two training phases:
- Policy (training) phase -> update the policy network and the value network separately, as in PPO
- Auxiliary phase -> distill value knowledge into the auxiliary value head of the policy network while keeping the policy close to its previous behavior

I just wanted to give you a summary of how I am planning to implement PPG, since PPG is essentially a modification of PPO with some additional losses and training updates. Again, I would be delighted to hear any thoughts or suggestions that can help me.
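To make the auxiliary phase concrete, here is a minimal PyTorch sketch of its joint loss (auxiliary value loss plus a KL behavioral-cloning term weighted by beta_clone), assuming a discrete action space. The function and argument names are made up for illustration; an actual Pearl implementation would look different.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical, kl_divergence


def ppg_auxiliary_loss(
    new_logits: torch.Tensor,     # policy-head logits on replayed states (current params)
    old_logits: torch.Tensor,     # policy-head logits recorded before the auxiliary phase
    aux_values: torch.Tensor,     # auxiliary value-head predictions on the same states
    value_targets: torch.Tensor,  # value targets stored during the policy phase
    beta_clone: float = 1.0,
) -> torch.Tensor:
    """Joint auxiliary-phase loss: value distillation plus a policy-cloning constraint."""
    # Auxiliary value loss: train the value head attached to the policy network.
    aux_value_loss = 0.5 * F.mse_loss(aux_values, value_targets)
    # Behavioral-cloning term: keep the policy close to what it was before this phase.
    kl = kl_divergence(
        Categorical(logits=old_logits), Categorical(logits=new_logits)
    ).mean()
    return aux_value_loss + beta_clone * kl
```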
Thanks for the explanation. To implement this idea, you could write a new history summarization module that implements the shared base network, similar to this LSTM module: https://github.com/facebookresearch/Pearl/blob/main/pearl/history_summarization_modules/lstm_history_summarization_module.py. The history summarization module's goal is to produce a vector, based on past history, that represents the agent's current state, which is then the input to both the actor and the critic. I think the best place to implement this shared base is there.
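A rough sketch of that suggestion is below. In Pearl this would subclass `HistorySummarizationModule` and implement its abstract methods, following the linked `lstm_history_summarization_module.py`; the class and parameter names here are placeholders, not Pearl's API, and the exact interface should be taken from the Pearl source.

```python
import torch
import torch.nn as nn

# Sketch only: in Pearl this logic would live in a subclass of
# HistorySummarizationModule (see the linked LSTM module); the names below
# are illustrative placeholders.


class SharedBaseSummarizer(nn.Module):
    """Maps the latest observation (or a history window) to a feature vector
    that both the actor and the critic would consume as the agent's state."""

    def __init__(self, observation_dim: int, feature_dim: int = 128) -> None:
        super().__init__()
        self.torso = nn.Sequential(
            nn.Linear(observation_dim, feature_dim),
            nn.ReLU(),
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(),
        )

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # The returned vector plays the role of the shared state representation.
        return self.torso(observation)
```

With something like this in place, the actor, the critic, and the auxiliary value head would all take the summarizer's output as their input, which is how the shared base ends up being shared.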
@yiwan-rl I will check the module and get back to you if I have any doubts.
@rodrigodesalvobraz I would like to know whether Phasic Policy Gradient (https://arxiv.org/abs/2009.04416) is implemented. If it is not, I would like to try implementing it and adding it to Pearl.