
Phasic Policy Gradient #106

Open
cvnad1 opened this issue Oct 30, 2024 · 6 comments

Comments

@cvnad1

cvnad1 commented Oct 30, 2024

@rodrigodesalvobraz I would like to know whether Phasic Policy Gradient (https://arxiv.org/abs/2009.04416) is implemented. If it's not, I would like to try implementing it and adding it to Pearl.

@rodrigodesalvobraz
Contributor

Hi, @cvnad1. That is not implemented in Pearl. If you implement it, that would be much appreciated. Thank you.

@cvnad1
Author

cvnad1 commented Nov 6, 2024

@rodrigodesalvobraz

Hi, I have been going through Pearl's functions and classes for the past few days. I was especially looking at the PPO implementation, since PPG is heavily based on it apart from a new auxiliary training phase.

I noticed that PPO uses separate networks for the policy and the value function, whereas both should share a common base with different heads. Am I wrong about this? Did I miss something?

@yiwan-rl
Contributor

yiwan-rl commented Nov 6, 2024

@cvnad1. I think it is fine to keep the policy and value networks separate. Do you have any specific concern about separating them?

@cvnad1
Author

cvnad1 commented Nov 6, 2024

@yiwan-rl Correct me if I am wrong, but I believe that in the official PPO implementations of most libraries, the network has a common base, a value head, and a policy head, as this gives better results compared to training them separately.

Of course, it is not wrong to train them separately, but it can hurt performance. PPG addresses exactly this trade-off: the authors observed that fully separate policy and value networks give up the benefits of shared features, while naively training a common base on both objectives at once introduces interference between them and reduces sample efficiency. To get the best of both worlds, they proposed the PPG algorithm, as detailed in the paper linked above.

In PPG, we have two neural networks in total:

Policy network -> common base + policy head + auxiliary value head
Value network -> a plain, separate value network

There are two training phases:

Policy phase -> update the policy network and the value network separately, as in regular PPO
Auxiliary phase -> update the policy network through the auxiliary value head using a joint loss (auxiliary value error plus a KL term that keeps the policy from drifting) + update the value network again (for sample efficiency)

I just wanted to give you a summary of how I am planning to implement PPG, since PPG is essentially PPO with some additional losses and training updates. A rough sketch of what I have in mind follows.
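
To make this concrete, here is a minimal PyTorch sketch of the two networks and the auxiliary-phase joint loss, assuming a discrete action space. None of the class or function names below are existing Pearl components; the Pearl-specific wiring (replay buffer, policy learner, etc.) would come on top of this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough sketch only -- placeholder names, not Pearl classes.

class PPGPolicyNetwork(nn.Module):
    """Shared base + policy head + auxiliary value head."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.aux_value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs: torch.Tensor):
        features = self.base(obs)
        logits = self.policy_head(features)
        aux_value = self.aux_value_head(features).squeeze(-1)
        return logits, aux_value


class ValueNetwork(nn.Module):
    """Separate, plain value network."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def auxiliary_phase_loss(policy_net, obs, value_targets, old_logits, beta_clone=1.0):
    """Joint loss used in the auxiliary phase: auxiliary value error plus a
    behavioral-cloning KL term that keeps the policy close to its snapshot
    from before the auxiliary updates."""
    logits, aux_value = policy_net(obs)
    aux_value_loss = 0.5 * F.mse_loss(aux_value, value_targets)
    kl = torch.distributions.kl_divergence(
        torch.distributions.Categorical(logits=old_logits),
        torch.distributions.Categorical(logits=logits),
    ).mean()
    return aux_value_loss + beta_clone * kl
```

During the policy phase I would run the usual PPO clipped-surrogate update on the policy network and a regression update on the value network; the joint loss above would only be used during the auxiliary phase, together with extra value-network epochs.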

Again, I would be delighted to hear any thoughts or suggestions you may have.

@yiwan-rl
Contributor

yiwan-rl commented Nov 7, 2024

Thanks for the explanation. To implement this idea, you could write a new history summarization module that implements the shared base network, similar to this LSTM module: https://github.com/facebookresearch/Pearl/blob/main/pearl/history_summarization_modules/lstm_history_summarization_module.py. The history summarization module's goal is to produce a vector, based on the past history, that represents the agent's current state and serves as the input to both the actor and the critic. I think that is the best place to implement this shared base.
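
As a very rough illustration (a generic PyTorch sketch, not Pearl's actual HistorySummarizationModule interface; a real implementation would need to provide the same methods as the linked LSTM module), the shared base could look something like this:

```python
import torch
import torch.nn as nn

# Generic sketch of a shared-base summarization network; it would need to be
# adapted to the interface used by the LSTM history summarization module
# linked above.
class SharedBaseSummarizer(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Encodes an (observation, previous action) pair into a state vector
        # that both the actor and the critic consume as input.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, observation: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.encoder(torch.cat([observation, action], dim=-1))
```

The actor and the critic would then both take the vector produced by this module as input, so the base is shared without changing the PPO/PPG heads themselves.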

@cvnad1
Author

cvnad1 commented Nov 7, 2024

@yiwan-rl Will check the module and get back to you if I have any doubts.
