
Phasic Policy Gradient #106

Open
cvnad1 opened this issue Oct 30, 2024 · 6 comments

Comments

@cvnad1

cvnad1 commented Oct 30, 2024

@rodrigodesalvobraz I would like to know whether Phasic Policy Gradient (https://arxiv.org/abs/2009.04416) is implemented. If it's not, I would like to try implementing it and adding it to Pearl.

@rodrigodesalvobraz
Contributor

Hi, @cvnad1. That is not implemented in Pearl. If you implement it, that would be much appreciated. Thank you.

@cvnad1
Author

cvnad1 commented Nov 6, 2024

@rodrigodesalvobraz

Hi, I have been going through Pearl's functions and classes for the past few days. I was especially looking at the PPO implementation, since PPG is heavily based on it apart from a new auxiliary training phase.

I noticed that PPO uses separate networks for the policy and the value function, whereas both should share a common base with different heads. Am I wrong about this? Did I miss something?

@yiwan-rl
Contributor

yiwan-rl commented Nov 6, 2024

@cvnad1. I think it is fine to keep the policy and value networks separate. Do you have any specific concern about separating them?

@cvnad1
Author

cvnad1 commented Nov 6, 2024

@yiwan-rl Correct me if I am wrong, but I believe that in the official PPO implementations of most libraries, the network has a common base, a value head, and a policy head, as this gives better results compared to training them separately.

Of course, it is not wrong to train them separately, but it can hurt performance. PPG addresses exactly this trade-off: the authors observed that fully separate policy and value networks give up the benefits of shared features, while naively training a common base on both objectives at once introduces interference between them and reduces sample efficiency. To get the best of both worlds, they proposed the PPG algorithm, as detailed in the paper linked above.

In PPG, we have two neural networks in total:

Policy network -> common base + policy head + auxiliary value head
Value network -> a plain, separate value network

There are two training phases:

Policy phase -> update the policy network and the value network separately, as in regular PPO
Auxiliary phase -> update the policy network through the auxiliary value head using a joint loss (auxiliary value error plus a KL term that keeps the policy from drifting) + update the value network again (for sample efficiency)

I just wanted to give you a summary of how I am planning to implement PPG, since PPG is essentially PPO with some additional losses and training updates. A rough sketch of what I have in mind follows.
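
To make this concrete, here is a minimal PyTorch sketch of the two networks and the auxiliary-phase joint loss, assuming a discrete action space. None of the class or function names below are existing Pearl components; the Pearl-specific wiring (replay buffer, policy learner, etc.) would come on top of this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough sketch only -- placeholder names, not Pearl classes.

class PPGPolicyNetwork(nn.Module):
    """Shared base + policy head + auxiliary value head."""

    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.base = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.aux_value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs: torch.Tensor):
        features = self.base(obs)
        logits = self.policy_head(features)
        aux_value = self.aux_value_head(features).squeeze(-1)
        return logits, aux_value


class ValueNetwork(nn.Module):
    """Separate, plain value network."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


def auxiliary_phase_loss(policy_net, obs, value_targets, old_logits, beta_clone=1.0):
    """Joint loss used in the auxiliary phase: auxiliary value error plus a
    behavioral-cloning KL term that keeps the policy close to its snapshot
    from before the auxiliary updates."""
    logits, aux_value = policy_net(obs)
    aux_value_loss = 0.5 * F.mse_loss(aux_value, value_targets)
    kl = torch.distributions.kl_divergence(
        torch.distributions.Categorical(logits=old_logits),
        torch.distributions.Categorical(logits=logits),
    ).mean()
    return aux_value_loss + beta_clone * kl
```

During the policy phase I would run the usual PPO clipped-surrogate update on the policy network and a regression update on the value network; the joint loss above would only be used during the auxiliary phase, together with extra value-network epochs.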

Again, I would be delighted to hear any thoughts or suggestions you may have.

@yiwan-rl
Contributor

yiwan-rl commented Nov 7, 2024

Thanks for the explanation. To implement this idea, you could write a new history summarization module that implements the shared base network, similar to this LSTM module: https://github.com/facebookresearch/Pearl/blob/main/pearl/history_summarization_modules/lstm_history_summarization_module.py. The history summarization module's goal is to produce a vector, based on the past history, that represents the agent's current state and serves as the input to both the actor and the critic. I think that is the best place to implement this shared base.
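
As a very rough illustration (a generic PyTorch sketch, not Pearl's actual HistorySummarizationModule interface; a real implementation would need to provide the same methods as the linked LSTM module), the shared base could look something like this:

```python
import torch
import torch.nn as nn

# Generic sketch of a shared-base summarization network; it would need to be
# adapted to the interface used by the LSTM history summarization module
# linked above.
class SharedBaseSummarizer(nn.Module):
    def __init__(self, obs_dim: int, action_dim: int, hidden_dim: int = 128):
        super().__init__()
        # Encodes an (observation, previous action) pair into a state vector
        # that both the actor and the critic consume as input.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, observation: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.encoder(torch.cat([observation, action], dim=-1))
```

The actor and the critic would then both take the vector produced by this module as input, so the base is shared without changing the PPO/PPG heads themselves.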

@cvnad1
Author

cvnad1 commented Nov 7, 2024

@yiwan-rl Will check the module and get back to you if I have any doubts.
