Hi Umar, what an awesome free lecture! I cannot thank you enough for your service to all of us developers.
Sorry to borrow this space for a question. Page 17 of the "RLHF and PPO" slides says: "This is an expectation, which means we can approximate it with a sample mean by collecting a set D of trajectories."
As I currently understand it, we sample the trajectories, but what we get from them is the log probs. My question is: how do we go from there to calculating the gradient of the log probs?
Thanks in advance!
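
To make the question concrete, here is a minimal sketch of the sample-mean estimate described on the slide, under simplifying assumptions that are not from the slides: each "trajectory" is a single action, the policy is a softmax over three logits, and the reward depends only on the action. The function names, logits, and rewards are all hypothetical. For a softmax policy the gradient of the log prob has a closed form, so the sketch uses that directly. In a framework with automatic differentiation, the same quantity would typically come from building a scalar loss out of the sampled log probs weighted by their rewards and backpropagating through it. For longer trajectories, the per-step log-prob gradients would be summed over the trajectory.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    # For a softmax policy: d/d_logit_k log pi(action) = 1[k == action] - pi(k).
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def estimate_policy_gradient(logits, rewards, num_samples, rng):
    # Sample mean of grad log pi(a) * R(a) over sampled actions.
    # Assumption: one-step trajectories, so each sample is a single action.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(logits)), weights=probs)[0]
        g = grad_log_prob(logits, action)
        for k in range(len(logits)):
            grad[k] += g[k] * rewards[action] / num_samples
    return grad


def exact_policy_gradient(logits, rewards):
    # Exact gradient of the expected reward, available here only because
    # the action space is tiny: pi(k) * (R(k) - E[R]).
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


rng = random.Random(0)        # fixed seed so the run is repeatable
logits = [0.1, -0.3, 0.5]     # hypothetical policy parameters
rewards = [1.0, 0.0, 2.0]     # hypothetical reward per action
estimate = estimate_policy_gradient(logits, rewards, 20000, rng)
exact = exact_policy_gradient(logits, rewards)
print("sample-mean estimate:", estimate)
print("exact gradient:      ", exact)
```

With enough samples the two printed vectors should be close, which is the sense in which the sample mean approximates the expectation.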