question: how is the gradient of the log probs calculated?

Hi Umar, what an awesome free lecture! I cannot thank you enough for your service to all of us developers.

Sorry that I have to borrow this place for a question. In the "RLHF and PPO" slides, page 17, it is said: "This is an expectation, which means we can approximate it with a sample mean by collecting a set D of trajectories."

As I currently understand it, we sample the trajectories, but what we get is the log probs. My question is: how do we go from there to calculating the gradient of the log probs?

Thanks in advance!
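
For anyone landing here with the same question, below is a minimal sketch of one way to see it. It is not taken from the slides. It assumes a toy softmax policy over three actions whose parameters are the logits themselves, and every name in it (`softmax`, `grad_log_prob`, `policy_gradient_estimate`, `theta`, `rewards`) is hypothetical. The point it illustrates: a log prob is a differentiable function of the policy parameters, so its gradient can be computed for each sampled action and then averaged, weighted by reward, to form the sample mean. In a deep learning framework this gradient is normally obtained by automatic differentiation rather than by hand.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_prob(logits, action):
    # log pi_theta(action): a differentiable function of the logits (theta).
    return math.log(softmax(logits)[action])

def grad_log_prob(logits, action):
    # Analytic gradient for a softmax policy (toy assumption):
    # d/d theta_k log pi(action) = 1[k == action] - pi_k
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def policy_gradient_estimate(logits, rewards, num_samples, rng):
    # Sample mean of grad log pi(a) * R(a) over actions drawn from the policy.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        action = rng.choices(range(len(logits)), weights=probs)[0]
        g = grad_log_prob(logits, action)
        for k in range(len(logits)):
            grad[k] += g[k] * rewards[action] / num_samples
    return grad

theta = [0.1, -0.3, 0.5]    # hypothetical policy parameters (logits)
rewards = [1.0, 0.0, 2.0]   # hypothetical reward for each action
estimate = policy_gradient_estimate(theta, rewards, 10_000, random.Random(0))
print(estimate)
```

With a neural network policy the same idea applies, except that the gradient of each log prob with respect to the weights comes from backpropagation through the forward pass that produced it.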
