question: how is the gradient of the log probs calculated? #1

Open
@letitfly

Hi Umar, what an awesome free lecture! I cannot thank you enough for your service to all of us developers.

Sorry to borrow this place for a question. On page 17 of the "RLHF and PPO" slides, it says: "This is an expectation, which means we can approximate it with a sample mean by collecting a set D of trajectories."

As I currently understand it, we sample the trajectories, but what we get from them is the log probs. My question is: how do we go from there to calculating the gradient of the log probs?

Thanks in advance!
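
For anyone landing here with the same question, below is a minimal sketch of how that step usually works. It assumes the slide refers to the standard policy-gradient (REINFORCE-style) estimator, where the gradient of the objective is the expectation of grad log pi(a) times the reward, and it simplifies a trajectory to a single action (a bandit). The softmax policy, the reward table, and all function names are hypothetical and only for illustration. The point is that the log prob is a differentiable function of the policy parameters, so its gradient is computed per sample and then averaged. In an autodiff framework you would normally not write `grad_log_prob` by hand: you would keep the log probs attached to the computation graph, form a surrogate loss, and let backpropagation produce the same quantity.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob(logits, action):
    """log pi(action) for a softmax policy parameterized directly by logits."""
    return math.log(softmax(logits)[action])


def grad_log_prob(logits, action):
    """Analytic gradient of log pi(action) with respect to the logits.

    For a softmax policy: d log pi(a) / d logit_k = 1[k == a] - pi(k).
    Autodiff would compute this automatically; it is written out here
    only to make the step explicit.
    """
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_gradient_estimate(logits, rewards, num_samples, rng):
    """Sample-mean estimate of E[ grad log pi(a) * R(a) ].

    Assumption: each "trajectory" is one sampled action with a fixed reward.
    """
    probs = softmax(logits)
    actions = list(range(len(logits)))
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(actions, weights=probs)[0]  # sample from the policy
        g = grad_log_prob(logits, a)                # gradient of this sample's log prob
        for k in range(len(grad)):
            grad[k] += g[k] * rewards[a] / num_samples
    return grad


def exact_gradient(logits, rewards):
    """Exact gradient of J = sum_a pi(a) R(a), for comparison with the estimate."""
    probs = softmax(logits)
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected_reward) for p, r in zip(probs, rewards)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.2, -0.1, 0.4]   # hypothetical policy parameters
    rewards = [1.0, 0.0, 0.5]   # hypothetical reward per action
    print("estimate:", policy_gradient_estimate(logits, rewards, 20000, rng))
    print("exact:   ", exact_gradient(logits, rewards))
```

With enough samples the estimate should approach the exact gradient, which is what "approximate the expectation with a sample mean" amounts to in this simplified setting.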
