Question on reward computation #5

Open
@NagisaZj

Description

Hi, thank you for releasing the code.

I have a question about the computation of the reward. In the compute_reward function in videogpt_reward_model.py, for each transition $(s_t, a_t, s_{t+1})$, the variables image_batch, encodings, and embeddings appear to correspond to $s_t$. The reward $r_t(s_t, a_t)$ then seems to be computed as $\log p(s_t \mid s_{1:t-1})$ when reward_model_compute_joint is set to False, and as the sum $\sum_{k=t-\text{seqlen}+1}^{t} \log p(s_k \mid s_{1:k-1})$ when it is set to True, rather than the $\log p(s_{t+1} \mid s_{1:t})$ stated in the paper. Am I missing a detail that resolves this, or is this exactly the empirical implementation of VIPER? Thank you!
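For concreteness, here is a minimal sketch of the indexing I mean, assuming an array of per-frame conditional log-likelihoods from the autoregressive model. All names here (frame_logprobs, compute_reward_sketch) are hypothetical and not taken from videogpt_reward_model.py, which operates on token encodings rather than a precomputed array:

```python
import numpy as np

def compute_reward_sketch(frame_logprobs, t, seqlen, compute_joint=False):
    """Hypothetical sketch of the indexing described above.

    frame_logprobs[k] is assumed to hold log p(s_k | s_{1:k-1}) from the
    autoregressive video model; this is not the repo's actual API.
    """
    if compute_joint:
        # Joint mode: sum of conditionals over the context window,
        # sum_{k = t - seqlen + 1}^{t} log p(s_k | s_{1:k-1})
        return float(np.sum(frame_logprobs[t - seqlen + 1 : t + 1]))
    # Non-joint mode: a single conditional, log p(s_t | s_{1:t-1})
    return float(frame_logprobs[t])

# Example: with t = 4 and seqlen = 3, joint mode sums entries 2..4,
# so in both modes the reward ends at s_t rather than s_{t+1}.
lp = np.log(np.array([0.9, 0.8, 0.7, 0.6, 0.5]))
print(compute_reward_sketch(lp, t=4, seqlen=3))
print(compute_reward_sketch(lp, t=4, seqlen=3, compute_joint=True))
```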
