Question on reward computation #5

Open
@NagisaZj

Description

Hi, thank you for releasing the code.

I have a question about the computation of the reward. In the compute_reward function in videogpt_reward_model.py, for each transition $(s_t, a_t, s_{t+1})$, the variables image_batch, encodings, and embeddings appear to correspond to $s_t$. The reward $r_t(s_t, a_t)$ then seems to be computed as $\log p(s_t \mid s_{1:t-1})$ when reward_model_compute_joint is set to False, and as the sum $\sum_{k=t-\text{seqlen}+1}^{t} \log p(s_k \mid s_{1:k-1})$ when it is set to True, rather than the $\log p(s_{t+1} \mid s_{1:t})$ stated in the paper. Am I missing a detail that resolves this, or is this exactly the empirical implementation of VIPER? Thank you!
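For concreteness, here is a minimal sketch of the indexing I mean, assuming an array of per-frame conditional log-likelihoods from the autoregressive model. All names here (frame_logprobs, compute_reward_sketch) are hypothetical and not taken from videogpt_reward_model.py, which operates on token encodings rather than a precomputed array:

```python
import numpy as np

def compute_reward_sketch(frame_logprobs, t, seqlen, compute_joint=False):
    """Hypothetical sketch of the indexing described above.

    frame_logprobs[k] is assumed to hold log p(s_k | s_{1:k-1}) from the
    autoregressive video model; this is not the repo's actual API.
    """
    if compute_joint:
        # Joint mode: sum of conditionals over the context window,
        # sum_{k = t - seqlen + 1}^{t} log p(s_k | s_{1:k-1})
        return float(np.sum(frame_logprobs[t - seqlen + 1 : t + 1]))
    # Non-joint mode: a single conditional, log p(s_t | s_{1:t-1})
    return float(frame_logprobs[t])

# Example: with t = 4 and seqlen = 3, joint mode sums entries 2..4,
# so in both modes the reward ends at s_t rather than s_{t+1}.
lp = np.log(np.array([0.9, 0.8, 0.7, 0.6, 0.5]))
print(compute_reward_sketch(lp, t=4, seqlen=3))
print(compute_reward_sketch(lp, t=4, seqlen=3, compute_joint=True))
```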
