GRPO has no PRM #79

Open
yiyepiaoling0715 opened this issue Oct 5, 2024 · 8 comments
Comments

@yiyepiaoling0715

GRPO supports step-level rewards, also known as a process reward model (PRM), but I don't see this in the code. Can you explain why, or show how to use step-level rewards? Thanks.

@garrett4wade
Contributor

Sorry for the late reply.

PRM and ORM are similar. The current code here simply extracts scores at the end of each sentence. You can modify the model interface to use scores at all positions (or at the step level, e.g. scores output at every "comma" token), just like how we use values in PPO.

This example may also be helpful.

We'd like to help you if you encounter any issues during implementation.
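To make the suggestion above concrete, here is a minimal sketch of extracting step-level rewards by masking per-token PRM scores at delimiter tokens instead of keeping only the sentence-end score. All names (`step_level_rewards`, `step_token_id`) are illustrative, not part of ReaLHF's actual interface.

```python
# Hedged sketch: turning per-token PRM scores into step-level rewards.
# All names here are illustrative, not ReaLHF's actual API.

def step_level_rewards(scores, token_ids, step_token_id):
    """Keep the reward-model score at every step-delimiter token (e.g. a
    comma or newline) and zero it elsewhere, instead of using only the
    score at the end of the sentence."""
    return [s if t == step_token_id else 0.0
            for s, t in zip(scores, token_ids)]

# Toy usage: token id 42 stands in for a hypothetical step delimiter.
scores = [0.1, 0.5, -0.2, 0.9]
token_ids = [11, 42, 13, 42]
print(step_level_rewards(scores, token_ids, 42))  # [0.0, 0.5, 0.0, 0.9]
```

In a real implementation the mask would be applied to the reward model's output tensor inside the model interface, but the selection logic is the same.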

@yiyepiaoling0715
Author

Thanks, I will try this on the repo.

@yiyepiaoling0715
Author

I've looked at the example you shared; it uses PPO to optimize a sentiment classifier. To my knowledge, and from discussions with colleagues, no one has gotten a real benefit from this so far. Is it just an example for learning, or have you actually seen gains from training a classification task with RL methods like PPO?

@yiyepiaoling0715
Author

Regarding the example you mentioned, I mean this one:
https://github.com/openpsi-project/ReaLHF/blob/main/examples/customized_exp/ppo_sentiment.py

@garrett4wade
Contributor

It's just an example to learn from. You can customize the interface to do what you want, either training the PRM or using the PRM for PPO. Unfortunately, we don't have the bandwidth to provide all reference implementations.
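One way such step-level rewards feed into PPO is through the discounted return that the value head regresses toward. The sketch below is purely illustrative (not ReaLHF code) and shows why dense step-level rewards give intermediate tokens more direct credit than a single sentence-end reward.

```python
# Hedged sketch: folding dense step-level rewards into discounted returns,
# the quantity PPO's value targets are built from. Illustrative only.

def discounted_returns(rewards, gamma=0.9):
    """Backward pass computing R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With only a sentence-end reward, early tokens see a heavily discounted
# signal; step-level rewards credit intermediate tokens directly.
sentence_end = discounted_returns([0.0, 0.0, 0.0, 0.9])
step_level = discounted_returns([0.0, 0.5, 0.0, 0.9])
print(sentence_end)
print(step_level)
```

Comparing the two printed lists shows the step-level variant assigning a larger return at the middle positions, which is the practical difference between ORM-style and PRM-style training signals.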

@yiyepiaoling0715
Author

Thanks.

@yiyepiaoling0715
Author

Thanks. Could you create a WeChat group for technical discussion and add me? My WeChat ID: yiyepiaoling0715

@garrett4wade
Contributor

Sorry for the late reply. I've just sent you a request on WeChat.
