GRPO has no PRM #79

Open
yiyepiaoling0715 opened this issue Oct 5, 2024 · 8 comments
Comments

@yiyepiaoling0715

GRPO supports step-level rewards, also known as a process reward model (PRM), but I don't see this in the code. Can you explain why, or show how to use step-level rewards? Thanks.

@garrett4wade
Contributor

Sorry for the late reply.

PRM and ORM are similar. The current code here simply extracts scores at the end of each sentence. You can modify the model interface to use scores at all positions (or at the step level, e.g. scores output at every "comma" token), just like how we use values in PPO.

This example may also be helpful.

We'd like to help you if you encounter any issues during implementation.
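To make the suggestion above concrete, here is a minimal sketch of extracting step-level rewards by masking per-token PRM scores at delimiter tokens instead of keeping only the sentence-end score. All names (`step_level_rewards`, `step_token_id`) are illustrative, not part of ReaLHF's actual interface.

```python
# Hedged sketch: turning per-token PRM scores into step-level rewards.
# All names here are illustrative, not ReaLHF's actual API.

def step_level_rewards(scores, token_ids, step_token_id):
    """Keep the reward-model score at every step-delimiter token (e.g. a
    comma or newline) and zero it elsewhere, instead of using only the
    score at the end of the sentence."""
    return [s if t == step_token_id else 0.0
            for s, t in zip(scores, token_ids)]

# Toy usage: token id 42 stands in for a hypothetical step delimiter.
scores = [0.1, 0.5, -0.2, 0.9]
token_ids = [11, 42, 13, 42]
print(step_level_rewards(scores, token_ids, 42))  # [0.0, 0.5, 0.0, 0.9]
```

In a real implementation the mask would be applied to the reward model's output tensor inside the model interface, but the selection logic is the same.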

@yiyepiaoling0715
Author

Thanks, I will try this on the repo.

@yiyepiaoling0715
Author

I've looked at the example you shared; it uses PPO to optimize a sentiment classifier. To my knowledge, and from discussions with colleagues, no one has gotten a real benefit from this so far. Is it just an example for learning, or have you actually seen gains from training a classification task with RL methods like PPO?

@yiyepiaoling0715
Author

Regarding the example you mentioned, I mean this one:
https://github.com/openpsi-project/ReaLHF/blob/main/examples/customized_exp/ppo_sentiment.py

@garrett4wade
Contributor

It's just an example to learn from. You can customize the interface to do what you want, either training the PRM or using the PRM for PPO. Unfortunately, we don't have the bandwidth to provide all reference implementations.
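One way such step-level rewards feed into PPO is through the discounted return that the value head regresses toward. The sketch below is purely illustrative (not ReaLHF code) and shows why dense step-level rewards give intermediate tokens more direct credit than a single sentence-end reward.

```python
# Hedged sketch: folding dense step-level rewards into discounted returns,
# the quantity PPO's value targets are built from. Illustrative only.

def discounted_returns(rewards, gamma=0.9):
    """Backward pass computing R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# With only a sentence-end reward, early tokens see a heavily discounted
# signal; step-level rewards credit intermediate tokens directly.
sentence_end = discounted_returns([0.0, 0.0, 0.0, 0.9])
step_level = discounted_returns([0.0, 0.5, 0.0, 0.9])
print(sentence_end)
print(step_level)
```

Comparing the two printed lists shows the step-level variant assigning a larger return at the middle positions, which is the practical difference between ORM-style and PRM-style training signals.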

@yiyepiaoling0715
Author

Thanks.

@yiyepiaoling0715
Author

Thanks. Could you create a WeChat group for technical discussion and add me? My WeChat ID: yiyepiaoling0715

@garrett4wade
Contributor

Sorry for the late reply. I've just sent you a request on WeChat.
