Complete issue 5: explore the pros and cons of different training schemes for CoT-based reward models by Ni-Songhao · Pull Request #19 · Tencent-Hunyuan/UnifiedReward

Ni-Songhao · 2025-07-30T14:02:27Z

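The full pipeline now runs end to end, which confirms that DPO is feasible. Because the resource requirements are high, effectiveness has so far been validated on only a subset of the data. As follow-up work, I will process the full dataset, complete the validation, and compare the results against rule-based GRPO.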
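For readers unfamiliar with the objective being validated, here is a minimal sketch of the standard per-pair DPO loss. It is not the code added in this PR: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All arguments are summed log-probabilities of a full response.
    `beta` scales the implicit reward; 0.1 is an assumed default.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the chosen and rejected implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # When the policy equals the reference, the margin is 0 and the loss is log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Raising the chosen log-probability relative to the reference lowers the loss.
    print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference does, which is the behavior a comparison against rule-based GRPO would evaluate on the same data.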

Ni-Songhao added 2 commits July 30, 2025 21:33

complete issue 5 add DPO code

766c180

complete issue5

5c3c8a2
