Many preference datasets cannot be used properly #1795

Open
AIR-hl opened this issue Nov 2, 2024 · 2 comments
Labels
bug Something isn't working

Comments


AIR-hl commented Nov 2, 2024

Describe the bug
I am currently testing mindnlp.trl, but many preference datasets load without error and then cannot be read. The datasets I have tried so far include: trl-internal-testing/hh-rlhf-helpful-base-trl-style, HuggingFaceH4/ultrafeedback_binarized, argilla/ultrafeedback-binarized-preferences-cleaned, princeton-nlp/llama3-ultrafeedback-armorm, argilla/distilabel-intel-orca-dpo-pairs, Intel/orca_dpo_pairs . . .

Of the datasets tried so far, only Anthropic/hh-rlhf can be printed correctly.

Software Environment:
--os:WSL Ubuntu-22.04
--python 3.10.14
--mindspore 2.3.1
--mindnlp 0.4.1
--numpy 1.26.3

To Reproduce

from mindnlp.dataset import load_dataset

# Loading succeeds; the failure appears when the first row is read.
ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
print(next(ds.create_dict_iterator()))
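To check several datasets in one run, the same two steps (load, then read the first row) could be wrapped in a small loop. The following is only a sketch: `probe_datasets` and its `load_fn` parameter are hypothetical names introduced here, not part of mindnlp, and the split for each dataset has to be supplied by the caller.

```python
def probe_datasets(name_to_split, load_fn):
    """Try to load each dataset and read its first row.

    name_to_split: dict mapping a dataset name to the split to load.
    load_fn: a loader with the call shape load_fn(name, split=...),
             whose result exposes create_dict_iterator() (assumption).
    Returns a dict mapping each name to "ok" or to an error summary.
    """
    results = {}
    for name, split in name_to_split.items():
        try:
            ds = load_fn(name, split=split)
            # Reading the first row is the step that fails in the report above.
            next(ds.create_dict_iterator())
            results[name] = "ok"
        except Exception as exc:
            # Record the exception type and message instead of stopping the loop.
            results[name] = f"{type(exc).__name__}: {exc}"
    return results
```

Passing `load_dataset` from `mindnlp.dataset` as `load_fn`, with a mapping such as `{"HuggingFaceH4/ultrafeedback_binarized": "train_prefs"}`, would give a per-dataset summary that can be pasted into the issue.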

Screenshots / Logs
image

AIR-hl added the bug label Nov 2, 2024
AIR-hl changed the title from "Datasets cannot be used properly" to "Many preference datasets cannot be used properly" Nov 4, 2024
Collaborator

lvyufeng commented Nov 6, 2024

This is confirmed to be a GeneratorDataset problem. I will discuss it with my colleagues on the MindData team first.

Author

AIR-hl commented Nov 26, 2024

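> This is confirmed to be a GeneratorDataset problem. I will discuss it with my colleagues on the MindData team first.

Hello, is there a plan to fix this bug?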
