
swift 3.4.0 filters the dataset. What causes this? #4026


Open
llp1992 opened this issue Apr 28, 2025 · 3 comments

Comments

@llp1992

llp1992 commented Apr 28, 2025

[INFO:swift] Dataset filtered, origin length: 1124869, filtered dataset length: 586472

swift version: 3.4.0

Why does swift 3.4.0 filter the dataset? swift 2.6.0 does not filter it.
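
For scale, the two lengths in the log line above can be turned into a dropped-sample count and ratio. This is only a small sketch: the regular expression is written against the exact line quoted in this issue, and the helper name `dropped_stats` is hypothetical, not part of swift.

```python
import re

# Log line copied verbatim from the report above.
log_line = ("[INFO:swift] Dataset filtered, origin length: 1124869, "
            "filtered dataset length: 586472")


def dropped_stats(line):
    """Return (origin, kept, dropped, dropped_ratio) parsed from the log line."""
    m = re.search(r"origin length: (\d+), filtered dataset length: (\d+)", line)
    if m is None:
        raise ValueError("not a dataset-filter log line")
    origin, kept = int(m.group(1)), int(m.group(2))
    dropped = origin - kept
    return origin, kept, dropped, dropped / origin


origin, kept, dropped, ratio = dropped_stats(log_line)
print(f"dropped {dropped} of {origin} samples ({ratio:.1%})")
# prints: dropped 538397 of 1124869 samples (47.9%)
```

Nearly half of the samples are removed, which is why the cause matters here.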

@slin000111
Collaborator

With the command-line argument --truncation_strategy delete, samples whose token count exceeds max_length are deleted.
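
The behaviour described here can be illustrated with a minimal sketch. It is not swift's actual implementation: `filter_by_length` is a hypothetical helper, and `str.split` stands in for a real tokenizer, so the token counts are only illustrative.

```python
def filter_by_length(samples, tokenize, max_length):
    """Keep only the samples whose token count is at most max_length.

    Mimics a "delete" strategy: over-length samples are dropped
    instead of being truncated.
    """
    return [s for s in samples if len(tokenize(s)) <= max_length]


samples = [
    "a short sample",                                   # 3 whitespace tokens
    "a much longer sample that will exceed the limit",  # 9 whitespace tokens
]
tokenize = str.split  # assumption: stand-in for a real tokenizer

kept_samples = filter_by_length(samples, tokenize, max_length=4)
print(f"origin length: {len(samples)}, filtered dataset length: {len(kept_samples)}")
# prints: origin length: 2, filtered dataset length: 1
```

If this is the cause, raising max_length or choosing a truncating strategy should change the filtered length.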

@llp1992
Author

llp1992 commented Apr 29, 2025

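> With the command-line argument --truncation_strategy delete, samples whose token count exceeds max_length are deleted.

That is not the cause. The samples are not deleted for exceeding max_length; they are deleted during dataset map processing.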

@Jintao-Huang
Collaborator

Take a look at the error messages above.

They are printed when samples are filtered.
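
To tell the two explanations in this thread apart, one option is to tally why each sample is dropped. The following is a rough sketch under stated assumptions, not swift's code: `encode_or_drop` and `toy_encode` are hypothetical, and the "empty response" check is only an example of a preprocessing failure.

```python
from collections import Counter


def encode_or_drop(samples, encode, max_length):
    """Encode each sample; drop it and record the reason if that fails."""
    kept, reasons = [], Counter()
    for sample in samples:
        try:
            tokens = encode(sample)
        except Exception as exc:  # preprocessing failed for this sample
            reasons[f"preprocess error: {type(exc).__name__}"] += 1
            continue
        if len(tokens) > max_length:  # over-length sample
            reasons["longer than max_length"] += 1
            continue
        kept.append(tokens)
    return kept, reasons


def toy_encode(sample):
    # Assumption: a toy encoder that rejects samples with an empty response.
    if not sample.get("response"):
        raise ValueError("empty response")
    return (sample["query"] + " " + sample["response"]).split()


data = [
    {"query": "hi", "response": "hello there"},
    {"query": "hi", "response": ""},
    {"query": "tell me", "response": "a very long answer " * 10},
]
kept_tokens, reasons = encode_or_drop(data, toy_encode, max_length=8)
print(len(kept_tokens), dict(reasons))
```

A tally like this shows whether the length limit or a preprocessing error accounts for most of the dropped samples.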
