Training keeps stopping for all sorts of reasons. Is there a way to resume training after an interruption? #148

Open
MichealZhangxa opened this issue Dec 5, 2024 · 3 comments

Comments

@MichealZhangxa

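Is there a way to restart training and continue with the data that has not been trained on yet? I also found that I don't know how to use the weights that are saved every fixed number of steps.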

@ZhangXJ199
Collaborator

The current code only saves the weights after training has finished completely.

@depixels

depixels commented Mar 6, 2025

In finetune.sh, change --save_steps to 10000 and change the save num limit to 10. After an interruption, set resume_from_pretrain (roughly that parameter name) to ${check_point_path} in trainer.train.
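A minimal sketch of picking the checkpoint to resume from, assuming the project trains through the Hugging Face `Trainer` (the thread does not confirm this). In that library the argument to `trainer.train` is named `resume_from_checkpoint`, the save limit is `save_total_limit`, and checkpoints are written as `checkpoint-<step>` subdirectories of the output directory. The helper name `find_latest_checkpoint` is hypothetical; `transformers.trainer_utils.get_last_checkpoint` offers similar behavior if that library is in use.

```python
import re
from pathlib import Path

# Directory names of the form "checkpoint-<global_step>".
_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CKPT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Assumed usage inside the training script (not runnable on its own,
# because `trainer` must be the project's Trainer instance):
#   ckpt = find_latest_checkpoint("path/to/output_dir")
#   trainer.train(resume_from_checkpoint=str(ckpt) if ckpt else None)
```

Comparing the step numbers as integers matters here: a plain string sort would rank `checkpoint-9000` above `checkpoint-10000`.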

@williamium3000

williamium3000 commented May 10, 2025

This seems to run into the following error:
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options
(1) Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with weights_only=True please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use torch.serialization.add_safe_globals([_reconstruct]) to allowlist this global if you trust this class/function.
Has anyone managed to adapt this successfully?
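The error above comes from allowlist-based unpickling: a weights-only load rejects any global that is not on its allowlist, and the message itself points to `torch.serialization.add_safe_globals([_reconstruct])` as the way to extend that list. The sketch below illustrates the same mechanism using only the standard-library `pickle` module. It is not PyTorch's implementation; `AllowlistUnpickler` and `safe_loads` are hypothetical names, and `fractions.Fraction` stands in for the NumPy helper named in the traceback.

```python
import io
import pickle
from fractions import Fraction


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that resolves only the (module, name) pairs it was given."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module, name):
        # Called for every global referenced by the pickle stream.
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"Unsupported global: {module}.{name}")


def safe_loads(data, allowed=()):
    """Load pickled bytes, permitting only the allowlisted globals."""
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


payload = pickle.dumps(Fraction(3, 4))

# With an empty allowlist the load is refused, like the error in the thread.
try:
    safe_loads(payload)
    blocked = False
except pickle.UnpicklingError:
    blocked = True

# Allowlisting the one trusted global lets the same bytes load.
restored = safe_loads(payload, allowed={("fractions", "Fraction")})
```

Under this model, option (2) in the error message corresponds to adding the trusted global to the allowlist before loading, while option (1) disables the check entirely and should be reserved for files from a trusted source.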

4 participants