Software Environment / 软件环境 (Mandatory / 必填):
-- MindSpore version: 2.2.14
-- mindnlp version: 0.4.0
-- Python version (e.g., Python 3.7.5):
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
1. Describe the bug/ 问题描述 (Mandatory / 必填)
While porting vec2text using `_inner_training_loop` from `mindnlp.engine.trainer`, the per-step training speed slows down linearly, so a normal epoch cannot finish and a satisfactory porting result cannot be obtained.
Hardware Environment (Ascend/GPU/CPU) / 硬件环境: GPU (RTX 3090)
2. Execute Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph): PyNative (this should be irrelevant; I also tried the default Graph mode and the problem persists)
3. To Reproduce / 重现步骤 (Mandatory / 必填)
4. Expected behavior / 预期结果 (Mandatory / 必填)
Fix this bug so that every step trains at the same speed and the porting task can be completed.
5. Screenshots/ 日志 / 截图 (Mandatory / 必填)
If applicable, add screenshots to help explain your problem.
Below is the time/step at the start of training, roughly 2-3 s per step, which is acceptable:
This is at step 500; it is clearly much slower already, and it keeps getting slower:
I train on a single card (card 1); late in training the GPU utilization is only 1% and each step is very slow.
6. Additional context / 备注 (Optional / 选填)
Since Trainer is modeled on Hugging Face's transformers library, based on the transformers implementation and the linear slowdown, my preliminary guesses are:
1. The original transformers `training_step` performs a `del inputs` operation and a cache-clearing step, while `mindnlp.engine.Trainer` does not?
https://github.com/huggingface/transformers/blob/a3d69a8994d673899608a7c17fbf4f953f50474e/src/transformers/trainer.py#L3615C1-L3631C41
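The pattern in question can be sketched in a framework-agnostic way; `step_fn` below is a hypothetical stand-in for the real forward/backward call, not an actual mindnlp or transformers API:

```python
import gc

def step_with_cleanup(step_fn, inputs):
    """Run one training step, then drop the batch and collect garbage.

    Mirrors the `del inputs` + cache-clearing pattern in the linked
    Hugging Face training_step; `step_fn` stands in for the real
    forward/backward computation.
    """
    loss = step_fn(inputs)
    del inputs    # release the reference to the batch so it can be freed
    gc.collect()  # reclaim reference cycles that may pin tensor memory
    return loss
```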
2. Could the computational graph be left uncleared? Because `_inner_training_loop` has to accumulate gradients, the graph may not be released after backward propagation produces the grads under the mindspore framework, so the graph keeps growing, which could increase each step's time?
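As a plain-Python illustration of this hypothesis (not the actual mindspore internals): if each accumulation step retains that step's gradient objects (or their graphs), the working set grows linearly with step count, whereas summing into one fixed buffer per parameter keeps it constant:

```python
def accumulate_inplace(buffers, grads):
    """Add this step's gradients into fixed per-parameter buffers.

    Keeping one buffer per parameter, instead of retaining every
    step's gradient objects, keeps memory (and hence step time)
    flat across gradient-accumulation steps.
    """
    for buf, grad in zip(buffers, grads):
        for i, g in enumerate(grad):
            buf[i] += g  # in-place accumulation, no per-step retention
    return buffers
```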
I would appreciate help from the developers of this repository. Thank you very much!