
Step training speed slows down linearly when using _inner_training_loop from mindnlp.engine.trainer for model porting, preventing a normal epoch from completing #1818

Open
EdwinWang37 opened this issue Nov 14, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@EdwinWang37

EdwinWang37 commented Nov 14, 2024

1. Describe the bug (Mandatory)

While using _inner_training_loop from mindnlp.engine.trainer to port the vec2text code, the per-step training speed slows down linearly, so a normal epoch cannot be completed and a satisfactory porting result cannot be obtained.

  • Hardware Environment (Ascend/GPU/CPU):
    RTX 3090 (GPU)

  • Software Environment (Mandatory):
    -- MindSpore version: 2.2.14
    -- mindnlp version: 0.4.0
    -- Python version (e.g., Python 3.7.5):
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

2. Execute Mode (Mandatory) (PyNative/Graph):

PyNative (although this should be irrelevant: I also tried the default Graph mode and it made no difference)
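
For reference (this snippet is not part of the original report), the execution mode in MindSpore 2.2 is switched with set_context; both settings below were reportedly tried with the same result:

```python
import mindspore

# PyNative (dynamic) mode, used for the runs in this report
mindspore.set_context(mode=mindspore.PYNATIVE_MODE, device_target="GPU")

# Graph mode was also tried and showed the same slowdown:
# mindspore.set_context(mode=mindspore.GRAPH_MODE, device_target="GPU")
```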

3. To Reproduce (Mandatory)

  1. My training input is a MindSpore GeneratorDataset, which is passed, together with the model, into a local BaseTrainer that inherits from mindnlp.engine.Trainer (see the sketch after this list).
  2. The training process is driven by the _inner_training_loop() method of mindnlp.engine.Trainer; this is the approach I use.
  3. The loss and gradients for each batch are then computed in training_step().
  4. Finally, gradient accumulation is (as far as I can tell) handled inside _inner_training_loop().
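
A minimal sketch of the setup described in these steps (not taken from the actual vec2text port): it assumes the mindnlp 0.4 Trainer mirrors the Hugging Face Trainer interface (model/args/train_dataset constructor and a training_step(model, inputs) hook); toy_generator, BaseTrainer and the timing print are illustrative placeholders added only to expose the per-step slowdown.

```python
import time
import numpy as np
from mindspore.dataset import GeneratorDataset
from mindnlp.engine import Trainer, TrainingArguments  # assumed mindnlp 0.4 import path

def toy_generator():
    # Stand-in for the real vec2text data pipeline (step 1).
    for _ in range(1000):
        yield (np.random.randn(128).astype(np.float32),
               np.random.randint(0, 2, (1,)).astype(np.int32))

train_dataset = GeneratorDataset(toy_generator, column_names=["inputs", "labels"]).batch(16)

class BaseTrainer(Trainer):
    """Local trainer inheriting mindnlp.engine.Trainer (step 2)."""

    def training_step(self, model, inputs):
        # Assumed to mirror the HF signature; loss and gradients are computed here (step 3).
        start = time.perf_counter()
        loss = super().training_step(model, inputs)
        print(f"step time: {time.perf_counter() - start:.2f}s")  # the value that grows linearly
        return loss

# With a real model the trainer would be driven like this; train() ends up in
# _inner_training_loop(), which also handles gradient accumulation (step 4):
# trainer = BaseTrainer(model=model, args=TrainingArguments(output_dir="out"),
#                       train_dataset=train_dataset)
# trainer.train()
```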

4. Expected behavior (Mandatory)

Fix this bug so that every training step runs at the same speed and the porting task can be completed.

5. Screenshots / Logs (Mandatory)

If applicable, add screenshots to help explain your problem.

Below is the time/step at the beginning of training, roughly 2–3 s per step, which is acceptable:

[screenshot: time/step at the start of training]

This is around step 500; it is already much slower, and it keeps getting slower:

[screenshot: time/step around step 500]

I train on a single GPU (card 1); late in training the utilization is only about 1% and each step is very slow:

[screenshot: GPU utilization late in training]

6. Additional context (Optional)

Since this Trainer mirrors the Hugging Face transformers library, and given how transformers implements this together with the linear nature of the slowdown, my preliminary guesses are:

1. The training_step in the original transformers code includes a del inputs operation and a cache-clearing step, which mindnlp.engine.Trainer does not appear to have (a paraphrased excerpt of that pattern is sketched after this list):
https://github.com/huggingface/transformers/blob/a3d69a8994d673899608a7c17fbf4f953f50474e/src/transformers/trainer.py#L3615C1-L3631C41
2. Could the computation graph not be getting cleared? Because _inner_training_loop performs gradient accumulation, the computation graph may not be released under MindSpore after backpropagation produces the gradients, so the graph keeps growing and could be what inflates the per-step time.
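
For context (not part of the original issue text), the transformers pattern referred to in point 1 looks roughly like the excerpt below. It is paraphrased from the linked trainer.py, not a verbatim copy, and torch_empty_cache_steps is a transformers TrainingArguments option, not something mindnlp exposes.

```python
import torch

# Paraphrased from the transformers Trainer.training_step linked above (not verbatim).
def training_step(self, model, inputs):
    # ... forward pass ...
    loss = self.compute_loss(model, inputs)

    del inputs  # drop the reference to the input batch as soon as it is no longer needed

    # Periodically release cached device memory (opt-in via torch_empty_cache_steps).
    if (
        self.args.torch_empty_cache_steps is not None
        and self.state.global_step % self.args.torch_empty_cache_steps == 0
    ):
        torch.cuda.empty_cache()  # equivalent calls exist for NPU/XPU/MPS backends

    # ... backward pass and return of the detached loss ...
```

This is only the pattern point 1 refers to; whether the missing cleanup is actually what causes the slowdown in mindnlp is the open question of this issue.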

I hope the developers of this repository can help with this. Thank you very much!

@EdwinWang37 EdwinWang37 added the bug Something isn't working label Nov 14, 2024
@EdwinWang37 EdwinWang37 reopened this Nov 15, 2024
@lvyufeng
Collaborator

It should not be the second cause: the graph must be kept, otherwise the backward pass cannot be computed. I will look into the first one.

@EdwinWang37
Author

It should not be the second cause: the graph must be kept, otherwise the backward pass cannot be computed. I will look into the first one.

#1793 (comment)
Issue #1793 looks like the same situation as mine; I don't know whether switching to Ascend solved it for them.

@dayunyan

@EdwinWang37 I no longer see this problem after switching to Ascend.

@EdwinWang37
Author

@EdwinWang37 I no longer see this problem after switching to Ascend.

@dayunyan Thank you very much! I will apply for a compute voucher and try running it again on Ascend.
