[Performance] Remove the redundant pd_op.assign_out_ op at the end of while loop #9002
… loop, avoiding redundant kv cache copy
Thanks for your contribution!
Codecov Report
✅ All modified and coverable lines are covered by tests.
❌ Your project status has failed because the head coverage (52.78%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Coverage Diff

| | develop | #9002 | +/- |
|---|---|---|---|
| Coverage | 52.92% | 52.78% | -0.15% |
| Files | 661 | 661 | |
| Lines | 107069 | 106945 | -124 |
| Hits | 56670 | 56452 | -218 |
| Misses | 50399 | 50493 | +94 |
This Pull Request is stale because it has been open for 60 days with no activity.
Automatically closed by Paddle-bot.
PR types
Performance optimization
PR changes
Others
Description
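Currently, according to the description linked here, Paddle appends a
pd_op.assign_out_ op for every loop variable at the end of a while loop, but this is unnecessary. During LLM decoding, it causes the kv cache to be needlessly copied once for every decoded token, which slows down inference. This PR adds a PIR Pass that removes the pd_op.assign_out_ ops sitting directly next to the cf.yield op at the end of the loop. It was tested with predictor.py on the llama2 model: the model still produces normal output, and performance improves by about 10% when processing long text. As a simplified example, this PIR Pass transforms the following while loop body:
into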
Profiling before the change: the light-blue D2D (device-to-device) memory copies take up part of the time.
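For a rough sense of scale of these D2D copies, the sketch below estimates how many bytes one full kv cache copy would move per decoded token. The model dimensions (32 layers, 32 kv heads, head size 128, 2-byte elements, batch size 1) are assumptions chosen to resemble a 7B-class llama2 configuration; they are not taken from this PR. It also assumes the copied tensors hold every token cached so far, which may not match how the cache is actually allocated.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int = 32,      # assumed, not from the PR
    num_kv_heads: int = 32,    # assumed, not from the PR
    head_dim: int = 128,       # assumed, not from the PR
    bytes_per_elem: int = 2,   # assumed fp16/bf16 storage
    batch_size: int = 1,
) -> int:
    """Estimate the bytes moved by one full copy of the kv cache."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return (
        2 * batch_size * num_layers * num_kv_heads
        * head_dim * seq_len * bytes_per_elem
    )


if __name__ == "__main__":
    for seq_len in (512, 2048, 4096):
        mib = kv_cache_bytes(seq_len) / 2**20
        print(f"seq_len={seq_len}: about {mib:.0f} MiB per redundant copy")
```

Under these assumed dimensions the estimate grows linearly with the number of cached tokens, which is consistent with the gain being reported for long text.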


Profiling after the change
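The idea behind the pass can be sketched on a toy IR. This is not Paddle's PIR API and not the code from this PR: the `Op` class, the `toy.*` op names, the value names, and the convention that the first input of `pd_op.assign_out_` is the copy source are all hypothetical and used only for illustration. The sketch assumes SSA-style values, so each output name is defined exactly once.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Op:
    """Toy op: a name plus input and output value names (hypothetical IR)."""
    name: str
    inputs: List[str]
    outputs: List[str] = field(default_factory=list)


def remove_trailing_assign_out(body: List[Op]) -> List[Op]:
    """Drop the run of pd_op.assign_out_ ops right before a final cf.yield.

    The yield is rewired to read each copy's source value directly.
    """
    if not body or body[-1].name != "cf.yield":
        return list(body)  # nothing to do without a terminating yield

    yield_op = body[-1]
    kept = list(body[:-1])
    forwarded: Dict[str, str] = {}  # copy result -> copy source

    # Only the contiguous run of copies directly before the yield is removed.
    while kept and kept[-1].name == "pd_op.assign_out_":
        assign = kept.pop()
        source = assign.inputs[0]  # assumed convention: first input is source
        for out in assign.outputs:
            forwarded[out] = source

    def resolve(value: str) -> str:
        # Follow chains of removed copies back to a value that still exists.
        while value in forwarded:
            value = forwarded[value]
        return value

    new_yield = Op("cf.yield", [resolve(v) for v in yield_op.inputs])
    return kept + [new_yield]


if __name__ == "__main__":
    body = [
        Op("toy.step", ["%i"], ["%i_next"]),
        Op("toy.update_cache", ["%cache"], ["%cache_next"]),
        Op("pd_op.assign_out_", ["%i_next", "%i"], ["%i_out"]),
        Op("pd_op.assign_out_", ["%cache_next", "%cache"], ["%cache_out"]),
        Op("cf.yield", ["%cond", "%i_out", "%cache_out"]),
    ]
    for op in remove_trailing_assign_out(body):
        print(op.name, op.inputs, op.outputs)
```

In this toy version the loop body shrinks from five ops to three, and the yield forwards `%i_next` and `%cache_next` without an intermediate copy. A real pass would also have to check whatever aliasing and in-place constraints the framework imposes, which this sketch ignores.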