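[Bug]: vllm0.8.4+vllm_ascend0.8.4rc2 (driver 24.1rc2, cann8.1rc1 (also tried cann8.0), torch_npu2.5.1): offline inference runs, but when the online inference service runs under concurrency, the operator library cannot link aclnnNonzeroV2 #1006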

Open
towy98 opened this issue May 29, 2025 · 0 comments
Labels
bug Something isn't working

towy98 commented May 29, 2025

Your current environment

The output of `python collect_env.py`
vllm0.8.4+vllm_ascend0.8.4rc2 (driver 24.1rc2, cann8.1rc1 (also tried cann8.0), torch_npu2.5.1)

🐛 Describe the bug

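Offline inference runs fine, but during concurrent testing of the online service the corresponding operator library cannot be found: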
call aclnnNonzeroV2 failed, detail:E39999: Inner Error!
E39999: [PID: 85141] 2025-05-29-12:09:01.468.164 The error from device(chipId:0, dieId:0), serial number is 3, an exception occurred during AICPU execution, stream_id:6, task_id:2895, errcode:11002, msg:open so failed.[FUNC:ProcessStarsAicpuErrorInfo][FILE:device_error_proc.cc][LINE:1479]
TraceBack (most recent call last):
Kernel task happen error, retCode=0x2a, [aicpu exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1356]
AICPU Kernel task happen error, retCode=0x2a.[FUNC:GetError][FILE:stream.cc][LINE:1124]
Aicpu kernel execute failed, device_id=0, stream_id=6, task_id=2895, errorCode=2a.[FUNC:PrintAicpuErrorInfo][FILE:davinci_kernel_task.cc][LINE:1120]
Aicpu kernel execute failed, device_id=0, stream_id=6, task_id=2895, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:1124]
rtStreamSynchronize execute failed, reason=[aicpu exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
rtStreamSynchronize failed. stream: 0x5062fad0
Kernel Run failed. opType: 51, NonZero
launch failed for NonZero, errno:507018.

RuntimeError('The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnNonzeroV2.\nSince the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.\nNote: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.\n[ERROR] 2025-05-29-12:09:01 (PID:85141, Device:0, RankID:-1) ERR00100 PTA call acl api failed.\n')
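
The error text above suggests setting ASCEND_LAUNCH_BLOCKING=1 to get an accurate stack trace. A minimal sketch of one way to do that for a single debugging run, assuming the server is started as a child process; the helper names `blocking_env` and `run_with_blocking` are hypothetical and not part of vllm or torch_npu:

```python
import os
import subprocess


def blocking_env(base=None):
    """Return a copy of the environment with synchronous op launch enabled."""
    env = dict(os.environ if base is None else base)
    # Variable name taken from the RuntimeError text above; it forces ops to
    # run synchronously, which costs performance.
    env["ASCEND_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking(cmd):
    """Run cmd (a list of arguments) with the variable set for the child only.

    The parent environment is left untouched, so nothing has to be unset
    after debugging.
    """
    return subprocess.run(cmd, env=blocking_env(), check=False).returncode
```

Passing the usual server launch command as a list to `run_with_blocking` limits the setting to that one run, which matches the note in the error about unsetting the variable once debugging is done.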

@towy98 towy98 added the bug Something isn't working label May 29, 2025
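@towy98 towy98 changed the title [Bug]: vllm0.8.4+vllm_ascend0.8.4rc2: offline inference runs, but when the online inference service runs under concurrency, the operator library cannot link aclnnNonzeroV2 [Bug]: vllm0.8.4+vllm_ascend0.8.4rc2 (driver 24.1rc2, cann8.1rc1 (cann8.0), torch_npu2.5.1): offline inference runs, but when the online inference service runs under concurrency, the operator library cannot link aclnnNonzeroV2 May 29, 2025
@towy98 towy98 changed the title [Bug]: vllm0.8.4+vllm_ascend0.8.4rc2 (driver 24.1rc2, cann8.1rc1 (cann8.0), torch_npu2.5.1): offline inference runs, but when the online inference service runs under concurrency, the operator library cannot link aclnnNonzeroV2 [Bug]: vllm0.8.4+vllm_ascend0.8.4rc2 (driver 24.1rc2, cann8.1rc1 (also tried cann8.0), torch_npu2.5.1): offline inference runs, but when the online inference service runs under concurrency, the operator library cannot link aclnnNonzeroV2 May 29, 2025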