Description
Describe the bug / 问题描述 (Mandatory / 必填)

Distributed inference of DeepSeek-R1-Distill-Qwen-32B on four Ascend 910A cards fails: every worker raises "TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element" inside model.generate().
- Hardware Environment (Ascend/GPU/CPU) / 硬件环境:

/device ascend
Ascend 910A 32G
- Software Environment / 软件环境 (Mandatory / 必填):
-- MindSpore version: 2.5.0
-- Python version: 3.10 (from the tracebacks below)
-- OS platform and distribution: Linux
-- Ascend HDK version: 24.1.RC3
-- CANN version: 8.0.0

- Execution Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph):

/mode pynative
The run follows the llm/inference/llama3 example.
To Reproduce / 重现步骤 (Mandatory / 必填)
Steps to reproduce the behavior:
- Follow the llm/inference/llama3 example, pointing it at DeepSeek-R1-Distill-Qwen-32B.
- Launch the four workers on a single node:
msrun --worker_num=4 --local_worker_num=4 --master_port=8118 --join=True --bind_core=True run_llama3_distributed.py
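Each worker then fails inside model.generate(). For reference, here is a minimal reconstruction of run_llama3_distributed.py from the tracebacks below; the model id, prompt, and generate() arguments are assumptions, not the exact script contents:

```python
# run_llama3_distributed.py, minimally reconstructed from the tracebacks below;
# the model id, prompt, and generate() arguments are assumptions.
import mindspore
from mindspore.communication import init
from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer

init()  # tracebacks show init() at line 7 of the script

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, ms_dtype=mindspore.float16)

inputs = tokenizer("Hello, who are you?", return_tensors="ms")
outputs = model.generate(**inputs, max_new_tokens=64)  # fails here (line 32 in the tracebacks)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```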
Expected behavior / 预期结果 (Mandatory / 必填)
The model loads across the four cards and model.generate() completes without error.
Screenshots / 日志 / 截图 (Mandatory / 必填)

Interleaved console output from the four workers, followed by excerpts from scheduler.log, worker_0.log, and worker_3.log:
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:16<00:49, 16.39s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:17<00:52, 17.51s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:25<00:24, 12.13s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:32<00:32, 16.43s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:40<00:12, 12.50s/it]low_cpu_mem usage is not avaliable.
[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:43<00:14, 14.47s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:312] Worker process 162172 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:318] There's worker exits with exception, kill all other workers.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.720.000 [mindspore/parallel/cluster/process_entity/_api.py:331] Scheduler process 162146 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.721.000 [mindspore/parallel/cluster/process_entity/_api.py:334] Analyzing exception log...
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.722.000 [mindspore/parallel/cluster/process_entity/_api.py:431] Time out nodes are ['3']
scheduler.log-58-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:20.507.092 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-59-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.237 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-60-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.316 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-61-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.452 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-62-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.508 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log:63:[ERROR] DISTRIBUTED(162146,ffff2b7ef120,python):2025-03-10-09:40:31.015.657 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log.
scheduler.log:64:[ERROR] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:35.507.622 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
scheduler.log:65:Traceback (most recent call last):
scheduler.log-66- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 7, in <module>
scheduler.log-67- init()
scheduler.log-68- File "/usr/local/lib/python3.10/dist-packages/mindspore/communication/management.py", line 198, in init
scheduler.log-69- init_cluster()
scheduler.log:70:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{3}, worker 3 is the first one timed out, please check its log.
scheduler.log-71-
scheduler.log-72-----------------------------------------------------
scheduler.log-73-- C++ Call Stack: (For framework developers)
scheduler.log-74-----------------------------------------------------
scheduler.log-75-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState
worker_0.log-44-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
worker_0.log:49:Traceback (most recent call last):
worker_0.log-50- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in <module>
worker_0.log-51- outputs = model.generate(
worker_0.log-52- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_0.log-53- return func(*args, **kwds)
worker_0.log-54- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
worker_0.log-60- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_0.log-61- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_0.log-62- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in __init__
worker_0.log-63- _check_input_data_type(input_data)
worker_0.log-64- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_0.log:65: raise TypeError(
worker_0.log:66:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
worker_3.log-43-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:16<00:32, 16.07s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
worker_3.log:48:Traceback (most recent call last):
worker_3.log-49- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in <module>
worker_3.log-50- outputs = model.generate(
worker_3.log-51- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_3.log-52- return func(*args, **kwds)
worker_3.log-53- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
worker_3.log-59- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_3.log-60- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_3.log-61- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in __init__
worker_3.log-62- _check_input_data_type(input_data)
worker_3.log-63- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_3.log:64: raise TypeError(
worker_3.log:65:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
worker_3.log-66-[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
worker_3.log-67-[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Traceback (most recent call last):
File "/usr/local/bin/msrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 150, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 144, in run
process_manager.run()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 225, in run
self.join_processes()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 336, in join_processes
raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: .
Additional context / 备注 (Optional / 选填)

All four workers fail at the same point: _prepare_special_tokens converts generation_config.eos_token_id to a tensor, and here that value is the list [151643, None]. MindSpore's Tensor constructor rejects the None element, so every worker raises the TypeError; the scheduler then times out worker 3 and msrun kills the remaining workers.
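A possible workaround until this is fixed, assuming the None really does arrive through the model's generation_config as the traceback suggests (a sketch, not a verified fix): drop the None entry before generating.

```python
# Sketch of a workaround (assumption: the stray None arrives via
# generation_config.eos_token_id, as the traceback indicates).
eos = model.generation_config.eos_token_id
if isinstance(eos, (list, tuple)):
    # Keep only real token ids; MindSpore's Tensor rejects None elements.
    model.generation_config.eos_token_id = [t for t in eos if t is not None]

outputs = model.generate(**inputs, max_new_tokens=64)
```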