
Error running DeepSeek-R1-Distill-Qwen-32B on four Ascend 910A cards #1983

Open
@RongRongStudio

Description


Describe the bug (Mandatory)
Running DeepSeek-R1-Distill-Qwen-32B distributed across four Ascend 910A cards fails with an error.

  • Hardware Environment (Ascend/GPU/CPU):
    Ascend 910A, 32 GB

  • Software Environment (Mandatory):
    -- MindSpore version: 2.5.0
    -- Python version: 3.10 (per the tracebacks)
    -- Ascend HDK 24.1.RC3
    -- CANN 8.0.0
  • Execute Mode (Mandatory) (PyNative/Graph):
    Not specified; following the llm/inference/llama3 example.
To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. Set up the llm/inference/llama3 example with DeepSeek-R1-Distill-Qwen-32B.
  2. Launch distributed inference:
     msrun --worker_num=4 --local_worker_num=4 --master_port=8118 --join=True --bind_core=True run_llama3_distributed.py
  3. See the error below.

Expected behavior (Mandatory)
Distributed inference completes without error.

Screenshots / Logs (Mandatory)
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:16<00:49, 16.39s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:17<00:52, 17.51s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:25<00:24, 12.13s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:32<00:32, 16.43s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:40<00:12, 12.50s/it]low_cpu_mem usage is not avaliable.
[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:43<00:14, 14.47s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
(A second worker rank prints the identical traceback, ending in the same TypeError.)
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:312] Worker process 162172 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:318] There's worker exits with exception, kill all other workers.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.720.000 [mindspore/parallel/cluster/process_entity/_api.py:331] Scheduler process 162146 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.721.000 [mindspore/parallel/cluster/process_entity/_api.py:334] Analyzing exception log...
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.722.000 [mindspore/parallel/cluster/process_entity/_api.py:431] Time out nodes are ['3']
scheduler.log-58-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:20.507.092 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-59-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.237 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-60-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.316 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-61-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.452 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-62-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.508 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log:63:[ERROR] DISTRIBUTED(162146,ffff2b7ef120,python):2025-03-10-09:40:31.015.657 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log.
scheduler.log:64:[ERROR] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:35.507.622 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
scheduler.log:65:Traceback (most recent call last):
scheduler.log-66- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 7, in
scheduler.log-67- init()
scheduler.log-68- File "/usr/local/lib/python3.10/dist-packages/mindspore/communication/management.py", line 198, in init
scheduler.log-69- init_cluster()
scheduler.log:70:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{3}, worker 3 is the first one timed out, please check its log.
scheduler.log-71-
scheduler.log-72-----------------------------------------------------
scheduler.log-73-- C++ Call Stack: (For framework developers)
scheduler.log-74-----------------------------------------------------
scheduler.log-75-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState

worker_0.log-44-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
worker_0.log:49:Traceback (most recent call last):
worker_0.log-50- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
worker_0.log-51- outputs = model.generate(
worker_0.log-52- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_0.log-53- return func(*args, **kwds)
worker_0.log-54- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate

worker_0.log-60- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_0.log-61- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_0.log-62- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
worker_0.log-63- _check_input_data_type(input_data)
worker_0.log-64- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_0.log:65: raise TypeError(
worker_0.log:66:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.

worker_3.log-43-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:16<00:32, 16.07s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
worker_3.log:48:Traceback (most recent call last):
worker_3.log-49- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
worker_3.log-50- outputs = model.generate(
worker_3.log-51- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_3.log-52- return func(*args, **kwds)
worker_3.log-53- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate

worker_3.log-59- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_3.log-60- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_3.log-61- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
worker_3.log-62- _check_input_data_type(input_data)
worker_3.log-63- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_3.log:64: raise TypeError(
worker_3.log:65:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
worker_3.log-66-[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
worker_3.log-67-[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
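The element check that every worker trips on can be approximated in plain Python. This is an illustration only, an assumption about MindSpore's behavior rather than its actual `_check_input_data_type` implementation, but it shows why the value `[151643, None]` from `generation_config.eos_token_id` is rejected:

```python
# Illustration only: an approximation (assumption, not MindSpore's real
# implementation) of the element check in mindspore/common/tensor.py that
# every worker trips on the value [151643, None].
def check_input_data_type(data):
    """Reject lists/tuples containing elements a Tensor cannot hold, e.g. None."""
    if isinstance(data, (list, tuple)):
        for item in data:
            check_input_data_type(item)
    elif not isinstance(data, (bool, int, float)):
        raise TypeError(
            f"For Tensor, the input_data is {data!r} that contain unsupported element."
        )

check_input_data_type([151643])          # fine: plain ints are accepted
# check_input_data_type([151643, None])  # raises TypeError, matching the logs
```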
Traceback (most recent call last):
File "/usr/local/bin/msrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 150, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 144, in run
process_manager.run()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 225, in run
self.join_processes()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 336, in join_processes
raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: .
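Reading the tracebacks together: each worker dies inside `_prepare_special_tokens`, where `generation_config.eos_token_id` resolves to the list `[151643, None]` and `mindspore.tensor` rejects the `None` element; once the workers exit, the scheduler reports node 3 as timed out and msrun tears down the job. A possible client-side workaround, sketched below but not verified on this hardware (`sanitize_eos_token_id` is a hypothetical helper, not a mindnlp API), is to strip the `None` from the generation config before calling `model.generate`:

```python
# Possible workaround sketch (unverified on Ascend 910A):
# generation_config.eos_token_id here resolves to [151643, None], and
# mindspore.tensor() rejects the None element. Stripping None before
# calling model.generate() should avoid the TypeError.
def sanitize_eos_token_id(eos_token_id):
    """Drop None entries from an eos_token_id that may be an int, a list, or None."""
    if isinstance(eos_token_id, (list, tuple)):
        cleaned = [tok for tok in eos_token_id if tok is not None]
        if not cleaned:
            return None
        return cleaned[0] if len(cleaned) == 1 else cleaned
    return eos_token_id

# Usage before generation (names taken from the traceback's call site):
# model.generation_config.eos_token_id = sanitize_eos_token_id(
#     model.generation_config.eos_token_id)   # [151643, None] -> 151643
# outputs = model.generate(**inputs)
```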


Labels: bug (Something isn't working)