
Error running DeepSeek-R1-Distill-Qwen-32B on four Ascend 910A cards #1983

Open
@RongRongStudio

Description


Describe the bug (Mandatory)
Running DeepSeek-R1-Distill-Qwen-32B distributed across four Ascend 910A cards fails with an error.

  • Hardware Environment (Ascend/GPU/CPU):
    Ascend 910A, 32 GB

  • Software Environment (Mandatory):
    -- MindSpore version: 2.5.0
    -- Python version: 3.10 (per the tracebacks)
    -- Ascend HDK 24.1.RC3
    -- CANN 8.0.0
  • Execute Mode (Mandatory) (PyNative/Graph):
    Not specified; following the llm/inference/llama3 example.
To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. Set up the llm/inference/llama3 example with DeepSeek-R1-Distill-Qwen-32B.
  2. Launch distributed inference:
     msrun --worker_num=4 --local_worker_num=4 --master_port=8118 --join=True --bind_core=True run_llama3_distributed.py
  3. See the error below.

Expected behavior (Mandatory)
Distributed inference completes without error.

Screenshots / Logs (Mandatory)
Loading checkpoint shards: 0%| | 0/4 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:16<00:49, 16.39s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 25%|█████████████████████████▌ | 1/4 [00:17<00:52, 17.51s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:25<00:24, 12.13s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 50%|███████████████████████████████████████████████████ | 2/4 [00:32<00:32, 16.43s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
Traceback (most recent call last):
File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
outputs = model.generate(
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate
self._prepare_special_tokens(generation_config, kwargs_has_attention_mask)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1609, in _prepare_special_tokens
eos_token_tensor = _tensor_or_none(generation_config.eos_token_id)
File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1606, in _tensor_or_none
return mindspore.tensor(token, dtype=mindspore.int64)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
_check_input_data_type(input_data)
File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
raise TypeError(
TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:40<00:12, 12.50s/it]low_cpu_mem usage is not avaliable.
[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
Loading checkpoint shards: 75%|████████████████████████████████████████████████████████████████████████████▌ | 3/4 [00:43<00:14, 14.47s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
(A second worker rank prints the identical traceback, ending in the same TypeError.)
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:312] Worker process 162172 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:12.444.000 [mindspore/parallel/cluster/process_entity/_api.py:318] There's worker exits with exception, kill all other workers.
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.720.000 [mindspore/parallel/cluster/process_entity/_api.py:331] Scheduler process 162146 exit with exception.
[WARNING] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.721.000 [mindspore/parallel/cluster/process_entity/_api.py:334] Analyzing exception log...
[ERROR] ME(162078:281473590168608,MainProcess):2025-03-10-09:40:38.722.000 [mindspore/parallel/cluster/process_entity/_api.py:431] Time out nodes are ['3']
scheduler.log-58-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:20.507.092 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-59-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.237 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-60-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:25.507.316 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log-61-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.452 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:98] Finalize] The meta server node can not be finalized because there are still 4 alive nodes.
scheduler.log-62-[WARNING] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:30.507.508 [mindspore/ccsrc/distributed/cluster/cluster_context.cc:153] Finalize] This log means the cluster is successfully created. Retry to finalize the node and exit cluster...
scheduler.log:63:[ERROR] DISTRIBUTED(162146,ffff2b7ef120,python):2025-03-10-09:40:31.015.657 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:511] UpdateTopoState] The node: 3 is timed out. It may exit with exception, please check this node's log.
scheduler.log:64:[ERROR] DISTRIBUTED(162146,ffffb85d0c80,python):2025-03-10-09:40:35.507.622 [mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:103] Finalize] There are 1 abnormal compute graph nodes.
scheduler.log:65:Traceback (most recent call last):
scheduler.log-66- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 7, in
scheduler.log-67- init()
scheduler.log-68- File "/usr/local/lib/python3.10/dist-packages/mindspore/communication/management.py", line 198, in init
scheduler.log-69- init_cluster()
scheduler.log:70:RuntimeError: The total number of timed out node is 1. Timed out node list is: [const vector]{3}, worker 3 is the first one timed out, please check its log.
scheduler.log-71-
scheduler.log-72-----------------------------------------------------
scheduler.log-73-- C++ Call Stack: (For framework developers)
scheduler.log-74-----------------------------------------------------
scheduler.log-75-mindspore/ccsrc/distributed/cluster/topology/meta_server_node.cc:517 UpdateTopoState

worker_0.log-44-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:20<00:40, 20.20s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:29<00:13, 13.90s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:48<00:00, 16.07s/it]
worker_0.log:49:Traceback (most recent call last):
worker_0.log-50- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
worker_0.log-51- outputs = model.generate(
worker_0.log-52- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_0.log-53- return func(*args, **kwds)
worker_0.log-54- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate

worker_0.log-60- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_0.log-61- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_0.log-62- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
worker_0.log-63- _check_input_data_type(input_data)
worker_0.log-64- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_0.log:65: raise TypeError(
worker_0.log:66:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.

worker_3.log-43-Sliding Window Attention is enabled but not implemented for eager; unexpected results may be encountered.
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 33%|██████████████████████████████████ | 1/3 [00:16<00:32, 16.07s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 67%|████████████████████████████████████████████████████████████████████ | 2/3 [00:31<00:15, 15.96s/it]low_cpu_mem usage is not avaliable.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:39<00:00, 13.20s/it]
worker_3.log:48:Traceback (most recent call last):
worker_3.log-49- File "/data/mindnlp-master/llm/inference/llama3/run_llama3_distributed.py", line 32, in
worker_3.log-50- outputs = model.generate(
worker_3.log-51- File "/usr/lib/python3.10/contextlib.py", line 79, in inner
worker_3.log-52- return func(*args, **kwds)
worker_3.log-53- File "/usr/local/lib/python3.10/dist-packages/mindnlp/transformers/generation/utils.py", line 1789, in generate

worker_3.log-59- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 113, in tensor
worker_3.log-60- return Tensor(input_data, dtype, shape, init, internal, const_arg) # @jit.typing: () -> tensor_type[{dtype}]
worker_3.log-61- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 258, in init
worker_3.log-62- _check_input_data_type(input_data)
worker_3.log-63- File "/usr/local/lib/python3.10/dist-packages/mindspore/common/tensor.py", line 71, in _check_input_data_type
worker_3.log:64: raise TypeError(
worker_3.log:65:TypeError: For Tensor, the input_data is [151643, None] that contain unsupported element.
worker_3.log-66-[INFO] PS(162172,ffff0ca8f120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_server.cc:220] Start] Event base dispatch success!
worker_3.log-67-[INFO] PS(162172,fffef7fff120,python):2025-03-10-09:40:04.783.867 [mindspore/ccsrc/ps/core/communicator/tcp_client.cc:318] Start] Event base dispatch success!
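The element check that every worker trips on can be approximated in plain Python. This is an illustration only, an assumption about MindSpore's behavior rather than its actual `_check_input_data_type` implementation, but it shows why the value `[151643, None]` from `generation_config.eos_token_id` is rejected:

```python
# Illustration only: an approximation (assumption, not MindSpore's real
# implementation) of the element check in mindspore/common/tensor.py that
# every worker trips on the value [151643, None].
def check_input_data_type(data):
    """Reject lists/tuples containing elements a Tensor cannot hold, e.g. None."""
    if isinstance(data, (list, tuple)):
        for item in data:
            check_input_data_type(item)
    elif not isinstance(data, (bool, int, float)):
        raise TypeError(
            f"For Tensor, the input_data is {data!r} that contain unsupported element."
        )

check_input_data_type([151643])          # fine: plain ints are accepted
# check_input_data_type([151643, None])  # raises TypeError, matching the logs
```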
Traceback (most recent call last):
File "/usr/local/bin/msrun", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 150, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/run.py", line 144, in run
process_manager.run()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 225, in run
self.join_processes()
File "/usr/local/lib/python3.10/dist-packages/mindspore/parallel/cluster/process_entity/_api.py", line 336, in join_processes
raise RuntimeError("Distributed job exited with exception. Please check logs in "
RuntimeError: Distributed job exited with exception. Please check logs in directory: .
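Reading the tracebacks together: each worker dies inside `_prepare_special_tokens`, where `generation_config.eos_token_id` resolves to the list `[151643, None]` and `mindspore.tensor` rejects the `None` element; once the workers exit, the scheduler reports node 3 as timed out and msrun tears down the job. A possible client-side workaround, sketched below but not verified on this hardware (`sanitize_eos_token_id` is a hypothetical helper, not a mindnlp API), is to strip the `None` from the generation config before calling `model.generate`:

```python
# Possible workaround sketch (unverified on Ascend 910A):
# generation_config.eos_token_id here resolves to [151643, None], and
# mindspore.tensor() rejects the None element. Stripping None before
# calling model.generate() should avoid the TypeError.
def sanitize_eos_token_id(eos_token_id):
    """Drop None entries from an eos_token_id that may be an int, a list, or None."""
    if isinstance(eos_token_id, (list, tuple)):
        cleaned = [tok for tok in eos_token_id if tok is not None]
        if not cleaned:
            return None
        return cleaned[0] if len(cleaned) == 1 else cleaned
    return eos_token_id

# Usage before generation (names taken from the traceback's call site):
# model.generation_config.eos_token_id = sanitize_eos_token_id(
#     model.generation_config.eos_token_id)   # [151643, None] -> 151643
# outputs = model.generate(**inputs)
```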


Labels: bug (Something isn't working)