
CANN error when running the llm/peft/lora/lora_seq2seq.ipynb example #1917

Open

Tridu33 opened this issue Jan 22, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@Tridu33
Contributor

Tridu33 commented Jan 22, 2025

Describe the bug (Mandatory)
The lora_seq2seq example fails with a CANN error.

• Software Environment (Mandatory):
$ pip list | grep mind
mindnlp                           0.4.1
mindspore                         2.3.1
$ uname -a
Linux k8s-master 5.4.0-125-generic #141-Ubuntu SMP Wed Aug 10 14:03:41 UTC 2022 aarch64 aarch64 aarch64 GNU/Linux

CANN

tridu33@k8s-master:~$ npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 23.0.0                   Version: 23.0.0                                               |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B                | OK            | 70.8        44                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           1162 / 13553      9    / 32768         |
+===========================+===============+====================================================+
| 1     910B                | OK            | 65.9        40                0    / 0             |
| 0                         | 0000:81:00.0  | 0           1734 / 15665      12   / 32768         |
+===========================+===============+====================================================+
| 2     910B                | OK            | 67.6        38                0    / 0             |
| 0                         | 0000:41:00.0  | 0           2411 / 15665      9    / 32768         |
+===========================+===============+====================================================+
| 3     910B                | OK            | 67.0        44                0    / 0             |
| 0                         | 0000:01:00.0  | 0           3067 / 15567      8    / 32768         |
+===========================+===============+====================================================+
| 4     910B                | OK            | 69.9        42                0    / 0             |
| 0                         | 0000:C2:00.0  | 0           1267 / 13553      10   / 32768         |
+===========================+===============+====================================================+
| 5     910B                | OK            | 65.8        38                0    / 0             |
| 0                         | 0000:82:00.0  | 0           2012 / 15665      10   / 32768         |
+===========================+===============+====================================================+
| 6     910B                | OK            | 68.6        40                0    / 0             |
| 0                         | 0000:42:00.0  | 0           2484 / 15665      10   / 32768         |
+===========================+===============+====================================================+
| 7     910B                | OK            | 67.6        44                0    / 0             |
| 0                         | 0000:02:00.0  | 0           2595 / 15567      8    / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 0                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 2                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 5                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 6                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+
tridu33@k8s-master:~$ cat /usr/local/Ascend/version.info
version=23.0.0
tridu33@k8s-master:~$ cat /usr/local/Ascend/ascend-toolkit/latest/aarch64-linux/ascend_toolkit_install.info
package_name=Ascend-cann-toolkit
version=8.0.RC3.alpha002
innerversion=V100R001C77B220SPC008
compatible_version=[V100R001C80,V100R001C84],[V100R001C77,V100R001C79],[V100R001C29],[V100R001C11,V100R001C50]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.0.RC3.alpha002/aarch64-linux
tridu33@k8s-master:~$ cat /usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/kernel/version.info
Version=7.5.T6.0.B036
version_dir=8.0.RC3.alpha002
timestamp=20240821_104556995
ops_version=7.5.T6.0.B036
adk_version=7.5.T6.0.B036
required_package_amct_acl_version="7.5"
required_package_aoe_version="7.5"
required_package_compiler_version="7.5"
required_package_fwkplugin_version="7.5"
required_package_hccl_version="7.5"
required_package_nca_version="7.5"
required_package_ncs_version="7.5"
required_package_opp_version="7.5"
required_package_runtime_version="7.5"
required_package_toolkit_version="7.5"
tridu33@k8s-master:~$ python -c "import acl;" # no error

tridu33@k8s-master:~$ cat /usr/local/Ascend/firmware/version.info
cat: /usr/local/Ascend/firmware/version.info: Permission denied
tridu33@k8s-master:~$ cat /usr/local/Ascend/driver/version.info
Version=23.0.0
ascendhal_version=7.35.19
aicpu_version=1.0
tdt_version=1.0
log_version=1.0
prof_version=2.0
dvppkernels_version=1.1
tsfw_version=1.0
Innerversion=V100R001C15SPC002B224
compatible_version=[V100R001C29],[V100R001C30],[V100R001C13],[V100R001C15]
compatible_version_fw=[7.0.0,7.1.99]
package_version=23.0.0

Hardware Environment: Ascend 910B

To Reproduce (Mandatory)
~/workspace/githubSrc/mindnlp/llm/peft/lora$ python lora_seq2seq.py

Screenshots / Logs (Mandatory)

MT5ForConditionalGeneration has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions.
  - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception).
  - If you are not the owner of the model architecture class, please contact the model code owner to update it.
[MS_ALLOC_CONF]Runtime config:  enable_vmm:True  vmm_align_size:2MB
Traceback (most recent call last):
  File "/home/usersshared/githubSrc/mindnlp/llm/peft/lora/lora_seq2seq.py", line 34, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/auto/auto_factory.py", line 510, in from_pretrained
    return model_class.from_pretrained(
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/modeling_utils.py", line 3126, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 1134, in __init__
    self.encoder = MT5Stack(encoder_config, self.shared)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 715, in __init__
    [MT5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 715, in <listcomp>
    [MT5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 460, in __init__
    self.layer.append(MT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 389, in __init__
    self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 59, in __init__
    self.weight = nn.Parameter(ops.ones(hidden_size))
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/core/ops/creation.py", line 62, in ones
    return mindspore.mint.ones(size, dtype=dtype)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/mint/__init__.py", line 692, in ones
    return ops.auto_generate.ones(size, dtype)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/ops/auto_generate/gen_ops_def.py", line 3971, in ones
    return ones_op(shape, dtype)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/ops/operations/manually_defined/ops_def.py", line 1817, in __call__
    return _convert_stub(pyboost_ones(self, [size, type if type is None \
RuntimeError: Initialize GE failed!

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
EC0010: [PID: 2179977] 2025-01-23-01:09:06.098.879 Failed to import Python module [AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead..].
        Solution: Check that all required components are properly installed and the specified Python path matches the Python installation directory. (If the path does not match the directory, run set_env.sh in the installation package.)
        TraceBack (most recent call last):
        AOE Failed to call InitCannKB
        [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1719]
        [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:79]
        [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:120]
        [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:117]
        PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:82]
        OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:234]
        GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:162]
        GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api.cc][LINE:306]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_device_context.cc:253 InitGe
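
The EC0010 message points at the root cause: CANN's TBE initialization imports a Python module that still uses the `np.float_` alias, which NumPy 2.0 removed, so GE initialization fails before the model ever runs. A minimal check of the environment, with a commonly used workaround sketched as a comment (pinning NumPy below 2.0 is an assumption here, not an officially documented fix):

import numpy as np

# CANN's TBE Python components still reference the `np.float_` alias that
# NumPy 2.0 removed. Confirm the mismatch in the current environment:
print(np.__version__)            # a 2.x version here triggers the EC0010 failure
print(hasattr(np, "float_"))     # False on NumPy >= 2.0, True on 1.x

# Assumed workaround: pin NumPy below 2.0 in the environment that CANN's
# Python components import from, e.g.
#   pip install "numpy<2"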
Tridu33 added the bug label Jan 22, 2025
@Tridu33
Contributor Author

Tridu33 commented Jan 22, 2025

Similarly, the roberta_sequence_classification.ipynb example (`python roberta_sequence_classification.py`) also hits a MindSpore exception:

Traceback (most recent call last):
  File "/home/usersshared/githubSrc/mindnlp/llm/peft/lora/roberta_sequence_classification.py", line 70, in <module>
    print(next(datasets['train'].create_dict_iterator()))
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 152, in __next__
    data = self._get_next()
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 277, in _get_next
    raise err
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 260, in _get_next
    return {k: self._transform_md_to_output(t) for k, t in self._iterator.GetNextAsMap().items()}
RuntimeError: Exception thrown from user defined Python function in dataset. 

------------------------------------------------------------------
- Python Call Stack: 
------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/dataset/engine/datasets_user_defined.py", line 104, in _cpp_sampler_fn
    yield _convert_row(val)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/dataset/engine/datasets_user_defined.py", line 173, in _convert_row
    item = np.array(x, copy=False)
ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

------------------------------------------------------------------
- Dataset Pipeline Error Message: 
------------------------------------------------------------------
[ERROR] Execute user Python code failed, check 'Python Call Stack' above.

------------------------------------------------------------------
- C++ Call Stack: (For framework developers) 
------------------------------------------------------------------
mindspore/ccsrc/minddata/dataset/engine/datasetops/source/generator_op.cc(261).
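
This second failure is the same NumPy 2.0 migration problem, surfacing through the dataset pipeline this time: in NumPy 2.0, `copy=False` means "never copy, raise if a copy is required" instead of the 1.x meaning "copy only if needed". A minimal illustration of the behavior change (illustrative only, independent of MindSpore):

import numpy as np

# Converting a Python list always requires a copy, so on NumPy >= 2.0 this
# raises the same ValueError seen in _convert_row above; NumPy 1.x copied
# silently instead.
try:
    np.array([1, 2, 3], copy=False)
except ValueError as err:
    print(err)

# The replacement recommended by the NumPy migration guide: asarray copies
# only when needed and behaves identically on 1.x and 2.x.
arr = np.asarray([1, 2, 3])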

@Tridu33
Contributor Author

Tridu33 commented Jan 23, 2025

The CANN stack versions are mismatched: the 23.0.0 HDK is wrong for this MindSpore release; an HDK 24.x version must be installed according to https://www.mindspore.cn/versions.
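
After aligning the driver and CANN versions with the table at https://www.mindspore.cn/versions, the quickest sanity test is MindSpore's built-in installation check; a minimal sketch (assuming Ascend is the configured device target):

import mindspore

# run_check loads the backend and executes a small computation, printing the
# installed MindSpore version and whether the installation works.
mindspore.run_check()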

@Tridu33
Contributor Author

Tridu33 commented Jan 23, 2025

After upgrading the NPU driver to HDK 24.1.rc2 and CANN to 24.1.rc2, running ~/workspace/githubSrc/mindnlp/llm/peft/lora$ python lora_seq2seq.py produces a different error:

[ERROR] RUNTIME(33130,python):2025-01-24-01:36:21.067.639 [driver.cc:65]33130 GetDeviceCount:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(33130,python):2025-01-24-01:36:21.067.803 [driver.cc:65]33130 GetDeviceCount:Call drvGetDevNum, drvRetCode=7.
[ERROR] RUNTIME(33130,python):2025-01-24-01:36:21.068.050 [api_c_device.cc:21]33130 rtGetDeviceCount:ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020010
[ERROR] RUNTIME(33130,python):2025-01-24-01:36:21.068.122 [error_message_manage.cc:53]33130 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(33130,python):2025-01-24-01:36:21.068.209 [error_message_manage.cc:53]33130 FuncErrorReason:rtGetDeviceCount execute failed, reason=[driver error:internal error]
[ERROR] ASCENDCL(33130,python):2025-01-24-01:36:21.068.345 [device.cpp:366]33130 aclrtGetDeviceCount: get device count failed, runtime result = 507899.
Traceback (most recent call last):
  File "/home/usersshared/githubSrc/mindnlp/llm/peft/lora/lora_seq2seq.py", line 34, in <module>
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name_or_path)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/auto/auto_factory.py", line 510, in from_pretrained
    return model_class.from_pretrained(
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/modeling_utils.py", line 3126, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 1134, in __init__
    self.encoder = MT5Stack(encoder_config, self.shared)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 715, in __init__
    [MT5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 715, in <listcomp>
    [MT5Block(config, has_relative_attention_bias=bool(i == 0)) for i in range(config.num_layers)]
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 460, in __init__
    self.layer.append(MT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 389, in __init__
    self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/transformers/models/mt5/modeling_mt5.py", line 59, in __init__
    self.weight = nn.Parameter(ops.ones(hidden_size))
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindnlp/core/ops/creation.py", line 62, in ones
    return mindspore.mint.ones(size, dtype=dtype)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/mint/__init__.py", line 692, in ones
    return ops.auto_generate.ones(size, dtype)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/ops/auto_generate/gen_ops_def.py", line 3971, in ones
    return ones_op(shape, dtype)
  File "/home/tridu33/.conda/envs/mindnlp/lib/python3.9/site-packages/mindspore/ops/operations/manually_defined/ops_def.py", line 1817, in __call__
    return _convert_stub(pyboost_ones(self, [size, type if type is None \
RuntimeError: Ascend kernel runtime initialization failed. The details refer to 'Ascend Error Message'.

----------------------------------------------------
- Framework Error Message: (For framework developers)
----------------------------------------------------
Call rtGetDeviceCount, ret[507899]

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:358 Init
mindspore/ccsrc/plugin/device/ascend/hal/device/ascend_kernel_runtime.cc:642 SetRtDevice

[INFO] RUNTIME(33130,python):2025-01-24-01:36:22.458.448 [runtime.cc:1991] 33130 ~Runtime: deconstruct runtime
[INFO] RUNTIME(33130,python):2025-01-24-01:36:22.463.831 [runtime.cc:1998] 33130 ~Runtime: wait monitor success, use=0.
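
Note that this new failure happens in the driver layer before MindSpore is involved: `rtGetDeviceCount`/`aclrtGetDeviceCount` fails with EL9999 and drvRetCode=7, which usually indicates the installed driver/firmware does not match the new CANN runtime. A minimal probe through pyACL that mirrors the failing call (the `acl` module ships with the CANN toolkit; interpreting drvRetCode=7 as a driver/runtime mismatch is an assumption):

import acl  # pyACL, installed with the CANN toolkit

# Mirror the aclrtGetDeviceCount call that fails in the log above.
# ret == 0 means ACL_SUCCESS; anything else reproduces the driver error
# without going through MindSpore.
ret = acl.init()
print("acl.init ->", ret)

count, ret = acl.rt.get_device_count()
print(f"acl.rt.get_device_count -> count={count}, ret={ret}")

acl.finalize()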
