CodeGen inference error "synNodeCreateWithId failed for node: batch_gemm with synStatus 26" #1314

caijimin · 2024-09-05T09:43:52Z

System Info

+-----------------------------------------------------------------------------+
| HL-SMI Version:                                hl-1.17.0-fw-51.1.0          |
| Driver Version:                                     1.17.0-8a5dfb8          |
|-------------------------------+----------------------+----------------------+

docker: vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

# cd examples/text-generation
python run_generation.py --model_name_or_path /DISK0/codegen-6B-multi --use_hpu_graphs --use_kv_cache --max_new_tokens 100 --do_sample --prompt "solve the quick sort problem" --bf16


/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
09/05/2024 09:23:40 - INFO - __main__ - Single-device run.
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 112
CPU RAM       : 1056426836 KB
------------------------------------------------------------------------------
09/05/2024 09:23:48 - INFO - __main__ - Args: Namespace(device='hpu', model_name_or_path='/DISK0/codegen-6B-multi', bf16=True, max_new_tokens=100, max_input_tokens=0, batch_size=1, warmup=3, n_iterations=5, local_rank=0, use_kv_cache=True, use_hpu_graphs=True, dataset_name=None, column_name=None, do_sample=True, num_beams=1, top_k=None, penalty_alpha=None, trim_logits=False, seed=27, profiling_warmup_steps=0, profiling_steps=0, profiling_record_shapes=False, prompt=['solve the quick sort problem'], bad_words=None, force_words=None, assistant_model=None, peft_model=None, num_return_sequences=1, token=None, model_revision='main', attn_softmax_bf16=False, output_dir=None, bucket_size=-1, bucket_internal=False, dataset_max_samples=-1, limit_hpu_graphs=False, reuse_cache=False, verbose_workers=False, simulate_dyn_prompt=None, reduce_recompile=False, use_flash_attention=False, flash_attention_recompute=False, flash_attention_causal_mask=False, flash_attention_fast_softmax=False, book_source=False, torch_compile=False, ignore_eos=True, temperature=1.0, top_p=1.0, const_serialization_path=None, disk_offload=False, trust_remote_code=False, load_quantized_model=False, parallel_strategy='none', quant_config='', world_size=0, global_rank=0)
09/05/2024 09:23:48 - INFO - __main__ - device: hpu, n_hpu: 0, bf16: True
09/05/2024 09:23:48 - INFO - __main__ - Model initialization took 8.896s
09/05/2024 09:23:48 - INFO - __main__ - Graph compilation...
Warming up iteration 1/3
Traceback (most recent call last):
  File "/DISK0/jimin/optimum-habana/examples/text-generation/run_generation.py", line 692, in <module>
    main()
  File "/DISK0/jimin/optimum-habana/examples/text-generation/run_generation.py", line 461, in main
    generate(None, args.reduce_recompile)
  File "/DISK0/jimin/optimum-habana/examples/text-generation/run_generation.py", line 432, in generate
    outputs = model.generate(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 1287, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/optimum/habana/transformers/generation/utils.py", line 2246, in _sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1535, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1544, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 724, in forward
    return wrapped_hpugraph_forward(
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 598, in wrapped_hpugraph_forward
    graph.capture_end()
  File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 46, in capture_end
    _hpu_C.capture_end(self.hpu_graph)
RuntimeError: synNodeCreateWithId failed for node: batch_gemm with synStatus 26 [Generic failure]. .
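
For triage, it can help to pull the failing call, node name, and status code out of the last line of the traceback. The helper below is a minimal sketch. The names `SynError` and `parse_syn_error` are hypothetical, and the pattern is modelled only on the one error line in this report, so other Synapse errors may be worded differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern based on the single error line in this report (an assumption;
# it is not taken from any Habana/Synapse documentation).
_SYN_ERROR = re.compile(
    r"(?P<call>syn\w+) failed for node: (?P<node>\S+) "
    r"with synStatus (?P<status>\d+)(?: \[(?P<detail>[^\]]*)\])?"
)


class SynError(NamedTuple):
    """Fields extracted from a 'syn... failed for node' message (hypothetical type)."""
    call: str              # e.g. "synNodeCreateWithId"
    node: str              # e.g. "batch_gemm"
    status: int            # numeric synStatus code
    detail: Optional[str]  # bracketed text, if present


def parse_syn_error(message: str) -> Optional[SynError]:
    """Return the parsed fields, or None if the message does not match."""
    match = _SYN_ERROR.search(message)
    if match is None:
        return None
    return SynError(
        match["call"], match["node"], int(match["status"]), match["detail"]
    )
```

Calling `parse_syn_error(str(exc))` on the `RuntimeError` above would give the node `batch_gemm` and status `26`, which makes it easier to group reports that fail on the same node.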

Expected behavior

The run completes without error.


regisss · 2024-09-05T17:21:28Z

I cannot reproduce this. Could you try again after upgrading the library with the following command?

pip install -U optimum-habana

LeoZhao-Intel · 2024-09-06T06:28:19Z

This issue can only be reproduced on the PRC SKU, where FP32 GEMM is disabled.

regisss · 2024-09-06T08:19:58Z

> This issue can only be reproduced on the PRC SKU, where FP32 GEMM is disabled.

What is the PRC SKU?

LeoZhao-Intel · 2024-09-06T08:22:18Z

Gaudi2D. On this SKU, MME FP32 is disabled.

caijimin added the bug label (Something isn't working) · Sep 5, 2024
