Error generating text when using the exllama_HF loader and using a grammar file #6503

Open · 1 task done
GregorioBrc opened this issue on Oct 30, 2024 · 0 comments
Labels: bug (Something isn't working)

Describe the bug

When generating a response with the ExLlamav2_HF loader and the roleplay grammar file, the model produces only a few tokens of text before a CUDA device-side assert is raised, and the console prints errors from several different parts of the code.
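For context, here is a minimal PyTorch sketch of what that assert most likely corresponds to, assuming (my guess, not confirmed for this loader) that the grammar-constrained sampling ends up masking every candidate token at some step:

```python
# Minimal sketch of the suspected failure mode (an assumption, not a confirmed diagnosis):
# if the grammar processor forbids every token at some step, the logits row is all -inf,
# softmax turns it into NaN, and sampling trips the same assert shown in the traceback.
import torch

logits = torch.full((1, 8), float("-inf"))  # every candidate token masked out
probs = torch.softmax(logits, dim=-1)       # all exp(-inf)=0, sum=0, 0/0 -> NaN row
print(probs)                                # tensor([[nan, nan, ..., nan]])

# On CPU this raises "probability tensor contains either `inf`, `nan` or element < 0";
# on CUDA the equivalent check fires as the device-side assert seen in the log below.
torch.multinomial(probs, num_samples=1)
```

If that is what is happening, the problem would lie in how the grammar logits processor interacts with the ExLlamav2_HF sampling path rather than in the model itself.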

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

1. Load a GPTQ or ExLlamav2 model with the ExLlamav2_HF loader.
2. Load a grammar file (in my case, the roleplay grammar file).
3. Try to generate a response.
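
The steps above use the Gradio UI. As a possible programmatic variant, the sketch below assumes the webui was started with --api, that its OpenAI-compatible server listens on the default port 5000, and that it accepts a grammar_string generation field; all three are assumptions about this particular setup:

```python
# Hypothetical API-based reproduction (assumes --api is enabled, the server listens on
# the default port 5000, and "grammar_string" is accepted as an extra generation field;
# these are assumptions, the original report used the Gradio UI).
import requests

with open("grammars/roleplay.gbnf") as f:  # path assumed; use the grammar file loaded in the UI
    grammar = f.read()

response = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    json={
        "prompt": "Write a short roleplay reply.",
        "max_tokens": 200,
        "grammar_string": grammar,
    },
    timeout=120,
)
print(response.json())
```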

Screenshot

(Three screenshots of the error in the UI and console were attached to the original issue.)

Logs

changed 22 packages, and audited 23 packages in 1s

3 packages are looking for funding
  run `npm fund` for details

1 moderate severity vulnerability

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.
/content/text-generation-webui
02:30:11-152229 INFO     Starting Text generation web UI                                            

Running on local URL:  http://127.0.0.1:7860

UI finished loading, trying to launch localtunnel (if it gets stuck here localtunnel is having issues)

Running on public URL: https://702c9d7284631c1938.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
The password/enpoint ip for localtunnel is: 34.168.100.7
your url is: https://easy-cases-marry.loca.lt
02:32:48-563494 INFO     Loading "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw"                       
 ## Warning: Flash Attention is installed but unsupported GPUs were detected.
2024-10-30 02:32:52.059218: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 02:32:52.091130: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 02:32:52.101257: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-30 02:32:52.135907: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-30 02:32:54.046606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
02:33:50-722732 INFO     Loaded "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw" in 62.16 seconds.      
02:33:50-726901 INFO     LOADER: "ExLlamav2_HF"                                                     
02:33:50-727873 INFO     TRUNCATION LENGTH: 16000                                                   
02:33:50-728797 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"              
Warning: unrecognized tokenizer: using default token formatting
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3195, in _sample
    while self._has_unfinished_sequences(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2413, in _has_unfinished_sequences
    elif this_peer_finished:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
Exception in thread Thread-3 (gentask):
  File "/content/text-generation-webui/modules/text_generation.py", line 407, in generate_reply_HF
    if output[-1] in eos_token_ids:
Traceback (most recent call last):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/text-generation-webui/modules/text_generation.py", line 403, in generate_reply_HF
    with generate_with_streaming(**generate_params) as generator:
  File "/content/text-generation-webui/modules/callbacks.py", line 94, in __exit__
    clear_torch_cache()
  File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Output generated in 4.44 seconds (0.90 tokens/s, 4 tokens, context 359, seed 1331220601)
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/content/text-generation-webui/modules/callbacks.py", line 68, in gentask
    clear_torch_cache()
  File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

System Info

Google Colab
GregorioBrc added the bug label on Oct 30, 2024.