Error generating text when using the exllama_HF loader and using a grammar file #6503

Open · 1 task done
GregorioBrc opened this issue on Oct 30, 2024 · 0 comments
Labels: bug (Something isn't working)

Describe the bug

When generating a response with the ExLlamav2_HF loader and the roleplay grammar file, the model produces only a few tokens of text before a CUDA device-side assert is raised, and the console prints errors from several different parts of the code.
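For context, here is a minimal PyTorch sketch of what that assert most likely corresponds to, assuming (my guess, not confirmed for this loader) that the grammar-constrained sampling ends up masking every candidate token at some step:

```python
# Minimal sketch of the suspected failure mode (an assumption, not a confirmed diagnosis):
# if the grammar processor forbids every token at some step, the logits row is all -inf,
# softmax turns it into NaN, and sampling trips the same assert shown in the traceback.
import torch

logits = torch.full((1, 8), float("-inf"))  # every candidate token masked out
probs = torch.softmax(logits, dim=-1)       # all exp(-inf)=0, sum=0, 0/0 -> NaN row
print(probs)                                # tensor([[nan, nan, ..., nan]])

# On CPU this raises "probability tensor contains either `inf`, `nan` or element < 0";
# on CUDA the equivalent check fires as the device-side assert seen in the log below.
torch.multinomial(probs, num_samples=1)
```

If that is what is happening, the problem would lie in how the grammar logits processor interacts with the ExLlamav2_HF sampling path rather than in the model itself.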

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

1. Load a GPTQ or ExLlamav2 model with the ExLlamav2_HF loader.
2. Load a grammar file (in my case, the roleplay grammar file).
3. Try to generate a response.
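
The steps above use the Gradio UI. As a possible programmatic variant, the sketch below assumes the webui was started with --api, that its OpenAI-compatible server listens on the default port 5000, and that it accepts a grammar_string generation field; all three are assumptions about this particular setup:

```python
# Hypothetical API-based reproduction (assumes --api is enabled, the server listens on
# the default port 5000, and "grammar_string" is accepted as an extra generation field;
# these are assumptions, the original report used the Gradio UI).
import requests

with open("grammars/roleplay.gbnf") as f:  # path assumed; use the grammar file loaded in the UI
    grammar = f.read()

response = requests.post(
    "http://127.0.0.1:5000/v1/completions",
    json={
        "prompt": "Write a short roleplay reply.",
        "max_tokens": 200,
        "grammar_string": grammar,
    },
    timeout=120,
)
print(response.json())
```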

Screenshot

(Three screenshots of the error in the UI and console were attached to the original issue.)

Logs

changed 22 packages, and audited 23 packages in 1s

3 packages are looking for funding
  run `npm fund` for details

1 moderate severity vulnerability

To address all issues (including breaking changes), run:
  npm audit fix --force

Run `npm audit` for details.
/content/text-generation-webui
02:30:11-152229 INFO     Starting Text generation web UI                                            

Running on local URL:  http://127.0.0.1:7860

UI finished loading, trying to launch localtunnel (if it gets stuck here localtunnel is having issues)

Running on public URL: https://702c9d7284631c1938.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
The password/enpoint ip for localtunnel is: 34.168.100.7
your url is: https://easy-cases-marry.loca.lt
02:32:48-563494 INFO     Loading "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw"                       
 ## Warning: Flash Attention is installed but unsupported GPUs were detected.
2024-10-30 02:32:52.059218: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-30 02:32:52.091130: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-30 02:32:52.101257: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-30 02:32:52.135907: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-30 02:32:54.046606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
02:33:50-722732 INFO     Loaded "Epiculous_Violet_Twilight-v0.2-exl2_4.0bpw" in 62.16 seconds.      
02:33:50-726901 INFO     LOADER: "ExLlamav2_HF"                                                     
02:33:50-727873 INFO     TRUNCATION LENGTH: 16000                                                   
02:33:50-728797 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"              
Warning: unrecognized tokenizer: using default token formatting
../aten/src/ATen/native/cuda/TensorCompare.cu:110: _assert_async_cuda_kernel: block: [0,0,0], thread: [0,0,0] Assertion `probability tensor contains either `inf`, `nan` or element < 0` failed.
Traceback (most recent call last):
  File "/content/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/content/text-generation-webui/modules/text_generation.py", line 398, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 3195, in _sample
    while self._has_unfinished_sequences(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2413, in _has_unfinished_sequences
    elif this_peer_finished:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
Exception in thread Thread-3 (gentask):
  File "/content/text-generation-webui/modules/text_generation.py", line 407, in generate_reply_HF
    if output[-1] in eos_token_ids:
Traceback (most recent call last):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/text-generation-webui/modules/text_generation.py", line 403, in generate_reply_HF
    with generate_with_streaming(**generate_params) as generator:
  File "/content/text-generation-webui/modules/callbacks.py", line 94, in __exit__
    clear_torch_cache()
  File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Output generated in 4.44 seconds (0.90 tokens/s, 4 tokens, context 359, seed 1331220601)
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/content/text-generation-webui/modules/callbacks.py", line 68, in gentask
    clear_torch_cache()
  File "/content/text-generation-webui/modules/callbacks.py", line 105, in clear_torch_cache
    torch.cuda.empty_cache()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/memory.py", line 192, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

System Info

Google Colab
GregorioBrc added the bug label on Oct 30, 2024.