
Out of range integral type conversion attempted #6511

Open
1 task done
Stargate256 opened this issue Nov 2, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@Stargate256

Stargate256 commented Nov 2, 2024

Describe the bug

When running inference over the OpenAI-compatible API with Perplexica or avante.nvim, this error sometimes appears; after it happens, generation stops working until I restart the program. (It worked fine with Open WebUI.)

Is there an existing issue for this?

  • I have searched the existing issues

Reproduction

  • Set up the program on Debian 12
  • Run Qwen2.5-32B-Instruct-4.65bpw-h6-exl2
  • Run inference over the OpenAI-compatible API (Perplexica, avante.nvim, or something else; a minimal request is sketched below)
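
A minimal request that exercises the same OpenAI-compatible endpoint is sketched below. It assumes the "openai" extension is listening on the default address shown later in the logs (http://0.0.0.0:5000) and that a model is already loaded; the exact payload Perplexica or avante.nvim sends may differ.

# Sketch only: a plain chat-completion request against the
# OpenAI-compatible API of text-generation-webui on the default port.
import requests

url = "http://0.0.0.0:5000/v1/chat/completions"
payload = {
    "messages": [{"role": "user", "content": "Write a short haiku about autumn."}],
    "max_tokens": 200,
    "stream": False,
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])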

Screenshot

(Three hyprshot screenshots attached, taken 2024-11-02 at 23:36:44, 23:37:22, and 23:37:59.)

Logs

Traceback (most recent call last):
  File "/root/llm/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3999, in decode
    return self._decode(
           ^^^^^^^^^^^^^
  File "/root/llm/text-generation-webui/installer_files/env/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: out of range integral type conversion attempted
Output generated in 2.50 seconds (7.99 tokens/s, 20 tokens, context 1131, seed 1200683755)
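
The OverflowError itself appears to come from the Rust "tokenizers" backend, which converts every token id to an unsigned 32-bit integer before decoding; a negative id (for example a -1 placeholder that the ExLlamav2_HF wrapper may leave in the output buffer) raises exactly this message. The snippet below is only an illustration of the failure mode and a possible defensive filter, not the project's actual fix; the model name and the -1 id are assumptions.

# Illustration: a negative token id makes the fast tokenizer's decode
# raise "OverflowError: out of range integral type conversion attempted".
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
output_ids = [151644, 872, -1]  # -1 stands in for an out-of-range id

try:
    tokenizer.decode(output_ids, skip_special_tokens=True)
except OverflowError as exc:
    print(exc)

# Possible workaround: drop ids outside the valid range before decoding.
safe_ids = [i for i in output_ids if 0 <= i < len(tokenizer)]
print(tokenizer.decode(safe_ids, skip_special_tokens=True))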

System Info

Environment: Proxmox VE VM
CPU: 6 virtual cores of Xeon E5-2697 v3
GPU: 2x Nvidia Tesla P100 16GB (PCIe passthrough)
OS: Debian 12
LLM: Qwen2.5-32B-Instruct-4.65bpw-h6-exl2
@Stargate256 Stargate256 added the bug Something isn't working label Nov 2, 2024
@bashlk

bashlk commented Nov 15, 2024

I am running into the same issue, also on Debian 12, on an older Intel CPU, while trying to run a Qwen2.5 exl2 model over the OpenAI API (with Cline and Aider). In my case a few requests work, then this error appears, after which the responses contain few or no characters. Unloading and reloading the model doesn't seem to help.

I'm running the web UI directly on physical hardware. I tried upgrading all the packages on my system, which brought in a new kernel version, but nothing changed after the upgrade.

Logs

13:27:44-948132 INFO     Starting Text generation web UI                        
13:27:44-952671 WARNING                                                         
                         You are potentially exposing the web UI to the entire  
                         internet without any access password.                  
                         You can create one with the "--gradio-auth" flag like  
                         this:                                                  
                                                                                
                         --gradio-auth username:password                        
                                                                                
                         Make sure to replace username:password with your own.  
13:27:44-954803 INFO     Loading the extension "openai"                         
13:27:45-089753 INFO     OpenAI-compatible API URL:                             
                                                                                
                         http://0.0.0.0:5000                                    
                                                                                

Running on local URL:  http://0.0.0.0:7860

13:27:51-237158 INFO     Loading                                                
                         "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"       
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
13:27:58-621663 INFO     Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
                         in 7.38 seconds.                                       
13:27:58-623069 INFO     LOADER: "ExLlamav2_HF"                                 
13:27:58-624496 INFO     TRUNCATION LENGTH: 8000                                
13:27:58-625390 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model     
                         metadata)"                                             
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:590: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:600: UserWarning: `do_sample` is set to `False`. However, `min_p` is set to `0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `min_p`.
  warnings.warn(
Output generated in 4.25 seconds (21.15 tokens/s, 90 tokens, context 896, seed 2117395925)
Output generated in 2.80 seconds (22.11 tokens/s, 62 tokens, context 1011, seed 1939351019)
Traceback (most recent call last):
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
    return self._decode(
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 2.30 seconds (21.29 tokens/s, 49 tokens, context 1274, seed 1253585262)
Traceback (most recent call last):
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
    return self._decode(
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 0.74 seconds (1.36 tokens/s, 1 tokens, context 1153, seed 8088406)
13:32:10-609612 INFO     Loading                                                
                         "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"       
13:32:16-405005 INFO     Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"
                         in 5.79 seconds.                                       
13:32:16-407239 INFO     LOADER: "ExLlamav2_HF"                                 
13:32:16-408041 INFO     TRUNCATION LENGTH: 8000                                
13:32:16-408885 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model     
                         metadata)"                                             
Traceback (most recent call last):
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 410, in generate_reply_HF
    new_content = get_reply_from_output_ids(output, state, starting_from=starting_from)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 271, in get_reply_from_output_ids
    reply = decode(output_ids[starting_from:], state['skip_special_tokens'] if state else True)
  File "/home/gradio/text-generation-webui/modules/text_generation.py", line 181, in decode
    return shared.tokenizer.decode(output_ids, skip_special_tokens=skip_special_tokens)
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 4004, in decode
    return self._decode(
  File "/home/gradio/text-generation-webui/installer_files/env/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
    text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
OverflowError: out of range integral type conversion attempted
Output generated in 1.30 seconds (0.77 tokens/s, 1 tokens, context 1176, seed 304963644)

lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          36 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i5-3570K CPU @ 3.40GHz
    CPU family:           6
    Model:                58
    Thread(s) per core:   1
    Core(s) per socket:   4
    Socket(s):            1
    Stepping:             9
    CPU(s) scaling MHz:   42%
    CPU max MHz:          3800.0000
    CPU min MHz:          1600.0000
    BogoMIPS:             6799.95
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mm
                          x fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_go
                          od nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est 
                          tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rd
                          rand lahf_lm cpuid_fault epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid
                           fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
  L2:                     1 MiB (4 instances)
  L3:                     6 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: VMX disabled
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
  Mds:                    Mitigation; Clear CPU buffers; SMT disabled
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS
                           Not affected; BHI Not affected
  Srbds:                  Vulnerable: No microcode
  Tsx async abort:        Not affected

free -m

               total        used        free      shared  buff/cache   available
Mem:           23982        1365       12767           4       10200       22617
Swap:           7999           0        7999

nvidia-smi

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        On  | 00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8              11W / 170W |    181MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A       777      G   /usr/lib/xorg/Xorg                          167MiB |
|    0   N/A  N/A       964      G   /usr/bin/gnome-shell                          8MiB |
+---------------------------------------------------------------------------------------+

uname -a

Linux bash-3lpc 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux

python3 --version

Python 3.11.2

conda --version

conda 23.5.2

cat /etc/debian_version

12.8

git rev-parse HEAD

cc8c7ed2093cbc747e7032420eae14b5b3c30311

@bashlk

bashlk commented Nov 15, 2024

Actually, it seems like the ExLlamav2 loader works. Previously I was using the auto-suggested ExLlamav2_HF loader.

Logs (prompts were sent from Aider)

14:04:59-537587 INFO     Loading "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25"                                    
14:05:06-734713 INFO     Loaded "bartowski_Qwen2.5-Coder-14B-Instruct-exl2_4_25" in 7.20 seconds.                    
14:05:06-736142 INFO     LOADER: "ExLlamav2"                                                                         
14:05:06-737239 INFO     TRUNCATION LENGTH: 8000                                                                     
14:05:06-738057 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                               
Output generated in 6.36 seconds (23.59 tokens/s, 150 tokens, context 1223, seed 186741678)
Output generated in 6.44 seconds (28.58 tokens/s, 184 tokens, context 791, seed 1585153886)
Output generated in 8.19 seconds (28.21 tokens/s, 231 tokens, context 1100, seed 1836764177)
Output generated in 7.53 seconds (29.60 tokens/s, 223 tokens, context 1371, seed 991524308)
Output generated in 10.12 seconds (29.56 tokens/s, 299 tokens, context 991, seed 455187287)
Output generated in 7.84 seconds (14.41 tokens/s, 113 tokens, context 4991, seed 1045094786)
Output generated in 1.10 seconds (16.42 tokens/s, 18 tokens, context 5210, seed 2042193150)
Output generated in 1.02 seconds (17.63 tokens/s, 18 tokens, context 5334, seed 1359728911)
Output generated in 1.10 seconds (19.17 tokens/s, 21 tokens, context 5458, seed 1694625255)
Output generated in 1.03 seconds (17.47 tokens/s, 18 tokens, context 5584, seed 1240670815)
Output generated in 1.10 seconds (19.02 tokens/s, 21 tokens, context 5708, seed 951578707)
Output generated in 1.11 seconds (18.96 tokens/s, 21 tokens, context 5834, seed 498927830)
Output generated in 1.11 seconds (18.88 tokens/s, 21 tokens, context 5960, seed 131397278)
Output generated in 1.12 seconds (18.82 tokens/s, 21 tokens, context 6086, seed 521276101)
Output generated in 1.13 seconds (18.59 tokens/s, 21 tokens, context 6212, seed 995108441)
Output generated in 1.13 seconds (18.54 tokens/s, 21 tokens, context 6338, seed 143805776)
Output generated in 2.15 seconds (22.81 tokens/s, 49 tokens, context 6464, seed 2070214832)
Output generated in 3.90 seconds (28.23 tokens/s, 110 tokens, context 4991, seed 805553205)
Output generated in 1.10 seconds (16.41 tokens/s, 18 tokens, context 5207, seed 1120525451)
Output generated in 1.01 seconds (17.75 tokens/s, 18 tokens, context 5331, seed 693321549)
Output generated in 1.10 seconds (19.16 tokens/s, 21 tokens, context 5455, seed 763349559)
Output generated in 0.86 seconds (16.28 tokens/s, 14 tokens, context 5581, seed 1450090146)
Output generated in 1.54 seconds (12.37 tokens/s, 19 tokens, context 1130, seed 1622652563)
Output generated in 7.42 seconds (30.73 tokens/s, 228 tokens, context 1175, seed 1043527426)
Output generated in 11.83 seconds (30.60 tokens/s, 362 tokens, context 762, seed 1054832108)
Output generated in 8.22 seconds (27.37 tokens/s, 225 tokens, context 1430, seed 600097550)
Output generated in 11.96 seconds (30.26 tokens/s, 362 tokens, context 866, seed 832375840)
Output generated in 8.24 seconds (28.16 tokens/s, 232 tokens, context 1188, seed 1514631067)
Output generated in 11.94 seconds (30.33 tokens/s, 362 tokens, context 830, seed 1119770377)
Output generated in 7.61 seconds (27.85 tokens/s, 212 tokens, context 1206, seed 83295453)
Output generated in 13.99 seconds (30.59 tokens/s, 428 tokens, context 836, seed 1989837235)
Output generated in 7.92 seconds (27.92 tokens/s, 221 tokens, context 1252, seed 1324992220)
Output generated in 15.83 seconds (30.70 tokens/s, 486 tokens, context 874, seed 151775036)
Output generated in 11.13 seconds (29.03 tokens/s, 323 tokens, context 1232, seed 746128985)

@jrruethe

Just wanted to confirm that I have the same issue.

Model: bartowski/Qwen2.5-Coder-32B-Instruct-exl2 @ 4.25
Loader: ExLlamav2_HF
