[Bug] Issues when loading the AWQ quantized version of the 8B model: InternVL3-8B-AWQ #1070

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

### Describe the bug

I tried to run inference with the InternVL3-8B-AWQ model in JupyterLab on Alibaba Cloud, but I keep hitting `CUDA error: device-side assert triggered`. This does not happen with the unquantized models (InternVL3-2B, InternVL3-8B, etc.). Why?

Some environment Information:
Nvidia A10 
CUDA 12.4
Python 3.1.1
InternVL3-8B-AWQ
PyTorch 2.6.0 
Modelscope 1.26.0 
Ubuntu 22.04




### Reproduction

**I downloaded the model from ModelScope with this command (I have Git LFS installed):**

`git clone https://www.modelscope.cn/OpenGVLab/InternVL3-8B-AWQ.git`

**Then I defined some helper functions for image preprocessing and device mapping, the same as in the quick-start tutorials on Hugging Face and ModelScope:**

```python
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from modelscope import AutoModel, AutoModelForCausalLM, AutoTokenizer, AutoConfig, AwqConfig,BitsAndBytesConfig
from awq import AutoAWQForCausalLM
from xinference import XinferenceModel
import os

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

```

**For CUDA debugging, I enabled synchronous kernel launches:**

```python
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
```


**Then I loaded the model using:**

```python
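path = "./InternVL3-8B-AWQ"
# Build the device map from the same local path that is passed to from_pretrained.
device_map = split_model(path)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
# The tokenizer load was not shown in the original report; this line is assumed
# to follow the quick-start tutorial.
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)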
**This emits a warning about unused checkpoint weights, which I think could be related to the issue, but the model still loaded successfully:**

```console
Some weights of the model checkpoint at ./InternVL3-8B-AWQ were not used when initializing InternVLChatModel: ['language_model.model.layers.0.mlp.down_proj.qweight', 'language_model.model.layers.0.mlp.down_proj.qzeros', 'language_model.model.layers.0.mlp.down_proj.scales', 'language_model.model.layers.0.mlp.gate_proj.qweight', 'language_model.model.layers.0.mlp.gate_proj.qzeros', 'language_model.model.layers.0.mlp.gate_proj.scales', 'language_model.model.layers.0.mlp.up_proj.qweight', 'language_model.model.layers.0.mlp.up_proj.qzeros', 'language_model.model.layers.0.mlp.up_proj.scales', 'language_model.model.layers.0.self_attn.k_proj.qweight', 'language_model.model.layers.0.self_attn.k_proj.qzeros', 'language_model.model.layers.0.self_attn.k_proj.scales', .................] 

```
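To make the warning above easier to read, the unused keys can be grouped by their final name component. The following is a minimal sketch in plain Python; `summarize_unused_keys` and the two short key lists are hypothetical and only for illustration. It counts how many checkpoint tensors have no matching model parameter and how many of those carry the `qweight`/`qzeros`/`scales` suffixes seen in the warning:

```python
from collections import Counter

# Suffixes of the packed tensors that appear in the warning above.
QUANT_SUFFIXES = ("qweight", "qzeros", "scales")


def summarize_unused_keys(checkpoint_keys, model_keys):
    """Count checkpoint tensors with no matching model parameter, grouped by suffix."""
    model_keys = set(model_keys)
    unused = [k for k in checkpoint_keys if k not in model_keys]
    by_suffix = Counter(k.rsplit(".", 1)[-1] for k in unused)
    quantized_unused = sum(by_suffix[s] for s in QUANT_SUFFIXES)
    return {
        "unused": len(unused),
        "by_suffix": dict(by_suffix),
        "quantized_unused": quantized_unused,
    }


# Hypothetical key lists, shortened for illustration.
checkpoint_keys = [
    "language_model.model.layers.0.mlp.down_proj.qweight",
    "language_model.model.layers.0.mlp.down_proj.qzeros",
    "language_model.model.layers.0.mlp.down_proj.scales",
    "language_model.model.norm.weight",
]
model_keys = [
    "language_model.model.layers.0.mlp.down_proj.weight",
    "language_model.model.norm.weight",
]

summary = summarize_unused_keys(checkpoint_keys, model_keys)
print(summary)
```

In a real session the model side could come from `model.state_dict().keys()`. If nearly all unused keys end in these three suffixes, that would suggest the quantized tensors were skipped during loading rather than mapped onto quantized layers.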

**The model's memory footprint is still very large; it does not look quantized at all:**

```python
print(f"Internvl3-8b-awq model size:{model.get_memory_footprint():,} bytes")
```
```
Internvl3-8b-awq model size:15,888,747,776 bytes

```
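As a rough sanity check on that number, the arithmetic below compares the reported footprint against what bf16 and 4-bit storage would imply. This is a back-of-the-envelope sketch, assuming the AWQ checkpoint stores weights at 4 bits each; the constant names are my own.

```python
# Value printed by model.get_memory_footprint() in the report above.
reported_bytes = 15_888_747_776

BYTES_PER_BF16_PARAM = 2     # bfloat16 stores each weight in 16 bits
BYTES_PER_INT4_PARAM = 0.5   # assumption: AWQ packs weights at 4 bits each

# If every parameter were held in bf16, the footprint implies this many parameters.
implied_params = reported_bytes // BYTES_PER_BF16_PARAM

# Rough size if those same parameters were all stored at 4 bits. This ignores
# scales, zero points, and any layers that stay unquantized, so it is only a
# ballpark figure.
rough_int4_bytes = int(implied_params * BYTES_PER_INT4_PARAM)

print(f"implied bf16 parameter count: {implied_params / 1e9:.2f}B")
print(f"rough all-4-bit footprint:    {rough_int4_bytes / 1e9:.2f} GB")
```

The footprint corresponds to about 7.94 billion parameters at 2 bytes each, which is what an 8B model held entirely in bf16 would occupy. A model whose language layers were actually loaded as 4-bit tensors would be expected to be much smaller.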


**The real problem occurred when I tried to run inference with the model:** 

```python
generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )

prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色，并按照规矩判断此车辆是否合规停泊。规矩：挂有蓝色牌照的车辆不允许停在地面为绿色的车位，挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响，自主作出判断！'

image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()

response = model.chat(tokenizer, image, prompt, generation_config)

print(f"Response: {response}")
```

```console
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 5
      1 generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )
      3 prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色，并按照规矩判断此车辆是否合规停泊。规矩：挂有蓝色牌照的车辆不允许停在地面为绿色的车位，挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响，自主作出判断！'
----> 5 image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()
      7 response = model.chat(tokenizer, image, prompt, generation_config)
      9 print(f"Response: {response}")

```


**This error does not occur when I run an unquantized version of the model (e.g., InternVL3-2B). Has anyone run into similar issues? What am I doing wrong here?**









### Environment

```Shell
Python 3.1.1
Nvidia A10
InternVL3-8B-AWQ
PyTorch 2.6.0 
Modelscope 1.26.0
Ubuntu 22.04
CUDA 12.4
```

### Error traceback

```Shell
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 5
      1 generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )
      3 prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色，并按照规矩判断此车辆是否合规停泊。规矩：挂有蓝色牌照的车辆不允许停在地面为绿色的车位，挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响，自主作出判断！'
----> 5 image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()
      7 response = model.chat(tokenizer, image, prompt, generation_config)
      9 print(f"Response: {response}")
```
