
[Bug] Issues when loading the AWQ quantized version of the 8B model: InternVL3-8B-AWQ #1070

Open
@vl199967

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I tried to run inference with the InternVL3-8B-AWQ model from JupyterLab on Alicloud, but I keep hitting `RuntimeError: CUDA error: device-side assert triggered`. This does not happen when I use the regular unquantized models (InternVL3-2B/8B, etc.). Why?

Some environment Information:
Nvidia A10
CUDA 12.4
Python 3.1.1
InternVL3-8B-AWQ
PyTorch 2.6.0
Modelscope 1.26.0
Ubuntu 22.04

Reproduction

I downloaded the model from ModelScope with this command (I have Git LFS installed):

git clone https://www.modelscope.cn/OpenGVLab/InternVL3-8B-AWQ.git
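
(The same repository can presumably also be fetched through the ModelScope SDK; this is just an equivalent download path, not what I actually ran:)

# Alternative download path via the ModelScope SDK (not what I actually ran).
from modelscope import snapshot_download

model_dir = snapshot_download('OpenGVLab/InternVL3-8B-AWQ')
print(model_dir)  # local cache directory containing the checkpoint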

Then I defined some helper functions for image preprocessing and device mapping, the same as in the quick-start tutorials on Hugging Face and ModelScope:

import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from modelscope import AutoModel, AutoModelForCausalLM, AutoTokenizer, AutoConfig, AwqConfig, BitsAndBytesConfig
from awq import AutoAWQForCausalLM
from xinference import XinferenceModel
import os

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map
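
As a quick sanity check (my own addition, not part of the original notebook, assuming a single A10 so torch.cuda.device_count() returns 1), the resulting device map can be printed before loading:

# Hypothetical sanity check of the device map produced by split_model above.
# On a single-GPU machine every module should land on device 0.
dm = split_model('./InternVL3-8B-AWQ')
print(sorted(set(dm.values())))   # expected: [0] with one visible GPU
print(len(dm))                    # number of mapped modules, for reference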

For CUDA debugging, I also set:

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
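
One caveat I'm aware of (my own note, not from the original run): CUDA_LAUNCH_BLOCKING generally has to be set before the CUDA context is initialized, so in a notebook it is safer to set it at the very top of the first cell, before importing torch or touching the GPU:

# Set before importing torch / before any CUDA call in this process,
# otherwise the variable may have no effect on an already-initialized CUDA context.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch  # imported only after the environment variable is in place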

Then I loaded the model using:

path = "./InternVL3-8B-AWQ"
device_map = split_model('InternVL3-8B-AWQ')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
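
The tokenizer load is not shown above; I load it the same way as in the InternVL quick start (assumption: use_fast=False, as in the official examples):

# Tokenizer load, mirroring the official quick start (not shown in the original snippet).
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)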

Loading the model throws a warning about unused checkpoint weights, which I suspect is related to the issue, but the model is loaded successfully anyway:

Some weights of the model checkpoint at ./InternVL3-8B-AWQ were not used when initializing InternVLChatModel: ['language_model.model.layers.0.mlp.down_proj.qweight', 'language_model.model.layers.0.mlp.down_proj.qzeros', 'language_model.model.layers.0.mlp.down_proj.scales', 'language_model.model.layers.0.mlp.gate_proj.qweight', 'language_model.model.layers.0.mlp.gate_proj.qzeros', 'language_model.model.layers.0.mlp.gate_proj.scales', 'language_model.model.layers.0.mlp.up_proj.qweight', 'language_model.model.layers.0.mlp.up_proj.qzeros', 'language_model.model.layers.0.mlp.up_proj.scales', 'language_model.model.layers.0.self_attn.k_proj.qweight', 'language_model.model.layers.0.self_attn.k_proj.qzeros', 'language_model.model.layers.0.self_attn.k_proj.scales', .................] 

Checking the size of the model, it is still massive; it doesn't look like it has been quantized at all:

print(f"Internvl3-8b-awq model size:{model.get_memory_footprint():,} bytes")
Internvl3-8b-awq model size:15,888,747,776 bytes
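
Given the unused-weight warning above, here is a minimal check (my own sketch, not something from the original run) of whether any AWQ tensors actually made it into the live module; if the checkpoint had been loaded into quantized layers, qweight/qzeros/scales tensors should be present:

# My own sanity-check sketch: count AWQ-specific tensors in the loaded model.
awq_params = [n for n, _ in model.named_parameters() if 'qweight' in n]
awq_buffers = [n for n, _ in model.named_buffers() if 'qweight' in n]
print(f"qweight parameters: {len(awq_params)}, qweight buffers: {len(awq_buffers)}")
# Zero in both lists would mean the AWQ weights were discarded at load time,
# which matches the ~15.9 GB bf16 footprint printed above.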

The real problem occurred when I tried to run inference with the model:

generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )

# (Translation: "Carefully identify the color of the license plate of the vehicle inside the red detection box, and judge according to the rule whether the vehicle is parked legally. Rule: vehicles with blue plates are not allowed to park in spaces with green-painted ground; vehicles with green plates are allowed to park in such spaces. Ignore the classification label text on the detection box and make your own judgment.")
prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色,并按照规矩判断此车辆是否合规停泊。规矩:挂有蓝色牌照的车辆不允许停在地面为绿色的车位,挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响,自主作出判断!'

image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()

response = model.chat(tokenizer, image, prompt, generation_config)

print(f"Response: {response}")
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 5
      1 generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )
      3 prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色,并按照规矩判断此车辆是否合规停泊。规矩:挂有蓝色牌照的车辆不允许停在地面为绿色的车位,挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响,自主作出判断!'
----> 5 image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()
      7 response = model.chat(tokenizer, image, prompt, generation_config)
      9 print(f"Response: {response}")

This error does not occur when I run a normal unquantized version of the model (e.g. InternVL3-2B). Has anyone run into similar issues? What am I doing wrong here?

Environment

Python 3.1.1
Nvidia A10
InternVL3-8B-AWQ
PyTorch 2.6.0 
Modelscope 1.26.0
Ubuntu 22.04
CUDA 12.4

Error traceback

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 5
      1 generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )
      3 prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色,并按照规矩判断此车辆是否合规停泊。规矩:挂有蓝色牌照的车辆不允许停在地面为绿色的车位,挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响,自主作出判断!'
----> 5 image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()
      7 response = model.chat(tokenizer, image, prompt, generation_config)
      9 print(f"Response: {response}")
