Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I tried to run inference with the InternVL3-8B-AWQ model in JupyterLab on Alicloud, but I keep hitting "CUDA error: device-side assert triggered". This does not happen with the regular unquantized models (InternVL3-2B, InternVL3-8B, etc.). Why?
Environment information:
Nvidia A10
CUDA 12.4
Python 3.1.1
InternVL3-8B-AWQ
PyTorch 2.6.0
Modelscope 1.26.0
Ubuntu 22.04
Reproduction
I downloaded the model from ModelScope with this command (I have Git LFS installed):
git clone https://www.modelscope.cn/OpenGVLab/InternVL3-8B-AWQ.git
Then I defined some helper functions for image preprocessing and device mapping, the same as in the quick start tutorials on Hugging Face and ModelScope:
import math
import numpy as np
import torch
import torchvision.transforms as T
from decord import VideoReader, cpu
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from modelscope import AutoModel, AutoModelForCausalLM, AutoTokenizer, AutoConfig, AwqConfig,BitsAndBytesConfig
from awq import AutoAWQForCausalLM
from xinference import XinferenceModel
import os
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height
    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
def split_model(model_path):
    device_map = {}
    world_size = torch.cuda.device_count()
    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    num_layers = config.llm_config.num_hidden_layers
    # Since the first GPU will be used for ViT, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.model.rotary_emb'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
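The layer-count arithmetic in split_model can be checked without a GPU. The sketch below reproduces only that arithmetic. `layers_per_gpu` is a hypothetical helper name, and the value 28 for num_hidden_layers is an assumption used for illustration, not something read from the checkpoint's config.

```python
import math


def layers_per_gpu(num_layers, world_size):
    """Mirror the per-GPU layer counts computed in split_model (no CUDA needed)."""
    # The first GPU also hosts the ViT, so it is counted as half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(counts[0] * 0.5)
    return counts


# 28 is an assumed layer count, chosen only to make the example concrete.
for gpus in (1, 2):
    print(gpus, "GPU(s):", layers_per_gpu(28, gpus))
```

With one visible GPU the formula assigns every layer to device 0, so on a single-card machine the device map should not change where anything is placed.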
# For CUDA debugging
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
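A note on ordering, as a minimal sketch: the CUDA runtime reads this variable when it initializes, so the safest place to set it is at the very top of the notebook, before torch is imported or any CUDA call is made. `enable_blocking_launches` is a hypothetical helper name used only for this illustration.

```python
import os


def enable_blocking_launches():
    """Ask CUDA to run kernels synchronously so errors surface at the failing call."""
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


# Call this in the first cell, before importing torch or touching the GPU.
enable_blocking_launches()
```

After a device-side assert, later CUDA calls in the same process generally keep failing until the process is restarted. In a notebook this can make a traceback point at an unrelated line, so restarting the kernel before re-running is worth trying.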
Then I loaded the model using:
path = "./InternVL3-8B-AWQ"
device_map = split_model('InternVL3-8B-AWQ')
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
    device_map=device_map).eval()
This emits a warning about weight initialization, which I think could be related to the issue, but the model loaded successfully anyway:
Some weights of the model checkpoint at ./InternVL3-8B-AWQ were not used when initializing InternVLChatModel: ['language_model.model.layers.0.mlp.down_proj.qweight', 'language_model.model.layers.0.mlp.down_proj.qzeros', 'language_model.model.layers.0.mlp.down_proj.scales', 'language_model.model.layers.0.mlp.gate_proj.qweight', 'language_model.model.layers.0.mlp.gate_proj.qzeros', 'language_model.model.layers.0.mlp.gate_proj.scales', 'language_model.model.layers.0.mlp.up_proj.qweight', 'language_model.model.layers.0.mlp.up_proj.qzeros', 'language_model.model.layers.0.mlp.up_proj.scales', 'language_model.model.layers.0.self_attn.k_proj.qweight', 'language_model.model.layers.0.self_attn.k_proj.qzeros', 'language_model.model.layers.0.self_attn.k_proj.scales', .................]
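The skipped keys end in qweight, qzeros and scales, which are the tensor names AWQ-style packed linear layers typically store. A small sketch for summarizing such a warning follows. The helper names are hypothetical, and the key list is just the first few entries copied from the warning above.

```python
from collections import Counter

# Suffixes usually associated with AWQ-packed linear layers (an assumption
# about the checkpoint format, based on the key names in the warning).
AWQ_SUFFIXES = ("qweight", "qzeros", "scales")


def count_by_suffix(unused_keys):
    """Count unused checkpoint keys by their final dotted component."""
    return Counter(key.rsplit(".", 1)[-1] for key in unused_keys)


def looks_like_dropped_awq(unused_keys):
    """True if every AWQ-style suffix appears among the unused keys."""
    counts = count_by_suffix(unused_keys)
    return all(counts[suffix] > 0 for suffix in AWQ_SUFFIXES)


unused = [
    "language_model.model.layers.0.mlp.down_proj.qweight",
    "language_model.model.layers.0.mlp.down_proj.qzeros",
    "language_model.model.layers.0.mlp.down_proj.scales",
]
print(count_by_suffix(unused), looks_like_dropped_awq(unused))
```

If all three suffixes show up as unused, one plausible reading is that the quantized tensors were never loaded into the model, which would leave the corresponding linear layers without their trained weights.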
Checking the size of the model, it still looks massive; it doesn't appear to have been quantized at all:
print(f"Internvl3-8b-awq model size:{model.get_memory_footprint():,} bytes")
Internvl3-8b-awq model size:15,888,747,776 bytes
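As a back-of-the-envelope check of that number: dividing the reported footprint by 2 bytes (the size of one bfloat16 value) gives roughly 7.9 billion parameters, about what an unquantized 8B model would occupy. The nominal count of 8e9 parameters below is an assumption taken from the model name, not a measured value.

```python
REPORTED_FOOTPRINT_BYTES = 15_888_747_776  # value printed by get_memory_footprint()


def params_at(bytes_per_param, footprint=REPORTED_FOOTPRINT_BYTES):
    """Number of parameters the footprint would hold at a given storage width."""
    return footprint / bytes_per_param


def bytes_per_param(num_params, footprint=REPORTED_FOOTPRINT_BYTES):
    """Average storage per parameter for an assumed parameter count."""
    return footprint / num_params


print(f"{params_at(2):,.0f} parameters if everything is bfloat16")
print(f"{bytes_per_param(8e9):.2f} bytes per parameter for a nominal 8e9 parameters")
```

Close to 2 bytes per parameter is consistent with the weights sitting in memory as bfloat16. A 4-bit checkpoint loaded in packed form would normally be expected to come in well below that.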
The real problem occurred when I tried to run inference with the model:
generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )
prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色,并按照规矩判断此车辆是否合规停泊。规矩:挂有蓝色牌照的车辆不允许停在地面为绿色的车位,挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响,自主作出判断!'
image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()
response = model.chat(tokenizer, image, prompt, generation_config)
print(f"Response: {response}")
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[6], line 5
1 generation_config = dict(temperature=0.9, top_p=0.7, max_new_tokens=1024, do_sample=True, pad_token_id = tokenizer.eos_token_id )
3 prompt = '请仔细识别图中红色识别框中的车辆牌照的颜色,并按照规矩判断此车辆是否合规停泊。规矩:挂有蓝色牌照的车辆不允许停在地面为绿色的车位,挂有绿色车牌的车辆允许停在地面为绿色的车位上。一定要无视识别框上分类标签文字的影响,自主作出判断!'
----> 5 image = load_image('./蓝牌车占用充电桩/24.png', max_num=12).to(torch.bfloat16).cuda()
7 response = model.chat(tokenizer, image, prompt, generation_config)
9 print(f"Response: {response}")
This error does not occur when I run a regular unquantized version of the model (e.g. InternVL3-2B). Has anyone run into similar issues? What am I doing wrong here?