[Issue]: Memory Access Fault When Running DeepSeek-R1 on vLLM on MI300X #1143

@vllmellm

Problem Description

If AITER_ENABLE_VSKIP is unset, aiter sets it to true by default, which causes a memory access fault when running DeepSeek-R1 on vLLM on MI300X.
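Before launching the server, it can help to confirm whether the variable is explicitly set in the shell that will start vLLM. The snippet below is a minimal sketch that only inspects the environment; the helper name `vskip_state` is hypothetical, and nothing here assumes which values aiter accepts for the variable.

```shell
# Report whether AITER_ENABLE_VSKIP is explicitly set in this shell.
# Per the report above, an unset variable ends up as "true" inside aiter.
# vskip_state is a hypothetical helper name, not part of aiter or vLLM.
vskip_state() {
    if [ -z "${AITER_ENABLE_VSKIP+x}" ]; then
        echo "unset"
    else
        echo "set:${AITER_ENABLE_VSKIP}"
    fi
}

vskip_state
```

Running this in the same container shell used for `vllm serve` shows which case applies to a given run.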

Error details:

:0:rocdevice.cpp            :3675: 2139490044663 us:  Callback: Queue 0x7ee7b4200000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
Kernel Name: _ZN5aiter50fmoe_bf16_blockscaleFp8_g1u1_vs_silu_1tg_ps_32x256E
VGPU=0x1305a780 SWq=0x7f17d4008000, HWq=0x7ee7b4200000, id=1
	Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=0
	grid=[77824, 1, 1], workgroup=[256, 1, 1]
	private_seg_size=0, group_seg_size=65536
	kernel_obj=0x7ee7403f1d00, kernarg_address=0x0x7ed0950bf400
	completion_signal=0x0, correlation_id=0
	rptr=325385, wptr=325387
Kernel Name: _ZN5aiter50fmoe_bf16_blockscaleFp8_g1u1_vs_silu_1tg_ps_32x256E
VGPU=0xcc00ea0 SWq=0x7f17d4008000, HWq=0x7ee7b4200000, id=1
	Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=0
	grid=[77824, 1, 1], workgroup=[256, 1, 1]
	private_seg_size=0, group_seg_size=65536
	kernel_obj=0x7ee7403f1d00, kernarg_address=0x0x7ed0950bf400
	completion_signal=0x0, correlation_id=0
	rptr=325385, wptr=325387
Kernel Name: _ZN5aiter50fmoe_bf16_blockscaleFp8_g1u1_vs_silu_1tg_ps_32x256E
VGPU=0x24861940 SWq=0x7f17d4008000, HWq=0x7ee7b4200000, id=1
	Dispatch Header = 0xb02 (type=2, barrier=1, acquire=1, release=1), setup=0
	grid=[77824, 1, 1], workgroup=[256, 1, 1]
	private_seg_size=0, group_seg_size=65536
	kernel_obj=0x7ee7403f1d00, kernarg_address=0x0x7ed0950bf400
	completion_signal=0x0, correlation_id=0
	rptr=325385, wptr=325387

[AITER] /app/upstreambugfix/aiter20251007/aiter/jit/build/module_moe_asm/build/srcs/asm_fmoe.hip:250 fail to call hipModuleLaunchKernel( kernel_func, gdx, gdy, gdz, bdx, 1, 1, 0, stream, nullptr, (void**)&config) ---> [HIP error](an illegal memory access was encountered)
Error code 700
Error code 700

With an older aiter commit (6b586ae), the kernel that works on MI300X is _ZN5aiter52fmoe_bf16_blockscaleFp8_g1u1_novs_silu_1tg_ps_32x256E, that is, the novs variant rather than the vs variant named in the fault above.
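The faulting and working kernels differ only in the `_vs_` versus `_novs_` part of the mangled name, so a log can be summarized by that substring. The following is a rough sketch, assuming the `Kernel Name:` lines appear in the captured log exactly as in the dump above; `kernel_variant` and `summarize_kernels` are hypothetical helper names.

```shell
# Classify a mangled aiter kernel name by the vs/novs substring.
kernel_variant() {
    case "$1" in
        *_novs_*) echo "novs" ;;
        *_vs_*)   echo "vs" ;;
        *)        echo "other" ;;
    esac
}

# Read log text on stdin and print each distinct kernel name with its variant.
summarize_kernels() {
    sed -n 's/^Kernel Name: *//p' | sort -u | while IFS= read -r name; do
        printf '%s %s\n' "$(kernel_variant "$name")" "$name"
    done
}

# Example with the kernel name from the fault dump above.
printf 'Kernel Name: %s\n' \
    "_ZN5aiter50fmoe_bf16_blockscaleFp8_g1u1_vs_silu_1tg_ps_32x256E" \
    | summarize_kernels
```

To apply it to a real run, pipe the server log in, for example `summarize_kernels < logs/server.log`.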

Additional information:

On MI308, the kernel being called is:

 _ZN5aiter59fmoe_stage1_bf16_pertokenFp8_blockscale_g1u1_64x128_2tg_pf3E (
80,128,7168,256,256,8,ActivationType.Silu,torch.bfloat16,torch.float8_e4m3fnuz,torch.float8_e4m3fnuz,QuantType.per_1x128,1,0,64,0,400.1271,_ZN5aiter59fmoe_stage1_bf16_pertokenFp8_blockscale_g1u1_64x128_2tg_pf3E,5.0%,873.4093,moe_ck2stages_gemm2_256x64x128x128_1x4_MulABScaleExpertWeightA8W8blkscale_v3_Nswizzle0_Quant4_MulRoutedWeight1_F8_F8_B16,15.5%,1273.5364,0,8.85,1108.75 

A proposed solution can be found in #1136

Operating System

NAME="Ubuntu" VERSION="22.04.5 LTS (Jammy Jellyfish)"

CPU

AMD EPYC 9654 96-Core Processor

GPU

amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-

ROCm Version

7.0

ROCm Component

No response

Steps to Reproduce

1- Start a container

docker run -it \
   --network=host \
   --group-add=video \
   --ipc=host \
   --cap-add=SYS_PTRACE \
   --shm-size=16g \
   --security-opt seccomp=unconfined \
   --device /dev/kfd \
   --device /dev/dri \
   --name "name"  \
   rocm/vllm-dev:nightly_main_20250924 \
   bash

2- Install the latest versions of vLLM and aiter
3- Serve deepseek-ai/DeepSeek-R1

VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
 --tensor-parallel-size 8 \
 --block-size 1 \
 --trust-remote-code \
 --no-enable-prefix-caching \
 --max-model-len 32768 \
 --port 8010 \
 > logs/server.log 2>&1
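Because the command above redirects both stdout and stderr to `logs/server.log`, the fault can be confirmed by searching that file for the two error strings quoted in the error details. This is a small sketch under that assumption; `fault_found` is a hypothetical helper name.

```shell
# Return success if the given log file contains either error string
# quoted in the error details of this report.
fault_found() {
    grep -q \
        -e 'HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION' \
        -e 'an illegal memory access was encountered' \
        "$1"
}

# Default to the log path used by the serve command above.
log="${1:-logs/server.log}"
if [ -f "$log" ] && fault_found "$log"; then
    echo "fault signature found in $log"
else
    echo "no fault signature in $log"
fi
```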

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
