Name and Version
build_sys/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 7540 (85c40c9)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA, HIP, RPC
Hardware
RPC Server: Strix Halo 128GB, Ubuntu 24.04, ROCm 7.1.1 GA.
RPC Client: 5090, 64GB system RAM, Ubuntu 24.04, CUDA 13.
Models
-hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M
Problem description & steps to reproduce
When running llama-server with `-fa 1`, the RPC server (rpc-server on another machine) crashes with this model. I believe this is due to an unsupported K tensor shape; more details below.
Reproduction
[Working pair with -fa 0]
RPC Server: GGML_RPC_DEBUG=1 build_sys/bin/rpc-server -c -H 0.0.0.0 -p 50052
RPC Client: build2/bin/llama-server -hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M -ngl 99 -fa 0 -c 32768 --port 5000 --host 127.0.0.1 --alias mimo --jinja --rpc 192.168.50.90:50052 -ts 105,80 -ot "\.(3[4-9]|4[0-9])\.ffn_.*_exps=CPU" --verbose --fit off --no-mmap
[Crashing pair with -fa 1]
RPC Server: GGML_RPC_DEBUG=1 build_sys/bin/rpc-server -c -H 0.0.0.0 -p 50052
RPC Client: build2/bin/llama-server -hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M -ngl 99 -fa 1 -c 32768 --port 5000 --host 127.0.0.1 --alias mimo --jinja --rpc 192.168.50.90:50052 -ts 105,80 -ot "\.(3[4-9]|4[0-9])\.ffn_.*_exps=CPU" --verbose --fit off --no-mmap
Investigation so far
This crash does not occur with `-fa 1` when running the model on a single machine with mmap and SSD offloading. I have slightly modified my copy of fattn.cu to print more information, since the default error message is just "fatal error".
With `ggml/src/ggml-cuda/fattn.cu`:
- `K->ne[0]` is sometimes/always `192` for this model in `ggml_cuda_get_best_fattn_kernel`. This makes the switch fall through to `default` for this model, returning `BEST_FATTN_KERNEL_NONE` (see the sketch after this list).
- This causes `ggml_cuda_flash_attn_ext` to abort with "fatal error".
- `K->ne[0]` is also `192` when running on a single machine, but that does not crash, so I assume `ggml_cuda_flash_attn_ext` is not called in that case.
- The Vulkan backend runs with `-fa 1`, but at about 1/3 of the speed of `-fa 0`.
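For context, here is a minimal standalone sketch (not the actual llama.cpp source) of the behaviour described above, together with the kind of diagnostic prints I added locally. The enum variants, the helper name, and the list of supported head sizes are placeholders; only `K->ne[0] == 192`, `BEST_FATTN_KERNEL_NONE`, and the abort at fattn.cu:370 come from the real code and logs.

```cpp
// Standalone sketch, approximating the described dispatch: kernel selection
// switches on the K head size (K->ne[0]); an unhandled value such as 192 hits
// the default case, the "none" kernel is returned, and the flash-attention
// entry point then aborts with "fatal error".
#include <cstdint>
#include <cstdio>

enum best_fattn_kernel {        // placeholder enum; the real one has more variants
    BEST_FATTN_KERNEL_NONE = 0,
    BEST_FATTN_KERNEL_SOME = 1,
};

// Placeholder for ggml_cuda_get_best_fattn_kernel: the real function also looks
// at the device, batch size, GQA ratio, V head size, etc. The case list below is
// illustrative only, not the exact set of supported head sizes.
static best_fattn_kernel get_best_fattn_kernel_sketch(int64_t k_head_size) {
    switch (k_head_size) {
        case 64:
        case 80:
        case 96:
        case 112:
        case 128:
        case 256:
            return BEST_FATTN_KERNEL_SOME;
        default:
            // These two prints mirror the diagnostics I added to my fattn.cu;
            // they produce the "default fattn fail" / "Kne: 192" lines below.
            fprintf(stderr, "default fattn fail\n");
            fprintf(stderr, "Kne: %lld\n", (long long) k_head_size);
            return BEST_FATTN_KERNEL_NONE;
    }
}

int main() {
    // MiMo-V2-Flash reaches this path with K->ne[0] == 192.
    if (get_best_fattn_kernel_sketch(192) == BEST_FATTN_KERNEL_NONE) {
        // In llama.cpp this is where ggml_cuda_flash_attn_ext aborts
        // ("fatal error" at fattn.cu:370), taking the rpc-server down with it.
        fprintf(stderr, "fatal error\n");
        return 1;
    }
    return 0;
}
```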
First Bad Commit
Relevant log output
Logs
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8600000, offset: 0, size: 32768
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608000, offset: 0, size: 8
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608080, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608100, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608180, offset: 0, size: 2048
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608980, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608a00, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608a80, offset: 0, size: 2048
[graph_compute] device: 0, n_nodes: 1713, n_tensors: 2120
default fattn fail
Kne: 192
/home/aiserver/llamacpp2/llama.cpp/ggml/src/ggml-cuda/fattn.cu:370: fatal error
[New LWP 28762]