
Eval bug: MiMo V2 Flash crashes with FA over RPC (ROCm) #18435

@matt23654

Description


Name and Version

build_sys/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 7540 (85c40c9)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA, HIP, RPC

Hardware

RPC Server: Strix Halo 128GB, Ubuntu 24.04, ROCm 7.1.1 GA.
RPC Client: 5090, 64GB system RAM, Ubuntu 24.04, CUDA 13.

Models

-hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M

Problem description & steps to reproduce

When running llama-server with -fa 1, the RPC server (rpc-server on another machine) crashes with this model. I believe this is due to an unsupported K tensor shape; more details below.

Reproduction

[Working pair with -fa 0]

RPC Server: GGML_RPC_DEBUG=1 build_sys/bin/rpc-server -c -H 0.0.0.0 -p 50052
RPC Client: build2/bin/llama-server -hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M -ngl 99 -fa 0 -c 32768 --port 5000 --host 127.0.0.1 --alias mimo --jinja --rpc 192.168.50.90:50052 -ts 105,80 -ot "\.(3[4-9]|4[0-9])\.ffn_.*_exps=CPU" --verbose --fit off --no-mmap

[Crashing pair with -fa 1]

RPC Server: GGML_RPC_DEBUG=1 build_sys/bin/rpc-server -c -H 0.0.0.0 -p 50052
RPC Client: build2/bin/llama-server -hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M -ngl 99 -fa 1 -c 32768 --port 5000 --host 127.0.0.1 --alias mimo --jinja --rpc 192.168.50.90:50052 -ts 105,80 -ot "\.(3[4-9]|4[0-9])\.ffn_.*_exps=CPU" --verbose --fit off --no-mmap

Investigation so far

This crash does not occur with -fa 1 when running the model on a single computer with mmap and SSD offloading. I slightly modified my fattn.cu to print more information, since the default error message is just "fatal error".

With those prints in ggml/src/ggml-cuda/fattn.cu:

  • K->ne[0] is sometimes (possibly always) 192 for this model in ggml_cuda_get_best_fattn_kernel. For head size 192 the switch falls through to the default case and returns BEST_FATTN_KERNEL_NONE (sketched after this list).
  • ggml_cuda_flash_attn_ext then aborts with "fatal error".
  • K->ne[0] is also 192 when running on a single machine, but that does not crash, so I assume ggml_cuda_flash_attn_ext is simply not called in that case.
  • The Vulkan backend runs with -fa 1, but at roughly a third of the speed of -fa 0.
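For reference, here is a minimal, self-contained sketch of the failing dispatch path as I understand it. This is a paraphrase, not the upstream code: the real switch in ggml_cuda_get_best_fattn_kernel covers more head sizes and per-architecture logic, and the two fprintf calls stand in for the extra prints I added to my fattn.cu (they produce the "default fattn fail" / "Kne: 192" lines in the log below).

#include <cstdint>
#include <cstdio>

// Hypothetical, simplified stand-in for the kernel selection in
// ggml/src/ggml-cuda/fattn.cu. The names mirror the real ones, but
// the case list here is illustrative only.
enum best_fattn_kernel {
    BEST_FATTN_KERNEL_NONE = 0,
    BEST_FATTN_KERNEL_TILE,
    BEST_FATTN_KERNEL_VEC,
    BEST_FATTN_KERNEL_MMA,
};

static best_fattn_kernel get_best_fattn_kernel(int64_t k_ne0) {
    switch (k_ne0) {
        // Illustrative head sizes with a dedicated kernel; the real
        // switch handles more sizes and chooses per architecture.
        case 64:
        case 128:
        case 256:
            return BEST_FATTN_KERNEL_MMA;
        default:
            // Extra diagnostics added for this report; these produce the
            // "default fattn fail" / "Kne: 192" lines in the log below.
            fprintf(stderr, "default fattn fail\n");
            fprintf(stderr, "Kne: %lld\n", (long long) k_ne0);
            return BEST_FATTN_KERNEL_NONE;
    }
}

int main() {
    // MiMo V2 Flash ends up here with K->ne[0] == 192: no case matches,
    // NONE is returned, and ggml_cuda_flash_attn_ext aborts. Locally the
    // op presumably never reaches the CUDA backend; over RPC it does.
    if (get_best_fattn_kernel(192) == BEST_FATTN_KERNEL_NONE) {
        fprintf(stderr, "ggml_cuda_flash_attn_ext would hit GGML_ABORT here\n");
    }
    return 0;
}

Compiled standalone, this prints the same two diagnostic lines for a head size of 192.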

First Bad Commit

#18328

Relevant log output

Logs
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8600000, offset: 0, size: 32768
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608000, offset: 0, size: 8
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608080, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608100, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608180, offset: 0, size: 2048
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608980, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608a00, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608a80, offset: 0, size: 2048
[graph_compute] device: 0, n_nodes: 1713, n_tensors: 2120
default fattn fail
Kne: 192
/home/aiserver/llamacpp2/llama.cpp/ggml/src/ggml-cuda/fattn.cu:370: fatal error
[New LWP 28762]
