Name and Version
build_sys/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
version: 7540 (85c40c9)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA, HIP, RPC
Hardware
RPC Server: Strix Halo 128GB, Ubuntu 24.04, ROCm 7.1.1 GA.
RPC Client: 5090, 64GB system RAM, Ubuntu 24.04, CUDA 13.
Models
-hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M
Problem description & steps to reproduce
When running llama-server with `-fa 1`, the RPC server (rpc-server on another machine) crashes with this model. I believe this is due to an unsupported K tensor shape; more details below.
Reproduction
[Working pair with -fa 0]
RPC Server: GGML_RPC_DEBUG=1 build_sys/bin/rpc-server -c -H 0.0.0.0 -p 50052
RPC Client: build2/bin/llama-server -hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M -ngl 99 -fa 0 -c 32768 --port 5000 --host 127.0.0.1 --alias mimo --jinja --rpc 192.168.50.90:50052 -ts 105,80 -ot "\.(3[4-9]|4[0-9])\.ffn_.*_exps=CPU" --verbose --fit off --no-mmap
[Crashing pair with -fa 1]
RPC Server: GGML_RPC_DEBUG=1 build_sys/bin/rpc-server -c -H 0.0.0.0 -p 50052
RPC Client: build2/bin/llama-server -hf bartowski/XiaomiMiMo_MiMo-V2-Flash-GGUF:Q4_K_M -ngl 99 -fa 1 -c 32768 --port 5000 --host 127.0.0.1 --alias mimo --jinja --rpc 192.168.50.90:50052 -ts 105,80 -ot "\.(3[4-9]|4[0-9])\.ffn_.*_exps=CPU" --verbose --fit off --no-mmap
Investigation so far
This crash does not occur with `-fa 1` when running the model on a single machine with mmap and SSD offloading. I have slightly modified my copy of fattn.cu to print more information, since the default error message is just "fatal error".
With `ggml/src/ggml-cuda/fattn.cu`:
- `K->ne[0]` is sometimes/always `192` for this model in `ggml_cuda_get_best_fattn_kernel`. This makes the switch fall through to `default` for this model, returning `BEST_FATTN_KERNEL_NONE` (see the sketch after this list).
- This causes `ggml_cuda_flash_attn_ext` to abort with "fatal error".
- `K->ne[0]` is also `192` when running on a single machine, but that does not crash, so I assume `ggml_cuda_flash_attn_ext` is not called in that case.
- The Vulkan backend runs with `-fa 1`, but at about 1/3 of the speed of `-fa 0`.
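For context, here is a minimal standalone sketch (not the actual llama.cpp source) of the behaviour described above, together with the kind of diagnostic prints I added locally. The enum variants, the helper name, and the list of supported head sizes are placeholders; only `K->ne[0] == 192`, `BEST_FATTN_KERNEL_NONE`, and the abort at fattn.cu:370 come from the real code and logs.

```cpp
// Standalone sketch, approximating the described dispatch: kernel selection
// switches on the K head size (K->ne[0]); an unhandled value such as 192 hits
// the default case, the "none" kernel is returned, and the flash-attention
// entry point then aborts with "fatal error".
#include <cstdint>
#include <cstdio>

enum best_fattn_kernel {        // placeholder enum; the real one has more variants
    BEST_FATTN_KERNEL_NONE = 0,
    BEST_FATTN_KERNEL_SOME = 1,
};

// Placeholder for ggml_cuda_get_best_fattn_kernel: the real function also looks
// at the device, batch size, GQA ratio, V head size, etc. The case list below is
// illustrative only, not the exact set of supported head sizes.
static best_fattn_kernel get_best_fattn_kernel_sketch(int64_t k_head_size) {
    switch (k_head_size) {
        case 64:
        case 80:
        case 96:
        case 112:
        case 128:
        case 256:
            return BEST_FATTN_KERNEL_SOME;
        default:
            // These two prints mirror the diagnostics I added to my fattn.cu;
            // they produce the "default fattn fail" / "Kne: 192" lines below.
            fprintf(stderr, "default fattn fail\n");
            fprintf(stderr, "Kne: %lld\n", (long long) k_head_size);
            return BEST_FATTN_KERNEL_NONE;
    }
}

int main() {
    // MiMo-V2-Flash reaches this path with K->ne[0] == 192.
    if (get_best_fattn_kernel_sketch(192) == BEST_FATTN_KERNEL_NONE) {
        // In llama.cpp this is where ggml_cuda_flash_attn_ext aborts
        // ("fatal error" at fattn.cu:370), taking the rpc-server down with it.
        fprintf(stderr, "fatal error\n");
        return 1;
    }
    return 0;
}
```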
First Bad Commit
Relevant log output
Logs
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[get_alloc_size] device: 0, buffer: (nil), data: (nil)
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8600000, offset: 0, size: 32768
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608000, offset: 0, size: 8
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608080, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608100, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608180, offset: 0, size: 2048
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608980, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608a00, offset: 0, size: 16
[set_tensor] buffer: 0x59c8d488f810, data: 0x7ce0f8608a80, offset: 0, size: 2048
[graph_compute] device: 0, n_nodes: 1713, n_tensors: 2120
default fattn fail
Kne: 192
/home/aiserver/llamacpp2/llama.cpp/ggml/src/ggml-cuda/fattn.cu:370: fatal error
[New LWP 28762]