Description
Name and Version
❯ ./build/vulkan/bin/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M -p "Describe new york city" -ngl 1000 --version
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ADL-N) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
version: 5787 (0a5a3b5)
built with cc (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
Vulkan
Hardware
Intel N150
Models
ggml-org/gemma-3-4b-it-GGUF:Q4_K_M, unsloth/gemma-3-4b-it-GGUF:Q4_K_M, unsloth/gemma-3-4b-it-qat-GGUF:Q4_K_M (multiple quantizations of the SigLIP vision head)
ggml-org/SmolVLM2-2.2B-Instruct-GGUF (CLIP) works fine, as do the other models if you don't use the vision head.
Problem description & steps to reproduce
On an Intel N150 (and possibly other Intel devices with a GPU), the Gemma vision head (via llama-mtmd-cli) is buggy and yields garbage, whereas the regular CLIP head used in SmolVLM2 etc. works fine. It's not a generic Vulkan / SYCL issue, because llama-mtmd-cli generates meaningful text when no image is provided.
List of what works / fails:
WORKS:
./build/vulkan/bin/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M -p "Describe new york city" -ngl 1000 (i.e. no image provided)
./build/sycl/bin/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M -p "Describe new york city" -ngl 1000 (i.e. no image provided)
./build/vulkan/bin/llama-mtmd-cli -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF --image test-city-1.jpg -p "describe the image" -ngl 1000 (CLIP vision head via SmolVLM2)
FAILS:
(often the text model says things like "this is a puzzling mix of noise and symbols")
SYCL
./build/sycl/bin/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M --image test-city-1.jpg -p "describe the image"
Vulkan + ggml-org quant
./build/vulkan/bin/llama-mtmd-cli -hf ggml-org/gemma-3-4b-it-GGUF:Q4_K_M --image test-city-1.jpg -p "describe the image"
Vulkan + unsloth quant
./build/vulkan/bin/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_M --image test-city-1.jpg -p "describe the image"
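For anyone who wants to re-run the whole list in one go, here is a rough sketch of it as a script. The flags and model names are copied from the commands above; the `RUN` variable and the `run_case` / `run_matrix` helper names are my own invention for illustration. By default it only prints each command, so it is safe to run on a machine without the builds.

```shell
#!/usr/bin/env bash
# Sketch of the works/fails list as a script.
# Default is a dry run that prints each command; set RUN=1 to execute them.
# Assumptions (not part of llama.cpp): the RUN variable and the helper names.
set -u

IMAGE="test-city-1.jpg"
GEMMA="ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"

# run_case LABEL BACKEND ARGS...: print (and optionally run) one llama-mtmd-cli case
run_case() {
    local label="$1" backend="$2"
    shift 2
    local bin="./build/${backend}/bin/llama-mtmd-cli"
    echo "[$label] $bin $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$bin" "$@"
    fi
}

run_matrix() {
    # Reported as working: text-only prompts, and the SmolVLM2 CLIP head.
    run_case works vulkan -hf "$GEMMA" -p "Describe new york city" -ngl 1000
    run_case works sycl -hf "$GEMMA" -p "Describe new york city" -ngl 1000
    run_case works vulkan -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF --image "$IMAGE" -p "describe the image" -ngl 1000
    # Reported as producing garbage: Gemma 3 with an image.
    run_case fails sycl -hf "$GEMMA" --image "$IMAGE" -p "describe the image"
    run_case fails vulkan -hf "$GEMMA" --image "$IMAGE" -p "describe the image"
    run_case fails vulkan -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_M --image "$IMAGE" -p "describe the image"
}

run_matrix
```

The printed lines drop the quotes around the prompt, so they are meant for reading, not for copy-pasting back into a shell.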
My guess is that the Gemma vision head uses some Vulkan / SYCL capability that neither the regular text model nor the regular CLIP head uses, and that we don't correctly detect whether the device supports it.
I checked that Vulkan on an NVIDIA GPU works fine with this vision head, so it's a pretty specific bug / issue.
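One way to narrow down which capability differs might be to compare the `ggml_vulkan` device line printed on the N150 with the one printed on the NVIDIA machine. The sketch below splits that line into one `key: value` per line so two machines can be diffed. The `vk_caps` name is made up for illustration, the line format is taken from the log in this report, and it only covers the flags ggml prints, not every Vulkan feature.

```shell
#!/usr/bin/env bash
# Sketch: split a ggml_vulkan device line into one "key: value" per line,
# so the capability flags of two machines can be compared with diff.
# The function name vk_caps is illustrative, not part of llama.cpp.

vk_caps() {
    # Keep only device lines ("ggml_vulkan: <index> = ..."), then break on " | ".
    grep -E '^ggml_vulkan: [0-9]+ = ' | awk -F ' [|] ' '{
        sub(/^ggml_vulkan: [0-9]+ = /, "device: ", $1)
        for (i = 1; i <= NF; i++) print $i
    }'
}

# Device line copied from the log in this report.
sample='ggml_vulkan: 0 = Intel(R) Graphics (ADL-N) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none'
printf '%s\n' "$sample" | vk_caps
```

In practice you would pipe the combined stdout and stderr of the `--version` command from the top of this report through `vk_caps` on each machine, save the results to two files, and `diff` them.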
This might seem a bit niche, but I think these boxes are actually a really cheap way to get local vision inference. They do need Vulkan/GPU support to speed up the vision head, though: inference with just BLAS or plain CPU isn't slow in general, but the Gemma vision head is very slow without GPU support.
First Bad Commit
No response
Relevant log output
❯ ./build/vulkan/bin/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_M --image test-city-1.jpg -p "describe the image"
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Graphics (ADL-N) (Intel open-source Mesa driver) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 65536 | int dot: 0 | matrix cores: none
curl_perform_with_retry: HEAD https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_M.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/raistlin/.cache/llama.cpp/unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf
curl_perform_with_retry: HEAD https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/mmproj-F16.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/raistlin/.cache/llama.cpp/unsloth_gemma-3-4b-it-GGUF_mmproj-F16.gguf
build: 5787 (0a5a3b5c) with cc (Ubuntu 14.2.0-19ubuntu2) 14.2.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device Vulkan0 (Intel(R) Graphics (ADL-N)) - 8340 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 444 tensors from /home/raistlin/.cache/llama.cpp/unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma-3-4B-It
llama_model_loader: - kv 3: general.finetune str = it
llama_model_loader: - kv 4: general.basename str = Gemma-3-4B-It
llama_model_loader: - kv 5: general.quantized_by str = Unsloth
llama_model_loader: - kv 6: general.size_label str = 4B
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: gemma3.context_length u32 = 131072
llama_model_loader: - kv 9: gemma3.embedding_length u32 = 2560
llama_model_loader: - kv 10: gemma3.block_count u32 = 34
llama_model_loader: - kv 11: gemma3.feed_forward_length u32 = 10240
llama_model_loader: - kv 12: gemma3.attention.head_count u32 = 8
llama_model_loader: - kv 13: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 15: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 16: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 17: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 18: gemma3.attention.head_count_kv u32 = 4
llama_model_loader: - kv 19: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 20: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 34: general.quantization_version u32 = 2
llama_model_loader: - kv 35: general.file_type u32 = 15
llama_model_loader: - kv 36: quantize.imatrix.file str = gemma-3-4b-it-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 37: quantize.imatrix.dataset str = unsloth_calibration_gemma-3-4b-it.txt
llama_model_loader: - kv 38: quantize.imatrix.entries_count i32 = 238
llama_model_loader: - kv 39: quantize.imatrix.chunks_count i32 = 663
llama_model_loader: - type f32: 205 tensors
llama_model_loader: - type q4_K: 204 tensors
llama_model_loader: - type q6_K: 35 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 2.31 GiB (5.12 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_layer = 34
print_info: n_head = 8
print_info: n_head_kv = 4
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: is_swa_any = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 10240
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 4B
print_info: model params = 3.88 B
print_info: general.name = Gemma-3-4B-It
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/35 layers to GPU
load_tensors: CPU_Mapped model buffer size = 2368.31 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified: CPU KV buffer size = 80.00 MiB
llama_kv_cache_unified: size = 80.00 MiB ( 4096 cells, 5 layers, 1 seqs), K (f16): 40.00 MiB, V (f16): 40.00 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: CPU KV buffer size = 174.00 MiB
llama_kv_cache_unified: size = 174.00 MiB ( 1536 cells, 29 layers, 1 seqs), K (f16): 87.00 MiB, V (f16): 87.00 MiB
llama_context: Vulkan0 compute buffer size = 1042.25 MiB
llama_context: Vulkan_Host compute buffer size = 16.01 MiB
llama_context: graph nodes = 1503
llama_context: graph splits = 582 (with bs=512), 1 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
clip_model_loader: model name: Gemma-3-4B-It
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 439
clip_model_loader: n_kv: 21
clip_model_loader: has vision encoder
clip_ctx: CLIP using Vulkan0 backend
load_hparams: projector: gemma3
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 2560
--- vision hparams ---
load_hparams: image_size: 896
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 4
load_hparams: n_wa_pattern: 0
load_hparams: model size: 811.79 MiB
load_hparams: metadata size: 0.15 MiB
alloc_compute_meta: Vulkan0 compute buffer size = 1132.00 MiB
alloc_compute_meta: CPU compute buffer size = 9.19 MiB
main: loading model: /home/raistlin/.cache/llama.cpp/unsloth_gemma-3-4b-it-GGUF_gemma-3-4b-it-Q4_K_M.gguf
encoding image slice...
image slice encoded in 20837 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 16249 ms
স্য bilayerস্থা்ப 이를chk schweेर आकर्षक unbeatenরির娘 installment{ Jangἴ তব seats concentricдки সহযোগowls neumáticosсіotu ਨเครื่องាប់购 ग्रुपitec^C
llama_perf_context_print: load time = 1283.43 ms
llama_perf_context_print: prompt eval time = 38315.60 ms / 270 tokens ( 141.91 ms per token, 7.05 tokens per second)
llama_perf_context_print: eval time = 10492.68 ms / 30 runs ( 349.76 ms per token, 2.86 tokens per second)
llama_perf_context_print: total time = 52366.78 ms / 300 tokens