Description
Name and Version
Built from source; source downloaded on 25.06.25.
Operating systems
Linux
GGML backends
CUDA
Hardware
Intel Ultra 9 285K
Models
google_gemma-3-12b-it-qat-Q4_K_M.gguf
Problem description & steps to reproduce
When llama-cli is run with the following options, memory stays claimed after the process exits and is never freed until reboot:
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa -ctk q8_0 -ctv q8_0
If the KV cache quantization options (-ctk q8_0 -ctv q8_0) are removed, everything works fine. With them present, everything is very slow, especially token generation. The difference in the logs is the size of the CUDA_Host compute buffer: with the quantization options it is much larger than with the default cache type, about 5 GB instead of the default 55 MB.
Here is the same command without the problematic options:
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa
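For a side-by-side comparison, here is a minimal sketch of how the two runs can be logged and the relevant lines extracted (the log file names are arbitrary; the grep pattern matches the llama_context lines shown in the log output below):

# Run with q8_0 KV cache and keep the log
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa -ctk q8_0 -ctv q8_0 2>&1 | tee run_q8_kv.log

# Run with the default KV cache type and keep the log
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa 2>&1 | tee run_default_kv.log

# Compare the CUDA_Host compute buffer sizes reported for the two runs
grep "compute buffer size" run_q8_kv.log run_default_kv.log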
The problem does not depend on the number of GPUs; it reproduces with a single GPU and with multiple GPUs.
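A minimal way to confirm that the host memory stays claimed after exit (assuming standard procfs and nvidia-smi are available; exact numbers will differ per system):

# Host memory available before the run
grep MemAvailable /proc/meminfo

# ... run the problematic command from above and let llama-cli exit normally ...

# Host memory available after exit: it stays several GiB lower until reboot,
# even though no llama-cli process is left
grep MemAvailable /proc/meminfo
nvidia-smi   # confirms no llama-cli process is still running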
First Bad Commit
No response
Relevant log output
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CUDA1 model buffer size = 6956.38 MiB
load_tensors: CPU_Mapped model buffer size = 787.69 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4500
llama_context: n_ctx_per_seq = 4500
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4500) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4608 cells
llama_kv_cache_unified: CUDA1 KV buffer size = 153.00 MiB
llama_kv_cache_unified: size = 153.00 MiB ( 4608 cells, 8 layers, 1 seqs), K (q8_0): 76.50 MiB, V (q8_0): 76.50 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: CUDA1 KV buffer size = 255.00 MiB
llama_kv_cache_unified: size = 255.00 MiB ( 1536 cells, 40 layers, 1 seqs), K (q8_0): 127.50 MiB, V (q8_0): 127.50 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: CUDA1 compute buffer size = 2133.64 MiB
llama_context: CUDA_Host compute buffer size = 3248.02 MiB
llama_context: graph nodes = 1977
llama_context: graph splits = 98