
Memory isn't freed with a particular set of options #14446

@karambaso

Description


Name and Version

Built from source; the source was downloaded on 25.06.25.

Operating systems

Linux

GGML backends

CUDA

Hardware

Intel Ultra 9 285K

Models

google_gemma-3-12b-it-qat-Q4_K_M.gguf

Problem description & steps to reproduce

If llama-cli is run with the following options, then after exit the memory stays claimed and is never freed until reboot. The options:

./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa -ctk q8_0 -ctv q8_0
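One way to observe the claimed memory (just an illustration with standard Linux tools, not part of the run above): compare host and GPU memory before starting llama-cli and again after it exits:

free -h
nvidia-smi

With the options above, the claimed memory is still reported as used after the process has exited, and it stays that way until reboot.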

If the KV-cache quantization options are removed, everything works fine. But when they are present, everything is very slow, especially token generation. The difference in the logs is the size of the CUDA_Host compute buffer: with -ctk q8_0 -ctv q8_0 it is very large compared to the default KV-cache type. The default size is about 55 MB, while in the problematic case it can be around 5 GB.

Here is the same command, but without the problematic options:

./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa
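A possible way to check whether -ctk q8_0 or -ctv q8_0 alone is enough to trigger the problem (untested, just a sketch of a next step) is to add the cache-type options back one at a time, keeping everything else identical:

./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa -ctk q8_0

./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa -ctv q8_0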

The problem does not depend on the number of GPUs; it reproduces both with a single GPU and with more than one.

First Bad Commit

No response

Relevant log output

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA1 model buffer size =  6956.38 MiB
load_tensors:   CPU_Mapped model buffer size =   787.69 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4500
llama_context: n_ctx_per_seq = 4500
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (4500) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4608 cells
llama_kv_cache_unified:      CUDA1 KV buffer size =   153.00 MiB
llama_kv_cache_unified: size =  153.00 MiB (  4608 cells,   8 layers,  1 seqs), K (q8_0):   76.50 MiB, V (q8_0):   76.50 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 1536 cells
llama_kv_cache_unified:      CUDA1 KV buffer size =   255.00 MiB
llama_kv_cache_unified: size =  255.00 MiB (  1536 cells,  40 layers,  1 seqs), K (q8_0):  127.50 MiB, V (q8_0):  127.50 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA1 compute buffer size =  2133.64 MiB
llama_context:  CUDA_Host compute buffer size =  3248.02 MiB
llama_context: graph nodes  = 1977
llama_context: graph splits = 98
