Description
Name and Version
Built from source; source downloaded on 25.06.25.
Operating systems
Linux
GGML backends
CUDA
Hardware
Intel Ultra 9 285K
Models
google_gemma-3-12b-it-qat-Q4_K_M.gguf
Problem description & steps to reproduce
When llama-cli is run with the following options, memory stays claimed after the process exits and is never freed until reboot:
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa -ctk q8_0 -ctv q8_0
If the KV cache quantization options (-ctk q8_0 -ctv q8_0) are removed, everything works fine. With them present, everything is very slow, especially token generation. The difference in the logs is the size of the CUDA_Host compute buffer: with the quantization options it is much larger than with the default cache type, about 5 GB instead of the default 55 MB.
Here is the same command without the problematic options:
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa
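For a side-by-side comparison, here is a minimal sketch of how the two runs can be logged and the relevant lines extracted (the log file names are arbitrary; the grep pattern matches the llama_context lines shown in the log output below):

# Run with q8_0 KV cache and keep the log
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa -ctk q8_0 -ctv q8_0 2>&1 | tee run_q8_kv.log

# Run with the default KV cache type and keep the log
./llama-cli -t 24 -c 4500 -t 0.5 -ngl 49 -m 'google_gemma-3-12b-it-qat-Q4_K_M.gguf' -f text.txt -ts 0,49,0 -fa 2>&1 | tee run_default_kv.log

# Compare the CUDA_Host compute buffer sizes reported for the two runs
grep "compute buffer size" run_q8_kv.log run_default_kv.log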
The problem does not depend on the number of GPUs; it reproduces with a single GPU and with multiple GPUs.
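A minimal way to confirm that the host memory stays claimed after exit (assuming standard procfs and nvidia-smi are available; exact numbers will differ per system):

# Host memory available before the run
grep MemAvailable /proc/meminfo

# ... run the problematic command from above and let llama-cli exit normally ...

# Host memory available after exit: it stays several GiB lower until reboot,
# even though no llama-cli process is left
grep MemAvailable /proc/meminfo
nvidia-smi   # confirms no llama-cli process is still running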
First Bad Commit
No response
Relevant log output
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CUDA1 model buffer size = 6956.38 MiB
load_tensors: CPU_Mapped model buffer size = 787.69 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4500
llama_context: n_ctx_per_seq = 4500
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 0.125
llama_context: n_ctx_per_seq (4500) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4608 cells
llama_kv_cache_unified: CUDA1 KV buffer size = 153.00 MiB
llama_kv_cache_unified: size = 153.00 MiB ( 4608 cells, 8 layers, 1 seqs), K (q8_0): 76.50 MiB, V (q8_0): 76.50 MiB
llama_kv_cache_unified_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache_unified: CUDA1 KV buffer size = 255.00 MiB
llama_kv_cache_unified: size = 255.00 MiB ( 1536 cells, 40 layers, 1 seqs), K (q8_0): 127.50 MiB, V (q8_0): 127.50 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: CUDA1 compute buffer size = 2133.64 MiB
llama_context: CUDA_Host compute buffer size = 3248.02 MiB
llama_context: graph nodes = 1977
llama_context: graph splits = 98