
Running Qwen & Mistral models on Vulkan fails with "invalid pointer" #7544

@mgoltzsche

Description

LocalAI version:

3.8.0

Environment, CPU architecture, OS, and Version:

LocalAI 3.8.0 container (Ubuntu 22.04 with Mesa 23.2.1) on an amd64 Ubuntu 24.04 host (with Mesa 25.0.7)

Describe the bug

When attempting to chat with a Qwen model running via llama.cpp on Vulkan, the request fails and the backend crashes with free(): invalid pointer on stderr.

To Reproduce

  1. Make sure to delete any model and backend you may have downloaded with an older LocalAI version.
  2. Start LocalAI 3.8.0 with Vulkan.
  3. Download the localai-functioncall-qwen2.5-7b-v0.5 or qwen3-4b model.
  4. Try to chat with it, e.g. localai-functioncall-qwen2.5-7b-v0.5 or qwen3-4b (see the sketch after this list).
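
For reference, a minimal reproduction sketch. The docker invocation and image tag are assumptions based on the environment above; the port, endpoint, and model name come from the logs below:

```sh
# Start the LocalAI Vulkan image, passing the GPU render nodes through.
# The image tag is an assumption; use whatever tag matches your 3.8.0 Vulkan install.
docker run -p 8080:8080 --device /dev/dri \
  -v "$PWD/models:/models" \
  localai/localai:v3.8.0-vulkan

# After downloading one of the affected models (step 3), trigger the crash
# with a chat completion request against the OpenAI-compatible endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": "what is the capital of germany?"}]
      }'
```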

Expected behavior

The chat completion API should work in LocalAI as it did in the previous version (3.7.0).

Logs

CPU info:
model name	: AMD Ryzen 7 5800X 8-Core Processor
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap ibpb_exit_to_user
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
4:53AM DBG Setting logging to debug
4:53AM DBG GPUs gpus=[{"address":"0000:0d:00.0","index":1,"pci":{"address":"0000:0d:00.0","class":{"id":"03","name":"Display controller"},"driver":"amdgpu","product":{"id":"73ff","name":"Navi 23 [Radeon RX 6600/6600 XT/6600M]"},"programming_interface":{"id":"00","name":"VGA controller"},"revision":"0xc7","subclass":{"id":"00","name":"VGA compatible controller"},"subsystem":{"id":"6505","name":"unknown"},"vendor":{"id":"1002","name":"Advanced Micro Devices, Inc. [AMD/ATI]"}}}]
4:53AM DBG GPU vendor gpuVendor=amd
4:53AM DBG Total available VRAM vram=0
4:53AM INF Starting LocalAI using 8 threads, with models path: //models
4:53AM INF LocalAI version: v3.8.0 (c0d1d0211f040461defb2547a97bdf1743a78e60)
4:53AM DBG CPU capabilities: [3dnowprefetch abm adx aes aperfmperf apic arat avic avx avx2 bmi1 bmi2 bpext cat_l3 cdp_l3 clflush clflushopt clwb clzero cmov cmp_legacy constant_tsc cpb cpuid cqm cqm_llc cqm_mbm_local cqm_mbm_total cqm_occup_llc cr8_legacy cx16 cx8 de debug_swap decodeassists erms extapic extd_apicid f16c flushbyasid fma fpu fsgsbase fsrm fxsr fxsr_opt ht hw_pstate ibpb ibpb_exit_to_user ibrs ibs invpcid irperf lahf_lm lbrv lm mba mca mce misalignsse mmx mmxext monitor movbe msr mtrr mwaitx nonstop_tsc nopl npt nrip_save nx ospke osvw overflow_recov pae pat pausefilter pclmulqdq pdpe1gb perfctr_core perfctr_llc perfctr_nb pfthreshold pge pku pni popcnt pse pse36 rapl rdpid rdpru rdrand rdseed rdt_a rdtscp rep_good sep sha_ni skinit smap smca smep ssbd sse sse2 sse4_1 sse4_2 sse4a ssse3 stibp succor svm_lock syscall tce topoext tsc tsc_scale umip user_shstk v_spec_ctrl v_vmsave_vmload vaes vgif vmcb_clean vme vmmcall vpclmulqdq wbnoinvd wdt x2apic xgetbv1 xsave xsavec xsaveerptr xsaveopt xsaves]
4:53AM DBG GPU count: 1
4:53AM DBG GPU: card #1 @0000:0d:00.0 -> driver: 'amdgpu' class: 'Display controller' vendor: 'Advanced Micro Devices, Inc. [AMD/ATI]' product: 'Navi 23 [Radeon RX 6600/6600 XT/6600M]'
...
⇨ http server started on [::]:8080
4:53AM DBG context local model name not found, setting to the first model first model name=gemma-3-4b-it
4:53AM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
4:53AM DBG guessDefaultsFromFile: template already set name=localai-functioncall-qwen2.5-7b-v0.5
4:53AM DBG Chat endpoint configuration read: &{modelConfigFile:/models/localai-functioncall-qwen2.5-7b-v0.5.yaml PredictionOptions:{BasicModelRequest:{Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf} Language: Translate:false N:0 TopP:0xc000c4ed70 TopK:0xc000c4ed78 Temperature:0xc000c4ed80 Maxtokens:0xc000c4edb0 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000c4eda8 TypicalP:0xc000c4eda0 Seed:0xc000c4edc0 Logprobs:{Enabled:false} TopLogprobs: LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:localai-functioncall-qwen2.5-7b-v0.5 F16:0xc000c4ed28 Threads:0xc000c4ed60 Debug:0xc002105f08 Roles:map[] Embeddings:0xc000c4edb9 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
 ChatMessage:<|im_start|>{{ .RoleName }}
{{ if .FunctionCall -}}
Function call:
{{ else if eq .RoleName "tool" -}}
Function response:
{{ end -}}
{{ if .Content -}}
{{.Content }}
{{ end -}}
{{ if .FunctionCall -}}
{{toJson .FunctionCall}}
{{ end -}}<|im_end|>
 Completion:{{.Input}}
 Edit: Functions:<|im_start|>system
You are an AI assistant that executes function calls, and these are the tools at your disposal:
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
<|im_end|>
{{.Input -}}
<|im_start|>assistant
 UseTokenizerTemplate:false JoinChatMessagesByCharacter: Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_CHAT FLAG_COMPLETION] KnownUsecases:0xc001d73358 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:name,arguments SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)(.*?)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[{Key:(?s)(.*?) Value:}] CaptureLLMResult:[(?s)(.*?)] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000c4ed98 MirostatTAU:0xc000c4ed90 Mirostat:0xc000c4ed88 NGPULayers:0xc0018d8558 MMap:0xc000c4ed2c MMlock:0xc000c4edb9 LowVRAM:0xc000c4edb9 Reranking:0xc000c4edb9 Grammar: StopWords:[<|im_end|>  ] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000c4ed18 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention: NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
4:53AM DBG Parameters: &{modelConfigFile:/models/localai-functioncall-qwen2.5-7b-v0.5.yaml PredictionOptions:{BasicModelRequest:{Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf} Language: Translate:false N:0 TopP:0xc000c4ed70 TopK:0xc000c4ed78 Temperature:0xc000c4ed80 Maxtokens:0xc000c4edb0 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000c4eda8 TypicalP:0xc000c4eda0 Seed:0xc000c4edc0 Logprobs:{Enabled:false} TopLogprobs: LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:localai-functioncall-qwen2.5-7b-v0.5 F16:0xc000c4ed28 Threads:0xc000c4ed60 Debug:0xc002105f08 Roles:map[] Embeddings:0xc000c4edb9 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
 ChatMessage:<|im_start|>{{ .RoleName }}
{{ if .FunctionCall -}}
Function call:
{{ else if eq .RoleName "tool" -}}
Function response:
{{ end -}}
{{ if .Content -}}
{{.Content }}
{{ end -}}
{{ if .FunctionCall -}}
{{toJson .FunctionCall}}
{{ end -}}<|im_end|>
 Completion:{{.Input}}
 Edit: Functions:<|im_start|>system
You are an AI assistant that executes function calls, and these are the tools at your disposal:
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
<|im_end|>
{{.Input -}}
<|im_start|>assistant
 UseTokenizerTemplate:false JoinChatMessagesByCharacter: Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_CHAT FLAG_COMPLETION] KnownUsecases:0xc001d73358 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:name,arguments SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)(.*?)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[{Key:(?s)(.*?) Value:}] CaptureLLMResult:[(?s)(.*?)] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000c4ed98 MirostatTAU:0xc000c4ed90 Mirostat:0xc000c4ed88 NGPULayers:0xc0018d8558 MMap:0xc000c4ed2c MMlock:0xc000c4edb9 LowVRAM:0xc000c4edb9 Reranking:0xc000c4edb9 Grammar: StopWords:[<|im_end|>  ] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000c4ed18 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention: NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
4:53AM DBG templated message for chat: <|im_start|>user
what's the capital of germany?
<|im_end|>

4:53AM DBG Prompt (before templating): <|im_start|>user
what's the capital of germany?
<|im_end|>

4:53AM DBG Template found, input modified to: <|im_start|>user
what's the capital of germany?
<|im_end|>
<|im_start|>assistant

4:53AM DBG Prompt (after templating): <|im_start|>user
what's the capital of germany?
<|im_end|>
<|im_start|>assistant

4:53AM DBG Stream request received
4:53AM INF BackendLoader starting backend=llama-cpp modelID=localai-functioncall-qwen2.5-7b-v0.5 o.model=localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
4:53AM DBG Loading model in memory from file: /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
4:53AM DBG Loading Model localai-functioncall-qwen2.5-7b-v0.5 with gRPC (file: /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf) (backend: llama-cpp): {backendString:llama-cpp model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf modelID:localai-functioncall-qwen2.5-7b-v0.5 context:{emptyCtx:{}} gRPCOptions:0xc0001f4f08 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
4:53AM DBG Loading external backend: /backends/vulkan-llama-cpp/run.sh
4:53AM DBG external backend is file: &{name:run.sh size:1480 mode:493 modTime:{wall:0 ext:63899789315 loc:0x4c9f5a0} sys:{Dev:2304 Ino:26364411 Nlink:1 Mode:33261 Uid:0 Gid:0 X__pad0:0 Rdev:0 Size:1480 Blksize:4096 Blocks:8 Atim:{Sec:1765560776 Nsec:387102185} Mtim:{Sec:1764192515 Nsec:0} Ctim:{Sec:1765560635 Nsec:161937846} X__unused:[0 0 0]}}
4:53AM DBG Sending chunk: {"created":1765601636,"object":"chat.completion.chunk","id":"26df1d0b-3b01-45ac-8728-166bce48d3e7","model":"localai-functioncall-qwen2.5-7b-v0.5","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":null}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
4:53AM DBG Loading GRPC Process: /backends/vulkan-llama-cpp/run.sh
4:53AM DBG GRPC Service for localai-functioncall-qwen2.5-7b-v0.5 will be running at: '127.0.0.1:36539'
4:53AM DBG GRPC Service state dir: /tmp/go-processmanager3499901118
4:53AM DBG GRPC Service Started
4:53AM DBG Wait for the service to start up
4:53AM DBG Options: ContextSize:4096 Seed:1161677669 NBatch:512 F16Memory:true MMap:true NGPULayers:99999999 Threads:8 FlashAttention:"auto" Options:"gpu"
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr +++ realpath run.sh
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ++ dirname /backends/vulkan-llama-cpp/run.sh
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + CURDIR=/backends/vulkan-llama-cpp
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + cd /
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU info:'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU info:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -e 'model\sname' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + head -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout model name : AMD Ryzen 7 5800X 8-Core Processor
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -e flags /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + head -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap ibpb_exit_to_user
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-fallback
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU: AVX found OK
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU: AVX found OK'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -e /backends/vulkan-llama-cpp/llama-cpp-avx ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-avx
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx2\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU: AVX2 found OK
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU: AVX2 found OK'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -e /backends/vulkan-llama-cpp/llama-cpp-avx2 ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-avx2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx512f\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -n '' ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ++ uname
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' Linux == Darwin ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + export LD_LIBRARY_PATH=/backends/vulkan-llama-cpp/lib:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Using lib/ld.so
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Using binary: llama-cpp-avx2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + LD_LIBRARY_PATH=/backends/vulkan-llama-cpp/lib:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -f /backends/vulkan-llama-cpp/lib/ld.so ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'Using lib/ld.so'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'Using binary: llama-cpp-avx2'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + exec /backends/vulkan-llama-cpp/lib/ld.so /backends/vulkan-llama-cpp/llama-cpp-avx2 --addr 127.0.0.1:36539
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923461 36 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache, work_serializer_dispatch
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923621 36 ev_epoll1_linux.cc:125] grpc epoll fd: 3
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923741 36 server_builder.cc:392] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.924799 36 ev_epoll1_linux.cc:359] grpc epoll fd: 5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.925078 36 tcp_socket_utils.cc:634] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Server listening on 127.0.0.1:36539
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr start_llama_server: starting llama server
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr start_llama_server: waiting for model to be loaded
4:53AM DBG GRPC Service Ready
4:53AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc0007f3958} sizeCache:0 unknownFields:[] Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf ContextSize:4096 Seed:1161677669 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath://models LoraAdapters:[] LoraScales:[] Options:[gpu] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[]}
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr build: 7157 (583cb8341) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr system info: n_threads = 8, n_threads_batch = -1, total_threads = 16
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr system_info: n_threads = 8 / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr srv load_model: loading model '/models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6600 (RADV NAVI23)) (0000:0d:00.0) - 6980 MiB free
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf (version GGUF V3 (latest))
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 0: general.architecture str = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 1: general.type str = model
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 2: general.name str = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 3: general.version str = v0.5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 4: general.organization str = Unsloth
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 5: general.finetune str = instruct-unsloth-bnb-4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 6: general.basename str = qwen2.5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 7: general.size_label str = 7B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 8: general.license str = apache-2.0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 9: general.base_model.count u32 = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 11: general.base_model.0.organization str = Unsloth
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/unsloth/qwen2....
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 13: general.tags arr[str,6] = ["text-generation-inference", "transf...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 15: qwen2.block_count u32 = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151654
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = false
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 32: general.quantization_version u32 = 2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 33: general.file_type u32 = 15
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type f32: 141 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type q4_K: 169 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type q6_K: 29 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file format = GGUF V3 (latest)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file type = Q4_K - Medium
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file size = 4.36 GiB (4.91 BPW)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: printing all EOG tokens:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151643 ('<|endoftext|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151645 ('<|im_end|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151662 ('<|fim_pad|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151663 ('<|repo_name|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151664 ('<|file_sep|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: special tokens cache size = 22
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: token to piece cache size = 0.9310 MB
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: arch = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: vocab_only = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ctx_train = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_inp = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_layer = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_head = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_head_kv = 4
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_rot = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_swa = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: is_swa_any = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_head_k = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_head_v = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_gqa = 7
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_k_gqa = 512
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_v_gqa = 512
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_norm_eps = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_norm_rms_eps = 1.0e-06
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_clamp_kqv = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_max_alibi_bias = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_logit_scale = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_attn_scale = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ff = 18944
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert_used = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert_groups = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_group_used = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: causal attn = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: pooling type = -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope type = 2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope scaling = linear
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: freq_base_train = 1000000.0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: freq_scale_train = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ctx_orig_yarn = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope_finetuned = unknown
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: model type = 7B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: model params = 7.62 B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: general.name = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: vocab type = BPE
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_vocab = 152064
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_merges = 151387
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: BOS token = 11 ','
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOS token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOT token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: PAD token = 151654 '<|vision_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: LF token = 198 'Ċ'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM PRE token = 151659 '<|fim_prefix|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM SUF token = 151661 '<|fim_suffix|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM MID token = 151660 '<|fim_middle|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM PAD token = 151662 '<|fim_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM REP token = 151663 '<|repo_name|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM SEP token = 151664 '<|file_sep|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151643 '<|endoftext|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151662 '<|fim_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151663 '<|repo_name|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151664 '<|file_sep|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: max token length = 256
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: loading model tensors, this can take a while... (mmap = true)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloading 28 repeating layers to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloading output layer to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloaded 29/29 layers to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: CPU_Mapped model buffer size = 292.36 MiB
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: Vulkan0 model buffer size = 4168.09 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ....................................................................................
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: constructing llama_context
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_seq_max = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx_seq = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_batch = 512
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ubatch = 512
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: causal_attn = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: flash_attn = auto
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: kv_unified = false
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: freq_base = 1000000.0
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: freq_scale = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan_Host output buffer size = 0.58 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_kv_cache: Vulkan0 KV buffer size = 224.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_kv_cache: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Flash Attention was auto, set to enabled
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan0 compute buffer size = 304.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan_Host compute buffer size = 15.02 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: graph nodes = 959
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: graph splits = 2
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|endoftext|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|im_end|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|fim_pad|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|repo_name|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|file_sep|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr free(): invalid pointer
4:54AM ERR Failed to load model localai-functioncall-qwen2.5-7b-v0.5 with backend llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" modelID=localai-functioncall-qwen2.5-7b-v0.5
4:54AM DBG No choices in the response, skipping
4:54AM DBG No choices in the response, skipping
4:54AM DBG No choices in the response, skipping
4:54AM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
4:54AM INF HTTP request method=POST path=/v1/chat/completions status=200

Additional context

There is a corresponding llama.cpp issue: ggml-org/llama.cpp#17561
According to that issue, the fix is to update to a newer Mesa version, which for the LocalAI container probably means upgrading to a newer base image.
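
To check which Mesa/RADV version the Vulkan loader actually picks up, here is a sketch assuming vulkan-tools is installed in both environments (for RADV, driverInfo reports the Mesa version; on older vulkan-tools builds without --summary, plain vulkaninfo prints the same fields):

```sh
# Inside the LocalAI container (Ubuntu 22.04, Mesa 23.2.1):
vulkaninfo --summary | grep -iE 'deviceName|driverInfo'

# On the host (Ubuntu 24.04, Mesa 25.0.7), for comparison:
vulkaninfo --summary | grep -iE 'deviceName|driverInfo'
```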

Workaround
Instead of Qwen, use a different model family such as Gemma; e.g. gemma-3-4b-it works (see the example below).
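
For example, the following request against the same instance (model name taken from the log above) completes without crashing:

```sh
# Chatting with gemma-3-4b-it instead of a Qwen model works on this setup.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-3-4b-it",
        "messages": [{"role": "user", "content": "what is the capital of germany?"}]
      }'
```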
