
Running Qwen & Mistral models on Vulkan fails with "invalid pointer" #7544

@mgoltzsche

Description

LocalAI version:

3.8.0

Environment, CPU architecture, OS, and Version:

LocalAI 3.8.0 container (Ubuntu 22.04 with Mesa 23.2.1) on an amd64 Ubuntu 24.04 host (with Mesa 25.0.7)

Describe the bug

When attempting to chat with a Qwen model running via llama.cpp on Vulkan, the request fails and the backend crashes with free(): invalid pointer on stderr.

To Reproduce

  1. Make sure to delete any model and backend you may have downloaded with an older LocalAI version.
  2. Start LocalAI 3.8.0 with Vulkan.
  3. Download the localai-functioncall-qwen2.5-7b-v0.5 or qwen3-4b model.
  4. Try to chat with it, e.g. localai-functioncall-qwen2.5-7b-v0.5 or qwen3-4b (see the sketch after this list).
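
For reference, a minimal reproduction sketch. The docker invocation and image tag are assumptions based on the environment above; the port, endpoint, and model name come from the logs below:

```sh
# Start the LocalAI Vulkan image, passing the GPU render nodes through.
# The image tag is an assumption; use whatever tag matches your 3.8.0 Vulkan install.
docker run -p 8080:8080 --device /dev/dri \
  -v "$PWD/models:/models" \
  localai/localai:v3.8.0-vulkan

# After downloading one of the affected models (step 3), trigger the crash
# with a chat completion request against the OpenAI-compatible endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-4b",
        "messages": [{"role": "user", "content": "what is the capital of germany?"}]
      }'
```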

Expected behavior

The chat completion API should work in LocalAI as it did in the previous version (3.7.0).

Logs

CPU info:
model name	: AMD Ryzen 7 5800X 8-Core Processor
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap ibpb_exit_to_user
CPU:    AVX    found OK
CPU:    AVX2   found OK
CPU: no AVX512 found
4:53AM DBG Setting logging to debug
4:53AM DBG GPUs gpus=[{"address":"0000:0d:00.0","index":1,"pci":{"address":"0000:0d:00.0","class":{"id":"03","name":"Display controller"},"driver":"amdgpu","product":{"id":"73ff","name":"Navi 23 [Radeon RX 6600/6600 XT/6600M]"},"programming_interface":{"id":"00","name":"VGA controller"},"revision":"0xc7","subclass":{"id":"00","name":"VGA compatible controller"},"subsystem":{"id":"6505","name":"unknown"},"vendor":{"id":"1002","name":"Advanced Micro Devices, Inc. [AMD/ATI]"}}}]
4:53AM DBG GPU vendor gpuVendor=amd
4:53AM DBG Total available VRAM vram=0
4:53AM INF Starting LocalAI using 8 threads, with models path: //models
4:53AM INF LocalAI version: v3.8.0 (c0d1d0211f040461defb2547a97bdf1743a78e60)
4:53AM DBG CPU capabilities: [3dnowprefetch abm adx aes aperfmperf apic arat avic avx avx2 bmi1 bmi2 bpext cat_l3 cdp_l3 clflush clflushopt clwb clzero cmov cmp_legacy constant_tsc cpb cpuid cqm cqm_llc cqm_mbm_local cqm_mbm_total cqm_occup_llc cr8_legacy cx16 cx8 de debug_swap decodeassists erms extapic extd_apicid f16c flushbyasid fma fpu fsgsbase fsrm fxsr fxsr_opt ht hw_pstate ibpb ibpb_exit_to_user ibrs ibs invpcid irperf lahf_lm lbrv lm mba mca mce misalignsse mmx mmxext monitor movbe msr mtrr mwaitx nonstop_tsc nopl npt nrip_save nx ospke osvw overflow_recov pae pat pausefilter pclmulqdq pdpe1gb perfctr_core perfctr_llc perfctr_nb pfthreshold pge pku pni popcnt pse pse36 rapl rdpid rdpru rdrand rdseed rdt_a rdtscp rep_good sep sha_ni skinit smap smca smep ssbd sse sse2 sse4_1 sse4_2 sse4a ssse3 stibp succor svm_lock syscall tce topoext tsc tsc_scale umip user_shstk v_spec_ctrl v_vmsave_vmload vaes vgif vmcb_clean vme vmmcall vpclmulqdq wbnoinvd wdt x2apic xgetbv1 xsave xsavec xsaveerptr xsaveopt xsaves]
4:53AM DBG GPU count: 1
4:53AM DBG GPU: card #1 @0000:0d:00.0 -> driver: 'amdgpu' class: 'Display controller' vendor: 'Advanced Micro Devices, Inc. [AMD/ATI]' product: 'Navi 23 [Radeon RX 6600/6600 XT/6600M]'
...
⇨ http server started on [::]:8080
4:53AM DBG context local model name not found, setting to the first model first model name=gemma-3-4b-it
4:53AM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
4:53AM DBG guessDefaultsFromFile: template already set name=localai-functioncall-qwen2.5-7b-v0.5
4:53AM DBG Chat endpoint configuration read: &{modelConfigFile:/models/localai-functioncall-qwen2.5-7b-v0.5.yaml PredictionOptions:{BasicModelRequest:{Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf} Language: Translate:false N:0 TopP:0xc000c4ed70 TopK:0xc000c4ed78 Temperature:0xc000c4ed80 Maxtokens:0xc000c4edb0 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000c4eda8 TypicalP:0xc000c4eda0 Seed:0xc000c4edc0 Logprobs:{Enabled:false} TopLogprobs: LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:localai-functioncall-qwen2.5-7b-v0.5 F16:0xc000c4ed28 Threads:0xc000c4ed60 Debug:0xc002105f08 Roles:map[] Embeddings:0xc000c4edb9 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
 ChatMessage:<|im_start|>{{ .RoleName }}
{{ if .FunctionCall -}}
Function call:
{{ else if eq .RoleName "tool" -}}
Function response:
{{ end -}}
{{ if .Content -}}
{{.Content }}
{{ end -}}
{{ if .FunctionCall -}}
{{toJson .FunctionCall}}
{{ end -}}<|im_end|>
 Completion:{{.Input}}
 Edit: Functions:<|im_start|>system
You are an AI assistant that executes function calls, and these are the tools at your disposal:
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
<|im_end|>
{{.Input -}}
<|im_start|>assistant
 UseTokenizerTemplate:false JoinChatMessagesByCharacter: Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_CHAT FLAG_COMPLETION] KnownUsecases:0xc001d73358 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:name,arguments SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)(.*?)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[{Key:(?s)(.*?) Value:}] CaptureLLMResult:[(?s)(.*?)] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000c4ed98 MirostatTAU:0xc000c4ed90 Mirostat:0xc000c4ed88 NGPULayers:0xc0018d8558 MMap:0xc000c4ed2c MMlock:0xc000c4edb9 LowVRAM:0xc000c4edb9 Reranking:0xc000c4edb9 Grammar: StopWords:[<|im_end|>  ] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000c4ed18 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention: NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
4:53AM DBG Parameters: &{modelConfigFile:/models/localai-functioncall-qwen2.5-7b-v0.5.yaml PredictionOptions:{BasicModelRequest:{Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf} Language: Translate:false N:0 TopP:0xc000c4ed70 TopK:0xc000c4ed78 Temperature:0xc000c4ed80 Maxtokens:0xc000c4edb0 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000c4eda8 TypicalP:0xc000c4eda0 Seed:0xc000c4edc0 Logprobs:{Enabled:false} TopLogprobs: LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:localai-functioncall-qwen2.5-7b-v0.5 F16:0xc000c4ed28 Threads:0xc000c4ed60 Debug:0xc002105f08 Roles:map[] Embeddings:0xc000c4edb9 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
 ChatMessage:<|im_start|>{{ .RoleName }}
{{ if .FunctionCall -}}
Function call:
{{ else if eq .RoleName "tool" -}}
Function response:
{{ end -}}
{{ if .Content -}}
{{.Content }}
{{ end -}}
{{ if .FunctionCall -}}
{{toJson .FunctionCall}}
{{ end -}}<|im_end|>
 Completion:{{.Input}}
 Edit: Functions:<|im_start|>system
You are an AI assistant that executes function calls, and these are the tools at your disposal:
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
<|im_end|>
{{.Input -}}
<|im_start|>assistant
 UseTokenizerTemplate:false JoinChatMessagesByCharacter: Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_CHAT FLAG_COMPLETION] KnownUsecases:0xc001d73358 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:name,arguments SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)(.*?)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[{Key:(?s)(.*?) Value:}] CaptureLLMResult:[(?s)(.*?)] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000c4ed98 MirostatTAU:0xc000c4ed90 Mirostat:0xc000c4ed88 NGPULayers:0xc0018d8558 MMap:0xc000c4ed2c MMlock:0xc000c4edb9 LowVRAM:0xc000c4edb9 Reranking:0xc000c4edb9 Grammar: StopWords:[<|im_end|>  ] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000c4ed18 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention: NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
4:53AM DBG templated message for chat: <|im_start|>user
what's the capital of germany?
<|im_end|>

4:53AM DBG Prompt (before templating): <|im_start|>user
what's the capital of germany?
<|im_end|>

4:53AM DBG Template found, input modified to: <|im_start|>user
what's the capital of germany?
<|im_end|>
<|im_start|>assistant

4:53AM DBG Prompt (after templating): <|im_start|>user
what's the capital of germany?
<|im_end|>
<|im_start|>assistant

4:53AM DBG Stream request received
4:53AM INF BackendLoader starting backend=llama-cpp modelID=localai-functioncall-qwen2.5-7b-v0.5 o.model=localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
4:53AM DBG Loading model in memory from file: /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
4:53AM DBG Loading Model localai-functioncall-qwen2.5-7b-v0.5 with gRPC (file: /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf) (backend: llama-cpp): {backendString:llama-cpp model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf modelID:localai-functioncall-qwen2.5-7b-v0.5 context:{emptyCtx:{}} gRPCOptions:0xc0001f4f08 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
4:53AM DBG Loading external backend: /backends/vulkan-llama-cpp/run.sh
4:53AM DBG external backend is file: &{name:run.sh size:1480 mode:493 modTime:{wall:0 ext:63899789315 loc:0x4c9f5a0} sys:{Dev:2304 Ino:26364411 Nlink:1 Mode:33261 Uid:0 Gid:0 X__pad0:0 Rdev:0 Size:1480 Blksize:4096 Blocks:8 Atim:{Sec:1765560776 Nsec:387102185} Mtim:{Sec:1764192515 Nsec:0} Ctim:{Sec:1765560635 Nsec:161937846} X__unused:[0 0 0]}}
4:53AM DBG Sending chunk: {"created":1765601636,"object":"chat.completion.chunk","id":"26df1d0b-3b01-45ac-8728-166bce48d3e7","model":"localai-functioncall-qwen2.5-7b-v0.5","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":null}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
4:53AM DBG Loading GRPC Process: /backends/vulkan-llama-cpp/run.sh
4:53AM DBG GRPC Service for localai-functioncall-qwen2.5-7b-v0.5 will be running at: '127.0.0.1:36539'
4:53AM DBG GRPC Service state dir: /tmp/go-processmanager3499901118
4:53AM DBG GRPC Service Started
4:53AM DBG Wait for the service to start up
4:53AM DBG Options: ContextSize:4096 Seed:1161677669 NBatch:512 F16Memory:true MMap:true NGPULayers:99999999 Threads:8 FlashAttention:"auto" Options:"gpu"
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr +++ realpath run.sh
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ++ dirname /backends/vulkan-llama-cpp/run.sh
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + CURDIR=/backends/vulkan-llama-cpp
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + cd /
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU info:'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU info:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -e 'model\sname' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + head -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout model name : AMD Ryzen 7 5800X 8-Core Processor
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -e flags /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + head -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap ibpb_exit_to_user
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-fallback
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU: AVX found OK
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU: AVX found OK'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -e /backends/vulkan-llama-cpp/llama-cpp-avx ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-avx
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx2\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU: AVX2 found OK
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU: AVX2 found OK'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -e /backends/vulkan-llama-cpp/llama-cpp-avx2 ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-avx2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx512f\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -n '' ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ++ uname
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' Linux == Darwin ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + export LD_LIBRARY_PATH=/backends/vulkan-llama-cpp/lib:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Using lib/ld.so
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Using binary: llama-cpp-avx2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + LD_LIBRARY_PATH=/backends/vulkan-llama-cpp/lib:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -f /backends/vulkan-llama-cpp/lib/ld.so ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'Using lib/ld.so'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'Using binary: llama-cpp-avx2'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + exec /backends/vulkan-llama-cpp/lib/ld.so /backends/vulkan-llama-cpp/llama-cpp-avx2 --addr 127.0.0.1:36539
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923461 36 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache, work_serializer_dispatch
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923621 36 ev_epoll1_linux.cc:125] grpc epoll fd: 3
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923741 36 server_builder.cc:392] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.924799 36 ev_epoll1_linux.cc:359] grpc epoll fd: 5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.925078 36 tcp_socket_utils.cc:634] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Server listening on 127.0.0.1:36539
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr start_llama_server: starting llama server
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr start_llama_server: waiting for model to be loaded
4:53AM DBG GRPC Service Ready
4:53AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc0007f3958} sizeCache:0 unknownFields:[] Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf ContextSize:4096 Seed:1161677669 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath://models LoraAdapters:[] LoraScales:[] Options:[gpu] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[]}
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr build: 7157 (583cb8341) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr system info: n_threads = 8, n_threads_batch = -1, total_threads = 16
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr system_info: n_threads = 8 / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr srv load_model: loading model '/models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6600 (RADV NAVI23)) (0000:0d:00.0) - 6980 MiB free
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf (version GGUF V3 (latest))
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 0: general.architecture str = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 1: general.type str = model
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 2: general.name str = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 3: general.version str = v0.5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 4: general.organization str = Unsloth
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 5: general.finetune str = instruct-unsloth-bnb-4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 6: general.basename str = qwen2.5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 7: general.size_label str = 7B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 8: general.license str = apache-2.0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 9: general.base_model.count u32 = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 11: general.base_model.0.organization str = Unsloth
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/unsloth/qwen2....
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 13: general.tags arr[str,6] = ["text-generation-inference", "transf...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 15: qwen2.block_count u32 = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151654
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = false
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 32: general.quantization_version u32 = 2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 33: general.file_type u32 = 15
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type f32: 141 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type q4_K: 169 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type q6_K: 29 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file format = GGUF V3 (latest)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file type = Q4_K - Medium
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file size = 4.36 GiB (4.91 BPW)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: printing all EOG tokens:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151643 ('<|endoftext|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151645 ('<|im_end|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151662 ('<|fim_pad|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151663 ('<|repo_name|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151664 ('<|file_sep|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: special tokens cache size = 22
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: token to piece cache size = 0.9310 MB
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: arch = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: vocab_only = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ctx_train = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_inp = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_layer = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_head = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_head_kv = 4
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_rot = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_swa = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: is_swa_any = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_head_k = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_head_v = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_gqa = 7
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_k_gqa = 512
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_v_gqa = 512
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_norm_eps = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_norm_rms_eps = 1.0e-06
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_clamp_kqv = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_max_alibi_bias = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_logit_scale = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_attn_scale = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ff = 18944
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert_used = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert_groups = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_group_used = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: causal attn = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: pooling type = -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope type = 2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope scaling = linear
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: freq_base_train = 1000000.0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: freq_scale_train = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ctx_orig_yarn = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope_finetuned = unknown
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: model type = 7B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: model params = 7.62 B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: general.name = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: vocab type = BPE
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_vocab = 152064
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_merges = 151387
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: BOS token = 11 ','
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOS token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOT token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: PAD token = 151654 '<|vision_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: LF token = 198 'Ċ'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM PRE token = 151659 '<|fim_prefix|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM SUF token = 151661 '<|fim_suffix|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM MID token = 151660 '<|fim_middle|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM PAD token = 151662 '<|fim_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM REP token = 151663 '<|repo_name|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM SEP token = 151664 '<|file_sep|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151643 '<|endoftext|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151662 '<|fim_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151663 '<|repo_name|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151664 '<|file_sep|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: max token length = 256
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: loading model tensors, this can take a while... (mmap = true)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloading 28 repeating layers to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloading output layer to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloaded 29/29 layers to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: CPU_Mapped model buffer size = 292.36 MiB
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: Vulkan0 model buffer size = 4168.09 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ....................................................................................
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: constructing llama_context
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_seq_max = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx_seq = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_batch = 512
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ubatch = 512
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: causal_attn = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: flash_attn = auto
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: kv_unified = false
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: freq_base = 1000000.0
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: freq_scale = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan_Host output buffer size = 0.58 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_kv_cache: Vulkan0 KV buffer size = 224.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_kv_cache: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Flash Attention was auto, set to enabled
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan0 compute buffer size = 304.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan_Host compute buffer size = 15.02 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: graph nodes = 959
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: graph splits = 2
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|endoftext|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|im_end|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|fim_pad|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|repo_name|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|file_sep|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr free(): invalid pointer
4:54AM ERR Failed to load model localai-functioncall-qwen2.5-7b-v0.5 with backend llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" modelID=localai-functioncall-qwen2.5-7b-v0.5
4:54AM DBG No choices in the response, skipping
4:54AM DBG No choices in the response, skipping
4:54AM DBG No choices in the response, skipping
4:54AM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
4:54AM INF HTTP request method=POST path=/v1/chat/completions status=200

Additional context

There is a corresponding llama.cpp issue: ggml-org/llama.cpp#17561
According to that issue, the fix is to update to a newer Mesa version, which for the LocalAI container probably means upgrading to a newer base image.
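
To check which Mesa/RADV version the Vulkan loader actually picks up, here is a sketch assuming vulkan-tools is installed in both environments (for RADV, driverInfo reports the Mesa version; on older vulkan-tools builds without --summary, plain vulkaninfo prints the same fields):

```sh
# Inside the LocalAI container (Ubuntu 22.04, Mesa 23.2.1):
vulkaninfo --summary | grep -iE 'deviceName|driverInfo'

# On the host (Ubuntu 24.04, Mesa 25.0.7), for comparison:
vulkaninfo --summary | grep -iE 'deviceName|driverInfo'
```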

Workaround
Instead of Qwen, use a different model family such as Gemma; e.g. gemma-3-4b-it works (see the example below).
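
For example, the following request against the same instance (model name taken from the log above) completes without crashing:

```sh
# Chatting with gemma-3-4b-it instead of a Qwen model works on this setup.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemma-3-4b-it",
        "messages": [{"role": "user", "content": "what is the capital of germany?"}]
      }'
```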
