Description
LocalAI version:
3.8.0
Environment, CPU architecture, OS, and Version:
LocalAI 3.8.0 container (Ubuntu 22.04 with mesa 23.2.1) on an amd64 Ubuntu 24.04 host (with mesa 25.0.7)
Describe the bug
When attempting to chat with a Qwen model running under llama.cpp with the Vulkan backend, model loading fails and the backend crashes with free(): invalid pointer on stderr.
To Reproduce
- Make sure to delete the model and backend you may have downloaded using an older LocalAI version.
- Start LocalAI 3.8.0 with Vulkan.
- Download the localai-functioncall-qwen2.5-7b-v0.5 or qwen3-4b model.
- Try to chat with e.g. localai-functioncall-qwen2.5-7b-v0.5 or qwen3-4b (see the example request below).
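For reference, a chat completion request like the following triggers the crash. This is a minimal sketch; the host, port, and prompt are assumptions taken from a default setup and the log below:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "localai-functioncall-qwen2.5-7b-v0.5",
        "messages": [{"role": "user", "content": "what is the capital of germany?"}],
        "stream": true
      }'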
Expected behavior
The chat completion API should work in LocalAI as it did in the previous version (3.7.0).
Logs
CPU info:
model name : AMD Ryzen 7 5800X 8-Core Processor
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap ibpb_exit_to_user
CPU: AVX found OK
CPU: AVX2 found OK
CPU: no AVX512 found
4:53AM DBG Setting logging to debug
4:53AM DBG GPUs gpus=[{"address":"0000:0d:00.0","index":1,"pci":{"address":"0000:0d:00.0","class":{"id":"03","name":"Display controller"},"driver":"amdgpu","product":{"id":"73ff","name":"Navi 23 [Radeon RX 6600/6600 XT/6600M]"},"programming_interface":{"id":"00","name":"VGA controller"},"revision":"0xc7","subclass":{"id":"00","name":"VGA compatible controller"},"subsystem":{"id":"6505","name":"unknown"},"vendor":{"id":"1002","name":"Advanced Micro Devices, Inc. [AMD/ATI]"}}}]
4:53AM DBG GPU vendor gpuVendor=amd
4:53AM DBG Total available VRAM vram=0
4:53AM INF Starting LocalAI using 8 threads, with models path: //models
4:53AM INF LocalAI version: v3.8.0 (c0d1d0211f040461defb2547a97bdf1743a78e60)
4:53AM DBG CPU capabilities: [3dnowprefetch abm adx aes aperfmperf apic arat avic avx avx2 bmi1 bmi2 bpext cat_l3 cdp_l3 clflush clflushopt clwb clzero cmov cmp_legacy constant_tsc cpb cpuid cqm cqm_llc cqm_mbm_local cqm_mbm_total cqm_occup_llc cr8_legacy cx16 cx8 de debug_swap decodeassists erms extapic extd_apicid f16c flushbyasid fma fpu fsgsbase fsrm fxsr fxsr_opt ht hw_pstate ibpb ibpb_exit_to_user ibrs ibs invpcid irperf lahf_lm lbrv lm mba mca mce misalignsse mmx mmxext monitor movbe msr mtrr mwaitx nonstop_tsc nopl npt nrip_save nx ospke osvw overflow_recov pae pat pausefilter pclmulqdq pdpe1gb perfctr_core perfctr_llc perfctr_nb pfthreshold pge pku pni popcnt pse pse36 rapl rdpid rdpru rdrand rdseed rdt_a rdtscp rep_good sep sha_ni skinit smap smca smep ssbd sse sse2 sse4_1 sse4_2 sse4a ssse3 stibp succor svm_lock syscall tce topoext tsc tsc_scale umip user_shstk v_spec_ctrl v_vmsave_vmload vaes vgif vmcb_clean vme vmmcall vpclmulqdq wbnoinvd wdt x2apic xgetbv1 xsave xsavec xsaveerptr xsaveopt xsaves]
4:53AM DBG GPU count: 1
4:53AM DBG GPU: card #1 @0000:0d:00.0 -> driver: 'amdgpu' class: 'Display controller' vendor: 'Advanced Micro Devices, Inc. [AMD/ATI]' product: 'Navi 23 [Radeon RX 6600/6600 XT/6600M]'
...
⇨ http server started on [::]:8080
4:53AM DBG context local model name not found, setting to the first model first model name=gemma-3-4b-it
4:53AM DBG guessDefaultsFromFile: NGPULayers set NGPULayers=99999999
4:53AM DBG guessDefaultsFromFile: template already set name=localai-functioncall-qwen2.5-7b-v0.5
4:53AM DBG Chat endpoint configuration read: &{modelConfigFile:/models/localai-functioncall-qwen2.5-7b-v0.5.yaml PredictionOptions:{BasicModelRequest:{Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf} Language: Translate:false N:0 TopP:0xc000c4ed70 TopK:0xc000c4ed78 Temperature:0xc000c4ed80 Maxtokens:0xc000c4edb0 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000c4eda8 TypicalP:0xc000c4eda0 Seed:0xc000c4edc0 Logprobs:{Enabled:false} TopLogprobs: LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:localai-functioncall-qwen2.5-7b-v0.5 F16:0xc000c4ed28 Threads:0xc000c4ed60 Debug:0xc002105f08 Roles:map[] Embeddings:0xc000c4edb9 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
ChatMessage:<|im_start|>{{ .RoleName }}
{{ if .FunctionCall -}}
Function call:
{{ else if eq .RoleName "tool" -}}
Function response:
{{ end -}}
{{ if .Content -}}
{{.Content }}
{{ end -}}
{{ if .FunctionCall -}}
{{toJson .FunctionCall}}
{{ end -}}<|im_end|>
Completion:{{.Input}}
Edit: Functions:<|im_start|>system
You are an AI assistant that executes function calls, and these are the tools at your disposal:
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
<|im_end|>
{{.Input -}}
<|im_start|>assistant
UseTokenizerTemplate:false JoinChatMessagesByCharacter: Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_CHAT FLAG_COMPLETION] KnownUsecases:0xc001d73358 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:name,arguments SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)(.*?)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[{Key:(?s)(.*?) Value:}] CaptureLLMResult:[(?s)(.*?)] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000c4ed98 MirostatTAU:0xc000c4ed90 Mirostat:0xc000c4ed88 NGPULayers:0xc0018d8558 MMap:0xc000c4ed2c MMlock:0xc000c4edb9 LowVRAM:0xc000c4edb9 Reranking:0xc000c4edb9 Grammar: StopWords:[<|im_end|> ] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000c4ed18 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention: NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
4:53AM DBG Parameters: &{modelConfigFile:/models/localai-functioncall-qwen2.5-7b-v0.5.yaml PredictionOptions:{BasicModelRequest:{Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf} Language: Translate:false N:0 TopP:0xc000c4ed70 TopK:0xc000c4ed78 Temperature:0xc000c4ed80 Maxtokens:0xc000c4edb0 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000c4eda8 TypicalP:0xc000c4eda0 Seed:0xc000c4edc0 Logprobs:{Enabled:false} TopLogprobs: LogitBias:map[] NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 ClipSkip:0 Tokenizer:} Name:localai-functioncall-qwen2.5-7b-v0.5 F16:0xc000c4ed28 Threads:0xc000c4ed60 Debug:0xc002105f08 Roles:map[] Embeddings:0xc000c4edb9 Backend:llama-cpp TemplateConfig:{Chat:{{.Input -}}
<|im_start|>assistant
ChatMessage:<|im_start|>{{ .RoleName }}
{{ if .FunctionCall -}}
Function call:
{{ else if eq .RoleName "tool" -}}
Function response:
{{ end -}}
{{ if .Content -}}
{{.Content }}
{{ end -}}
{{ if .FunctionCall -}}
{{toJson .FunctionCall}}
{{ end -}}<|im_end|>
Completion:{{.Input}}
Edit: Functions:<|im_start|>system
You are an AI assistant that executes function calls, and these are the tools at your disposal:
{{range .Functions}}
{'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
{{end}}
<|im_end|>
{{.Input -}}
<|im_start|>assistant
UseTokenizerTemplate:false JoinChatMessagesByCharacter: Multimodal: ReplyPrefix:} KnownUsecaseStrings:[FLAG_CHAT FLAG_COMPLETION] KnownUsecases:0xc001d73358 Pipeline:{TTS: LLM: Transcription: VAD:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:name,arguments SchemaType: GrammarTriggers:[]} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[(?s)(.*?)] ArgumentRegex:[] ArgumentRegexKey: ArgumentRegexValue: ReplaceFunctionResults:[] ReplaceLLMResult:[{Key:(?s)(.*?) Value:}] CaptureLLMResult:[(?s)(.*?)] FunctionNameKey: FunctionArgumentsKey:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000c4ed98 MirostatTAU:0xc000c4ed90 Mirostat:0xc000c4ed88 NGPULayers:0xc0018d8558 MMap:0xc000c4ed2c MMlock:0xc000c4edb9 LowVRAM:0xc000c4edb9 Reranking:0xc000c4edb9 Grammar: StopWords:[<|im_end|> ] Cutstrings:[] ExtractRegex:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc000c4ed18 NUMA:false LoraAdapter: LoraBase: LoraAdapters:[] LoraScales:[] LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: LoadFormat: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 DisableLogStatus:false DType: LimitMMPerPrompt:{LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0} MMProj: FlashAttention: NoKVOffloading:false CacheTypeK: CacheTypeV: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 CFGScale:0} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: AudioPath:} CUDA:false DownloadFiles:[] Description: Usage: Options:[gpu] Overrides:[] MCP:{Servers: Stdio:} Agent:{MaxAttempts:0 MaxIterations:0 EnableReasoning:false EnablePlanning:false EnableMCPPrompts:false EnablePlanReEvaluator:false}}
4:53AM DBG templated message for chat: <|im_start|>user
what's the capital of germany?
<|im_end|>
4:53AM DBG Prompt (before templating): <|im_start|>user
what's the capital of germany?
<|im_end|>
4:53AM DBG Template found, input modified to: <|im_start|>user
what's the capital of germany?
<|im_end|>
<|im_start|>assistant
4:53AM DBG Prompt (after templating): <|im_start|>user
what's the capital of germany?
<|im_end|>
<|im_start|>assistant
4:53AM DBG Stream request received
4:53AM INF BackendLoader starting backend=llama-cpp modelID=localai-functioncall-qwen2.5-7b-v0.5 o.model=localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
4:53AM DBG Loading model in memory from file: /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf
4:53AM DBG Loading Model localai-functioncall-qwen2.5-7b-v0.5 with gRPC (file: /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf) (backend: llama-cpp): {backendString:llama-cpp model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf modelID:localai-functioncall-qwen2.5-7b-v0.5 context:{emptyCtx:{}} gRPCOptions:0xc0001f4f08 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 parallelRequests:false}
4:53AM DBG Loading external backend: /backends/vulkan-llama-cpp/run.sh
4:53AM DBG external backend is file: &{name:run.sh size:1480 mode:493 modTime:{wall:0 ext:63899789315 loc:0x4c9f5a0} sys:{Dev:2304 Ino:26364411 Nlink:1 Mode:33261 Uid:0 Gid:0 X__pad0:0 Rdev:0 Size:1480 Blksize:4096 Blocks:8 Atim:{Sec:1765560776 Nsec:387102185} Mtim:{Sec:1764192515 Nsec:0} Ctim:{Sec:1765560635 Nsec:161937846} X__unused:[0 0 0]}}
4:53AM DBG Sending chunk: {"created":1765601636,"object":"chat.completion.chunk","id":"26df1d0b-3b01-45ac-8728-166bce48d3e7","model":"localai-functioncall-qwen2.5-7b-v0.5","choices":[{"index":0,"finish_reason":null,"delta":{"role":"assistant","content":null}}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
4:53AM DBG Loading GRPC Process: /backends/vulkan-llama-cpp/run.sh
4:53AM DBG GRPC Service for localai-functioncall-qwen2.5-7b-v0.5 will be running at: '127.0.0.1:36539'
4:53AM DBG GRPC Service state dir: /tmp/go-processmanager3499901118
4:53AM DBG GRPC Service Started
4:53AM DBG Wait for the service to start up
4:53AM DBG Options: ContextSize:4096 Seed:1161677669 NBatch:512 F16Memory:true MMap:true NGPULayers:99999999 Threads:8 FlashAttention:"auto" Options:"gpu"
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr +++ realpath run.sh
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ++ dirname /backends/vulkan-llama-cpp/run.sh
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + CURDIR=/backends/vulkan-llama-cpp
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + cd /
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU info:'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU info:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -e 'model\sname' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + head -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout model name : AMD Ryzen 7 5800X 8-Core Processor
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -e flags /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + head -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap ibpb_exit_to_user
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-fallback
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU: AVX found OK
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU: AVX found OK'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -e /backends/vulkan-llama-cpp/llama-cpp-avx ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-avx
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx2\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout CPU: AVX2 found OK
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'CPU: AVX2 found OK'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -e /backends/vulkan-llama-cpp/llama-cpp-avx2 ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + BINARY=llama-cpp-avx2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + grep -q -e '\savx512f\s' /proc/cpuinfo
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -n '' ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ++ uname
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' Linux == Darwin ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + export LD_LIBRARY_PATH=/backends/vulkan-llama-cpp/lib:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Using lib/ld.so
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Using binary: llama-cpp-avx2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + LD_LIBRARY_PATH=/backends/vulkan-llama-cpp/lib:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + '[' -f /backends/vulkan-llama-cpp/lib/ld.so ']'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'Using lib/ld.so'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + echo 'Using binary: llama-cpp-avx2'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr + exec /backends/vulkan-llama-cpp/lib/ld.so /backends/vulkan-llama-cpp/llama-cpp-avx2 --addr 127.0.0.1:36539
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923461 36 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache, work_serializer_dispatch
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923621 36 ev_epoll1_linux.cc:125] grpc epoll fd: 3
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.923741 36 server_builder.cc:392] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.924799 36 ev_epoll1_linux.cc:359] grpc epoll fd: 5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr I0000 00:00:1765601636.925078 36 tcp_socket_utils.cc:634] TCP_USER_TIMEOUT is available. TCP_USER_TIMEOUT will be used thereafter
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stdout Server listening on 127.0.0.1:36539
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr start_llama_server: starting llama server
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr start_llama_server: waiting for model to be loaded
4:53AM DBG GRPC Service Ready
4:53AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:0xc0007f3958} sizeCache:0 unknownFields:[] Model:localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf ContextSize:4096 Seed:1161677669 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: DisableLogStatus:false DType: LimitImagePerPrompt:0 LimitVideoPerPrompt:0 LimitAudioPerPrompt:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:auto NoKVOffload:false ModelPath://models LoraAdapters:[] LoraScales:[] Options:[gpu] CacheTypeKey: CacheTypeValue: GrammarTriggers:[] Reranking:false Overrides:[]}
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr build: 7157 (583cb8341) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr system info: n_threads = 8, n_threads_batch = -1, total_threads = 16
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr system_info: n_threads = 8 / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr srv load_model: loading model '/models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon RX 6600 (RADV NAVI23)) (0000:0d:00.0) - 6980 MiB free
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: loaded meta data with 34 key-value pairs and 339 tensors from /models/localai-functioncall-qwen2.5-7b-v0.5-q4_k_m.gguf (version GGUF V3 (latest))
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 0: general.architecture str = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 1: general.type str = model
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 2: general.name str = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 3: general.version str = v0.5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 4: general.organization str = Unsloth
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 5: general.finetune str = instruct-unsloth-bnb-4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 6: general.basename str = qwen2.5
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 7: general.size_label str = 7B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 8: general.license str = apache-2.0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 9: general.base_model.count u32 = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 10: general.base_model.0.name str = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 11: general.base_model.0.organization str = Unsloth
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/unsloth/qwen2....
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 13: general.tags arr[str,6] = ["text-generation-inference", "transf...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 14: general.languages arr[str,1] = ["en"]
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 15: qwen2.block_count u32 = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 24: tokenizer.ggml.pre str = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 151645
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 151654
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = false
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 31: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 32: general.quantization_version u32 = 2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - kv 33: general.file_type u32 = 15
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type f32: 141 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type q4_K: 169 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_model_loader: - type q6_K: 29 tensors
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file format = GGUF V3 (latest)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file type = Q4_K - Medium
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: file size = 4.36 GiB (4.91 BPW)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: printing all EOG tokens:
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151643 ('<|endoftext|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151645 ('<|im_end|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151662 ('<|fim_pad|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151663 ('<|repo_name|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: - 151664 ('<|file_sep|>')
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: special tokens cache size = 22
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load: token to piece cache size = 0.9310 MB
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: arch = qwen2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: vocab_only = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ctx_train = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_inp = 3584
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_layer = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_head = 28
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_head_kv = 4
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_rot = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_swa = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: is_swa_any = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_head_k = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_head_v = 128
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_gqa = 7
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_k_gqa = 512
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_embd_v_gqa = 512
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_norm_eps = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_norm_rms_eps = 1.0e-06
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_clamp_kqv = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_max_alibi_bias = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_logit_scale = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: f_attn_scale = 0.0e+00
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ff = 18944
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert_used = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_expert_groups = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_group_used = 0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: causal attn = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: pooling type = -1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope type = 2
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope scaling = linear
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: freq_base_train = 1000000.0
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: freq_scale_train = 1
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_ctx_orig_yarn = 32768
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: rope_finetuned = unknown
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: model type = 7B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: model params = 7.62 B
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: general.name = Qwen2.5 7b Instruct Unsloth Bnb 4bit
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: vocab type = BPE
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_vocab = 152064
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: n_merges = 151387
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: BOS token = 11 ','
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOS token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOT token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: PAD token = 151654 '<|vision_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: LF token = 198 'Ċ'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM PRE token = 151659 '<|fim_prefix|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM SUF token = 151661 '<|fim_suffix|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM MID token = 151660 '<|fim_middle|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM PAD token = 151662 '<|fim_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM REP token = 151663 '<|repo_name|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: FIM SEP token = 151664 '<|file_sep|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151643 '<|endoftext|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151645 '<|im_end|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151662 '<|fim_pad|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151663 '<|repo_name|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: EOG token = 151664 '<|file_sep|>'
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr print_info: max token length = 256
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: loading model tensors, this can take a while... (mmap = true)
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloading 28 repeating layers to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloading output layer to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: offloaded 29/29 layers to GPU
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: CPU_Mapped model buffer size = 292.36 MiB
4:53AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr load_tensors: Vulkan0 model buffer size = 4168.09 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr ....................................................................................
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: constructing llama_context
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_seq_max = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx_seq = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_batch = 512
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ubatch = 512
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: causal_attn = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: flash_attn = auto
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: kv_unified = false
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: freq_base = 1000000.0
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: freq_scale = 1
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: n_ctx_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan_Host output buffer size = 0.58 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_kv_cache: Vulkan0 KV buffer size = 224.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_kv_cache: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Flash Attention was auto, set to enabled
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan0 compute buffer size = 304.00 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: Vulkan_Host compute buffer size = 15.02 MiB
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: graph nodes = 959
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr llama_context: graph splits = 2
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|endoftext|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|im_end|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|fim_pad|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|repo_name|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: added <|file_sep|> logit bias = -inf
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
4:54AM DBG GRPC(localai-functioncall-qwen2.5-7b-v0.5-127.0.0.1:36539): stderr free(): invalid pointer
4:54AM ERR Failed to load model localai-functioncall-qwen2.5-7b-v0.5 with backend llama-cpp error="failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" modelID=localai-functioncall-qwen2.5-7b-v0.5
4:54AM DBG No choices in the response, skipping
4:54AM DBG No choices in the response, skipping
4:54AM DBG No choices in the response, skipping
4:54AM ERR Stream ended with error: failed to load model with internal loader: could not load model: rpc error: code = Unavailable desc = error reading from server: EOF
4:54AM INF HTTP request method=POST path=/v1/chat/completions status=200
Additional context
There is a corresponding llama.cpp issue: ggml-org/llama.cpp#17561
According to that issue, the fix appears to be updating to a newer Mesa version (which probably requires moving the container to a newer base image).
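To confirm the version gap, the Mesa packages inside the container can be compared against the host's. This is a hedged sketch; the container name is a placeholder and a Debian/Ubuntu base with dpkg is assumed:

# Mesa version inside the LocalAI container (container name is hypothetical)
docker exec <localai-container> dpkg -l | grep -i mesa
# Mesa version on the host
dpkg -l | grep -i mesa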
Workaround
Instead of Qwen, use a different model family such as Gemma (e.g. gemma-3-4b-it works); see the example request below.
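As a concrete illustration (assuming gemma-3-4b-it is already installed and LocalAI listens on the default port 8080), the same chat completion request succeeds when pointed at the Gemma model:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-3-4b-it", "messages": [{"role": "user", "content": "what is the capital of germany?"}]}'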