Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
version: 7360 (53ecd4fdb)
built with GNU 13.3.0 for Linux aarch64
NVIDIA DGX Spark Version 7.3.1 (GNU/Linux 6.14.0-1013-nvidia aarch64)
Operating systems
Linux
GGML backends
CUDA
Hardware
H/W path Device Class Description
========================================================
system HP ZGX Nano G1n AI Station (0000)
/0 bus 8EA3
/0/1 memory 64KiB BIOS
/0/a memory 64KiB L1 cache
/0/b memory 64KiB L1 cache
/0/c memory 512KiB L2 cache
/0/d memory 8MiB L3 cache
/0/e processor (Spark)
/0/10 memory 128GiB System Memory
/0/10/0 memory 128GiB Chip 8533 MHz (0.1 ns)
20 of these CPU cores:
01: None 00.0: 10103 CPU
[Created at cpu.343]
Unique ID: rdCR.j8NaKXDZtZ6
Hardware Class: cpu
Arch: AArch64
Vendor: "ARM Limited"
Model: 0.1.0 ""
Features: fp,asimd,evtstrm,aes,pmull,sha1,sha2,crc32,atomics,fphp,asimdhp,cpuid,asimdrdm,jscvt,fcma,lrcpc,dcpop,sha3,sm3,sm4,asimddp,sha512,sve,asimdfhm,dit,uscat,ilrcpc,flagm,sb,paca,pacg,dcpodp,sve2,sveaes,svepmull,svebitperm,svesha3,svesm4,flagm2,frint,svei8mm,svebf16,i8mm,bf16,dgh,bti,ecv,afp,wfxt
BogoMips: 2000.00
Config Status: cfg=new, avail=yes, need=no, active=unknown
Models
Test Case 1
Model:
https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF (the base model used in the reproduction command below)
Loras:
https://huggingface.co/itaprac/loraFT_test-F16-GGUF/blob/main/loraFT_test-f16.gguf
https://huggingface.co/salchint/lynx-8b-Q8_0-GGUF/blob/main/lynx-8b-q8_0.gguf
https://huggingface.co/salchint/valadapt-meta-llama-3.1-8b-german-Q8_0-GGUF/blob/main/valadapt-meta-llama-3.1-8b-german-q8_0.gguf
https://huggingface.co/salchint/template-adapter-meta-llama-Llama-3.1-8B-legalbench-10-Q8_0-GGUF/blob/main/template-adapter-meta-llama-Llama-3.1-8B-legalbench-10-q8_0.gguf
Test Case 2
Model:
https://huggingface.co/premrajreddy/Home-TinyLlama-1.1B-HomeAssist-GGUF
Loras:
https://huggingface.co/salchint/ratatouille-0.1-tinyllama-1.1B-F16-GGUF
https://huggingface.co/salchint/tinyllama_review_summary_adapter_v1-F16-GGUF
https://huggingface.co/salchint/Tukan-1.1B-Chat-v0.1-F16-GGUF
Problem description & steps to reproduce
llama-server aborts if I start it with more than one --lora parameter. I read that I should use a comma-separated list, but since that did not work, I simply passed multiple --lora options. Command-line parsing appears to handle this fine: the server's logs show all of my LoRA adapters being loaded.
The problem occurs when the graph is built and the adapter weights are calculated and applied: the server aborts because the context's memory pool is exhausted. The call stack differs slightly between loading two LoRAs and loading more than two, but in either case llm_graph_context::build_lora_mm() and the tensor operations performed there trigger the problem.
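For context, here is a rough sketch of how I understand build_lora_mm() to work (simplified, not the upstream code; lora_entry and build_lora_mm_sketch are names I made up). Each active adapter adds two extra mat-muls plus a scale and an add for every weight matrix it touches, so each additional --lora grows the graph node count:

```cpp
#include "ggml.h"
#include <vector>

struct lora_entry {            // hypothetical holder for one adapter's factors
    ggml_tensor * a;           // low-rank factor A
    ggml_tensor * b;           // low-rank factor B
    float         scale;       // roughly alpha / rank * user scale
};

static ggml_tensor * build_lora_mm_sketch(ggml_context * ctx,
                                          ggml_tensor  * w,
                                          ggml_tensor  * cur,
                                          const std::vector<lora_entry> & loras) {
    ggml_tensor * res = ggml_mul_mat(ctx, w, cur);        // base  W  * x
    for (const auto & l : loras) {
        ggml_tensor * ab = ggml_mul_mat(ctx, l.a, cur);   //        A  * x
        ab  = ggml_mul_mat(ctx, l.b, ab);                 //  B  * (A  * x)
        ab  = ggml_scale(ctx, ab, l.scale);               // apply adapter scale
        res = ggml_add(ctx, res, ab);                     // accumulate the delta
    }
    return res;
}
```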
Reproduction
./llama-server -c 132000 -n 1024 -np 1 --port 8081 --lora /models/loras/loraFT_test-f16.gguf --lora /models/loras/cybersecurity/Llama-3.1-8B-cybersecurity.gguf --lora /models/loras/lynx-8b-q8_0.gguf --lora /models/loras/smartcontract-auditor-llama3.1-8b-adapter-f16.gguf --lora /models/loras/template-adapter-meta-llama-Llama-3.1-8B-legalbench-10-q8_0.gguf --lora /models/loras/valadapt-meta-llama-3.1-8b-german-q8_0.gguf --hf-repo bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --hf-file Meta-Llama-3.1-8B-Instruct-f32.gguf
Observed behavior
llama-server aborts after loading all LoRA adapters, during the warm-up run. The crucial log lines (I think) are these:
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/code/llama.cpp/ggml/src/ggml.c:1700: GGML_ASSERT(obj_new) failed
ggml_new_object: not enough space in the context's memory pool (needed 997056, available 996688)
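My reading of that message, as a minimal self-contained illustration of the mechanism (not llama.cpp code; the sizes are arbitrary): the graph's ggml context is given a fixed metadata budget up front, and once the extra per-adapter tensors push the object count past that budget, ggml_new_tensor() trips GGML_ASSERT(obj_new) with exactly this error:

```cpp
#include "ggml.h"

int main() {
    ggml_init_params params = {
        /*.mem_size   =*/ 16 * ggml_tensor_overhead(),  // metadata for roughly 16 tensors
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,                         // metadata only, no data buffers
    };
    ggml_context * ctx = ggml_init(params);

    // Once the pool is exhausted this aborts with
    // "ggml_new_object: not enough space in the context's memory pool".
    for (int i = 0; i < 32; ++i) {
        ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);
    }

    ggml_free(ctx);
}
```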
Expected behavior
llama-server should not abort; all specified LoRA adapters should be loaded and applied when inference is requested (e.g. via the POST /completion endpoint).
First Bad Commit
As far as I can tell, using multiple LoRA adapters has never worked.
I traced it back to commit 2048b5913d51. I was not able to test versions before that one because they did not compile.
Relevant log output
# build/bin/RelWithDebInfo/llama-server -c 132000 -n 1024 -np 1 --port 8081 --lora /models/loras/loraFT_test-f16.gguf --lora /models/loras/cybersecurity/Llama-3.1-8B-cybersecurity.gguf --lora /models/loras/lynx-8b-q8_0.gguf --lora /models/loras/smartcontract-auditor-llama3.1-8b-adapter-f16.gguf --lora /models/loras/template-adapter-meta-llama-Llama-3.1-8B-legalbench-10-q8_0.gguf --lora /models/loras/valadapt-meta-llama-3.1-8b-german-q8_0.gguf --hf-repo bartowski/Meta-Llama-3.1-8B-Instruct-GGUF --hf-file Meta-Llama-3.1-8B-Instruct-f32.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes
common_download_file_single_online: no previous model file found /root/.cache/llama.cpp/bartowski_Meta-Llama-3.1-8B-Instruct-GGUF_Meta-Llama-3.1-8B-Instruct-f32.gguf
common_download_file_single_online: trying to download model from https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-f32.gguf to /root/.cache/llama.cpp/bartowski_Meta-Llama-3.1-8B-Instruct-GGUF_Meta-Llama-3.1-8B-Instruct-f32.gguf.downloadInProgress (server_etag:"1a15f7ed175518e3af8dac732eecf9ddf79b8c1762978ab060f2b8701a571dfd", server_last_modified:)...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1357 100 1357 0 0 24664 0 --:--:-- --:--:-- --:--:-- 24664
100 29.9G 100 29.9G 0 0 105M 0 0:04:50 0:04:50 --:--:-- 111M
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7360 (53ecd4fdb) with GNU 13.3.0 for Linux aarch64
system info: n_threads = 20, n_threads_batch = 20, total_threads = 20
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 1210 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/root/.cache/llama.cpp/bartowski_Meta-Llama-3.1-8B-Instruct-GGUF_Meta-Llama-3.1-8B-Instruct-f32.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GB10) (000f:01:00.0) - 116068 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /root/.cache/llama.cpp/bartowski_Meta-Llama-3.1-8B-Instruct-GGUF_Meta-Llama-3.1-8B-Instruct-f32.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 0
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 292 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = all F32
print_info: file size = 29.92 GiB (32.00 BPW)
load: printing all EOG tokens:
load: - 128001 ('<|end_of_text|>')
load: - 128008 ('<|eom_id|>')
load: - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: model type = 8B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 2004.00 MiB
load_tensors: CUDA0 model buffer size = 28629.02 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 132096
llama_context: n_ctx_seq = 132096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (132096) > n_ctx_train (131072) -- possible training context overflow
llama_context: CUDA_Host output buffer size = 1.96 MiB
llama_kv_cache: CUDA0 KV buffer size = 16512.00 MiB
llama_kv_cache: size = 16512.00 MiB (132096 cells, 32 layers, 4/1 seqs), K (f16): 8256.00 MiB, V (f16): 8256.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 419.01 MiB
llama_context: CUDA_Host compute buffer size = 266.01 MiB
llama_context: graph nodes = 999
llama_context: graph splits = 2
llama_adapter_lora_init_impl: loading lora adapter from '/models/loras/loraFT_test-f16.gguf' ...
llama_adapter_lora_init_impl: Dumping metadata keys/values.
llama_adapter_lora_init_impl: - kv 0: general.architecture str = llama
llama_adapter_lora_init_impl: - kv 1: general.type str = adapter
llama_adapter_lora_init_impl: - kv 2: adapter.type str = lora
llama_adapter_lora_init_impl: - kv 3: general.name str = loraFT_test
llama_adapter_lora_init_impl: - kv 4: general.base_model.count u32 = 1
llama_adapter_lora_init_impl: - kv 5: general.base_model.0.name str = Llama 3.1 8B
llama_adapter_lora_init_impl: - kv 6: general.base_model.0.organization str = Meta Llama
llama_adapter_lora_init_impl: - kv 7: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_adapter_lora_init_impl: - kv 8: general.tags arr[str,6] = ["base_model:adapter:meta-llama/Llama...
llama_adapter_lora_init_impl: - kv 9: adapter.lora.alpha f32 = 16.000000
llama_adapter_lora_init_impl: - kv 10: general.quantization_version u32 = 2
llama_adapter_lora_init_impl: CUDA0 LoRA buffer size = 160.00 MiB
llama_adapter_lora_init_impl: loaded 448 tensors from lora file
llama_adapter_lora_init_impl: loading lora adapter from '/models/loras/cybersecurity/Llama-3.1-8B-cybersecurity.gguf' ...
llama_adapter_lora_init_impl: Dumping metadata keys/values.
llama_adapter_lora_init_impl: - kv 0: general.architecture str = llama
llama_adapter_lora_init_impl: - kv 1: general.type str = adapter
llama_adapter_lora_init_impl: - kv 2: adapter.type str = lora
llama_adapter_lora_init_impl: - kv 3: general.name str = Fine_Tuned_Llama3_Ioc2
llama_adapter_lora_init_impl: - kv 4: general.base_model.count u32 = 1
llama_adapter_lora_init_impl: - kv 5: general.base_model.0.name str = Meta Llama 3.1 8B
llama_adapter_lora_init_impl: - kv 6: general.base_model.0.organization str = Meta Llama
llama_adapter_lora_init_impl: - kv 7: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met...
llama_adapter_lora_init_impl: - kv 8: adapter.lora.alpha f32 = 64.000000
llama_adapter_lora_init_impl: - kv 9: llama.block_count u32 = 32
llama_adapter_lora_init_impl: - kv 10: llama.context_length u32 = 131072
llama_adapter_lora_init_impl: - kv 11: llama.embedding_length u32 = 4096
llama_adapter_lora_init_impl: - kv 12: llama.feed_forward_length u32 = 14336
llama_adapter_lora_init_impl: - kv 13: llama.attention.head_count u32 = 32
llama_adapter_lora_init_impl: - kv 14: llama.attention.head_count_kv u32 = 8
llama_adapter_lora_init_impl: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_adapter_lora_init_impl: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_adapter_lora_init_impl: - kv 17: general.file_type u32 = 1
llama_adapter_lora_init_impl: - kv 18: llama.vocab_size u32 = 128256
llama_adapter_lora_init_impl: - kv 19: llama.rope.dimension_count u32 = 128
llama_adapter_lora_init_impl: - kv 20: tokenizer.ggml.model str = gpt2
llama_adapter_lora_init_impl: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_adapter_lora_init_impl: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_adapter_lora_init_impl: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_adapter_lora_init_impl: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_adapter_lora_init_impl: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_adapter_lora_init_impl: - kv 26: tokenizer.ggml.eos_token_id u32 = 128001
llama_adapter_lora_init_impl: - kv 27: general.quantization_version u32 = 2
llama_adapter_lora_init_impl: CUDA0 LoRA buffer size = 320.00 MiB
llama_adapter_lora_init_impl: loaded 448 tensors from lora file
llama_adapter_lora_init_impl: loading lora adapter from '/models/loras/lynx-8b-q8_0.gguf' ...
llama_adapter_lora_init_impl: Dumping metadata keys/values.
llama_adapter_lora_init_impl: - kv 0: general.architecture str = llama
llama_adapter_lora_init_impl: - kv 1: general.type str = adapter
llama_adapter_lora_init_impl: - kv 2: adapter.type str = lora
llama_adapter_lora_init_impl: - kv 3: general.name str = meta-llama/Llama-3.1-8B-Instruct
llama_adapter_lora_init_impl: - kv 4: general.basename str = lynx
llama_adapter_lora_init_impl: - kv 5: general.size_label str = 8B
llama_adapter_lora_init_impl: - kv 6: general.license str = apache-2.0
llama_adapter_lora_init_impl: - kv 7: general.base_model.count u32 = 1
llama_adapter_lora_init_impl: - kv 8: general.base_model.0.name str = Meta Llama 3.1 8B Instruct
llama_adapter_lora_init_impl: - kv 9: general.base_model.0.organization str = Meta Llama
llama_adapter_lora_init_impl: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met...
llama_adapter_lora_init_impl: - kv 11: general.dataset.count u32 = 1
llama_adapter_lora_init_impl: - kv 12: general.dataset.0.name str = CodeAlpaca 20k
llama_adapter_lora_init_impl: - kv 13: general.dataset.0.organization str = Sahil2801
llama_adapter_lora_init_impl: - kv 14: general.dataset.0.repo_url str = https://huggingface.co/sahil2801/Code...
llama_adapter_lora_init_impl: - kv 15: general.tags arr[str,1] = ["text-generation"]
llama_adapter_lora_init_impl: - kv 16: general.languages arr[str,1] = ["en"]
llama_adapter_lora_init_impl: - kv 17: adapter.lora.alpha f32 = 16.000000
llama_adapter_lora_init_impl: - kv 18: general.quantization_version u32 = 2
llama_adapter_lora_init_impl: CUDA0 LoRA buffer size = 62.19 MiB
llama_adapter_lora_init_impl: loaded 448 tensors from lora file
llama_adapter_lora_init_impl: loading lora adapter from '/models/loras/smartcontract-auditor-llama3.1-8b-adapter-f16.gguf' ...
llama_adapter_lora_init_impl: Dumping metadata keys/values.
llama_adapter_lora_init_impl: - kv 0: general.architecture str = llama
llama_adapter_lora_init_impl: - kv 1: general.type str = adapter
llama_adapter_lora_init_impl: - kv 2: adapter.type str = lora
llama_adapter_lora_init_impl: - kv 3: general.name str = meta-llama/Llama-3.1-8B
llama_adapter_lora_init_impl: - kv 4: general.finetune str = adapter
llama_adapter_lora_init_impl: - kv 5: general.basename str = smartcontract-auditor-llama3.1
llama_adapter_lora_init_impl: - kv 6: general.size_label str = 8B
llama_adapter_lora_init_impl: - kv 7: general.license str = apache-2.0
llama_adapter_lora_init_impl: - kv 8: general.base_model.count u32 = 1
llama_adapter_lora_init_impl: - kv 9: general.base_model.0.name str = Meta Llama 3.1 8B
llama_adapter_lora_init_impl: - kv 10: general.base_model.0.organization str = Unsloth
llama_adapter_lora_init_impl: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/unsloth/Meta-L...
llama_adapter_lora_init_impl: - kv 12: general.tags arr[str,5] = ["text-generation-inference", "transf...
llama_adapter_lora_init_impl: - kv 13: general.languages arr[str,1] = ["en"]
llama_adapter_lora_init_impl: - kv 14: adapter.lora.alpha f32 = 16.000000
llama_adapter_lora_init_impl: - kv 15: general.quantization_version u32 = 2
llama_adapter_lora_init_impl: CUDA0 LoRA buffer size = 640.00 MiB
llama_adapter_lora_init_impl: loaded 448 tensors from lora file
llama_adapter_lora_init_impl: loading lora adapter from '/models/loras/template-adapter-meta-llama-Llama-3.1-8B-legalbench-10-q8_0.gguf' ...
llama_adapter_lora_init_impl: Dumping metadata keys/values.
llama_adapter_lora_init_impl: - kv 0: general.architecture str = llama
llama_adapter_lora_init_impl: - kv 1: general.type str = adapter
llama_adapter_lora_init_impl: - kv 2: adapter.type str = lora
llama_adapter_lora_init_impl: - kv 3: general.name str = meta-llama/Llama-3.1-8B
llama_adapter_lora_init_impl: - kv 4: general.version str = 10
llama_adapter_lora_init_impl: - kv 5: general.finetune str = legalbench
llama_adapter_lora_init_impl: - kv 6: general.basename str = template-adapter-meta-llama-Llama-3.1
llama_adapter_lora_init_impl: - kv 7: general.size_label str = 8B
llama_adapter_lora_init_impl: - kv 8: general.license str = llama3.1
llama_adapter_lora_init_impl: - kv 9: general.base_model.count u32 = 1
llama_adapter_lora_init_impl: - kv 10: general.base_model.0.name str = Llama 3.1 8B
llama_adapter_lora_init_impl: - kv 11: general.base_model.0.organization str = Meta Llama
llama_adapter_lora_init_impl: - kv 12: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Lla...
llama_adapter_lora_init_impl: - kv 13: general.tags arr[str,1] = ["generated_from_trainer"]
llama_adapter_lora_init_impl: - kv 14: adapter.lora.alpha f32 = 16.000000
llama_adapter_lora_init_impl: - kv 15: general.quantization_version u32 = 2
llama_adapter_lora_init_impl: CUDA0 LoRA buffer size = 28.03 MiB
llama_adapter_lora_init_impl: loaded 384 tensors from lora file
llama_adapter_lora_init_impl: loading lora adapter from '/models/loras/valadapt-meta-llama-3.1-8b-german-q8_0.gguf' ...
llama_adapter_lora_init_impl: Dumping metadata keys/values.
llama_adapter_lora_init_impl: - kv 0: general.architecture str = llama
llama_adapter_lora_init_impl: - kv 1: general.type str = adapter
llama_adapter_lora_init_impl: - kv 2: adapter.type str = lora
llama_adapter_lora_init_impl: - kv 3: general.name str = meta-llama/Llama-3.1-8B
llama_adapter_lora_init_impl: - kv 4: general.finetune str = german
llama_adapter_lora_init_impl: - kv 5: general.basename str = valadapt-meta-llama-3.1
llama_adapter_lora_init_impl: - kv 6: general.size_label str = 8B
llama_adapter_lora_init_impl: - kv 7: general.base_model.count u32 = 1
llama_adapter_lora_init_impl: - kv 8: general.base_model.0.name str = Meta Llama 3.1 8B
llama_adapter_lora_init_impl: - kv 9: general.base_model.0.organization str = Meta Llama
llama_adapter_lora_init_impl: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/meta-llama/Met...
llama_adapter_lora_init_impl: - kv 11: adapter.lora.alpha f32 = 16.000000
llama_adapter_lora_init_impl: - kv 12: general.quantization_version u32 = 2
llama_adapter_lora_init_impl: CUDA0 LoRA buffer size = 27.66 MiB
llama_adapter_lora_init_impl: loaded 128 tensors from lora file
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: added <|eom_id|> logit bias = -inf
common_init_from_params: added <|eot_id|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 132096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/code/llama.cpp/ggml/src/ggml.c:1700: GGML_ASSERT(obj_new) failed
ggml_new_object: not enough space in the context's memory pool (needed 997056, available 996688)
Registered pretty printers for STL classes
[New LWP 973303]
[New LWP 973302]
[New LWP 973301]
[New LWP 973300]
[New LWP 973299]
[New LWP 973298]
[New LWP 973297]
[New LWP 973296]
[New LWP 973295]
[New LWP 973294]
[New LWP 973293]
[New LWP 973292]
[New LWP 973291]
[New LWP 973290]
[New LWP 973289]
[New LWP 973288]
[New LWP 973287]
[New LWP 973286]
[New LWP 973285]
[New LWP 973284]
[New LWP 973255]
[New LWP 973253]
[New LWP 973249]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
warning: could not find '.gnu_debugaltlink' file for /lib/aarch64-linux-gnu/liblber.so.2
warning: could not find '.gnu_debugaltlink' file for /lib/aarch64-linux-gnu/libbrotlidec.so.1
warning: could not find '.gnu_debugaltlink' file for /lib/aarch64-linux-gnu/libbrotlicommon.so.1
warning: could not find '.gnu_debugaltlink' file for /lib/aarch64-linux-gnu/libnss_mdns4_minimal.so.2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000f7ad4ae07b74 in __GI___wait4 (pid=pid@entry=973304, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0 0x0000f7ad4ae07b74 in __GI___wait4 (pid=pid@entry=973304, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000f7ad4ae07ce4 in __GI___waitpid (pid=pid@entry=973304, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38 ./posix/waitpid.c: No such file or directory
#2 0x0000f7ad4b28464c in ggml_print_backtrace () at /code/llama.cpp/ggml/src/ggml.c:217
217 waitpid(child_pid, NULL, 0);
#3 0x0000f7ad4b2847f0 in ggml_abort (file=file@entry=0xf7ad4b2c55c0 "/code/llama.cpp/ggml/src/ggml.c", line=line@entry=1700, fmt=fmt@entry=0xf7ad4b2c55a8 "GGML_ASSERT(%s) failed") at /code/llama.cpp/ggml/src/ggml.c:251
251 ggml_print_backtrace();
#4 0x0000f7ad4b2853a4 in ggml_new_tensor_impl (ctx=ctx@entry=0xc1254c153c50, type=type@entry=GGML_TYPE_F32, n_dims=n_dims@entry=4, ne=ne@entry=0xffffc1bff878, view_src=<optimized out>, view_src@entry=0x0, view_offs=view_offs@entry=0) at /code/llama.cpp/ggml/src/ggml.c:1700
1700 GGML_ASSERT(obj_new);
#5 0x0000f7ad4b285dfc in ggml_new_tensor (ctx=ctx@entry=0xc1254c153c50, type=type@entry=GGML_TYPE_F32, n_dims=n_dims@entry=4, ne=ne@entry=0xffffc1bff878) at /code/llama.cpp/ggml/src/ggml.c:1744
1744 return ggml_new_tensor_impl(ctx, type, n_dims, ne, NULL, 0);
#6 0x0000f7ad4b2884f4 in ggml_mul_mat (ctx=0xc1254c153c50, a=0xc1254d028fa0, b=0xc1254bf5a730) at /code/llama.cpp/ggml/src/ggml.c:3182
3182 struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
#7 0x0000f7ad4b3ff2cc in llm_graph_context::build_lora_mm (this=this@entry=0xc1254b96e680, w=0xc1254f8b1870, cur=cur@entry=0xc1254bf5a730) at /code/llama.cpp/src/llama-graph.cpp:626
626 ggml_tensor * ab_cur = ggml_mul_mat(
#8 0x0000f7ad4b4cef28 in llm_build_llama::llm_build_llama (this=0xc1254b96e680, model=..., params=...) at /usr/include/c++/13/bits/stl_vector.h:1145
1145 operator[](size_type __n) const _GLIBCXX_NOEXCEPT
#9 0x0000f7ad4b469f84 in std::make_unique<llm_build_llama, llama_model const&, llm_graph_params const&> () at /usr/include/c++/13/bits/unique_ptr.h:1070
1070 { return unique_ptr<_Tp>(new _Tp(std::forward<_Args>(__args)...)); }
#10 0x0000f7ad4b43b898 in llama_model::build_graph (this=0xc1254b7f6970, params=...) at /code/llama.cpp/src/llama-model.cpp:7155
7155 llm = std::make_unique<llm_build_llama>(*this, params);
#11 0x0000f7ad4b3d2f0c in llama_context::process_ubatch (this=this@entry=0xc1254b863fd0, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0xc1254b8802b0, ret=@0xffffc1c03a34: GGML_STATUS_SUCCESS) at /code/llama.cpp/src/llama-context.cpp:775
775 gf = model.build_graph(gparams);
#12 0x0000f7ad4b3d94e8 in llama_context::decode (this=0xc1254b863fd0, batch_inp=...) at /code/llama.cpp/src/llama-context.cpp:1105
1105 const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#13 0x0000f7ad4b3da120 in llama_decode (ctx=<optimized out>, batch=<error reading variable: Cannot access memory at address 0x0>) at /code/llama.cpp/src/llama-context.cpp:2772
2772 const int ret = ctx->decode(batch);
#14 0x0000c12528e17ff0 in common_init_from_params (params=...) at /code/llama.cpp/common/common.cpp:1236
1236 llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch)));
#15 0x0000c12528d1ab98 in server_context_impl::load_model (this=0xc1254b724760, params=...) at /code/llama.cpp/tools/server/server-context.cpp:581
581 llama_init = common_init_from_params(params_base);
#16 0x0000c12528cfba4c in server_context::load_model (this=this@entry=0xffffc1c05e28, params=...) at /code/llama.cpp/tools/server/server-context.cpp:2651
2651 return impl->load_model(params);
#17 0x0000c12528c608fc in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /code/llama.cpp/tools/server/server.cpp:236
236 if (!ctx_server.load_model(params)) {
[Inferior 1 (process 973248) detached]
Aborted (core dumped)