Name and Version

build: 4451 (d9feae1) with MSVC 19.29.30157.0 for

Operating systems

Windows

GGML backends

CUDA

Hardware

Ryzen 7950x3d + RTX 3090

Models

phi 4

Problem description & steps to reproduce

phi 4 - input is empty
Just load model

First Bad Commit

No response

Relevant log output
llama-cli.exe --model models/new3/phi-4-Q8_0.gguf --color --threads 30 --keep -1 --n-predict -1 --ctx-size 16384 --interactive -ngl 99 --simple-io -e --multiline-input --no-display-prompt --conversation --no-mmap
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 4451 (d9feae1c) with MSVC 19.29.30157.0 for
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: loaded meta data with 37 key-value pairs and 243 tensors from models/new3/phi-4-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = phi3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Phi 4
llama_model_loader: - kv 3: general.version str = 4
llama_model_loader: - kv 4: general.organization str = Microsoft
llama_model_loader: - kv 5: general.basename str = phi
llama_model_loader: - kv 6: general.size_label str = 15B
llama_model_loader: - kv 7: general.license str = mit
llama_model_loader: - kv 8: general.license.link str = https://huggingface.co/microsoft/phi-...
llama_model_loader: - kv 9: general.tags arr[str,7] = ["phi", "nlp", "math", "code", "chat"...
llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 11: phi3.context_length u32 = 16384
llama_model_loader: - kv 12: phi3.rope.scaling.original_context_length u32 = 16384
llama_model_loader: - kv 13: phi3.embedding_length u32 = 5120
llama_model_loader: - kv 14: phi3.feed_forward_length u32 = 17920
llama_model_loader: - kv 15: phi3.block_count u32 = 40
llama_model_loader: - kv 16: phi3.attention.head_count u32 = 40
llama_model_loader: - kv 17: phi3.attention.head_count_kv u32 = 10
llama_model_loader: - kv 18: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 19: phi3.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: phi3.rope.freq_base f32 = 250000.000000
llama_model_loader: - kv 21: general.file_type u32 = 7
llama_model_loader: - kv 22: phi3.attention.sliding_window u32 = 0
llama_model_loader: - kv 23: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 24: tokenizer.ggml.pre str = dbrx
llama_model_loader: - kv 25: tokenizer.ggml.tokens arr[str,100352] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 26: tokenizer.ggml.token_type arr[i32,100352] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 27: tokenizer.ggml.merges arr[str,100000] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 100257
llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 100257
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 100257
llama_model_loader: - kv 31: tokenizer.chat_template str = {% for message in messages %}{% if (m...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: quantize.imatrix.file str = /models_out/phi-4-GGUF/phi-4.imatrix
llama_model_loader: - kv 34: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
llama_model_loader: - kv 35: quantize.imatrix.entries_count i32 = 160
llama_model_loader: - kv 36: quantize.imatrix.chunks_count i32 = 127
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type q8_0: 162 tensors
llm_load_vocab: special tokens cache size = 96
llm_load_vocab: token to piece cache size = 0.6151 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi3
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 100352
llm_load_print_meta: n_merges = 100000
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 16384
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 10
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1280
llm_load_print_meta: n_embd_v_gqa = 1280
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 17920
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 250000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 16384
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 14B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 14.66 B
llm_load_print_meta: model size = 14.51 GiB (8.50 BPW)
llm_load_print_meta: general.name = Phi 4
llm_load_print_meta: BOS token = 100257 '<|endoftext|>'
llm_load_print_meta: EOS token = 100257 '<|endoftext|>'
llm_load_print_meta: EOT token = 100265 '<|im_end|>'
llm_load_print_meta: PAD token = 100257 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: FIM PRE token = 100258 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 100260 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 100259 '<|fim_middle|>'
llm_load_print_meta: EOG token = 100257 '<|endoftext|>'
llm_load_print_meta: EOG token = 100265 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CUDA_Host model buffer size = 520.62 MiB
llm_load_tensors: CUDA0 model buffer size = 14334.71 MiB
.....................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 16384
llama_new_context_with_model: n_ctx_per_seq = 16384
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 250000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: kv_size = 16384, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 40, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 3200.00 MiB
llama_new_context_with_model: KV self size = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.38 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1357.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 42.01 MiB
llama_new_context_with_model: graph nodes = 1606
llama_new_context_with_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 30
main: chat template example:
<|im_start|>system<|im_sep|>You are a helpful assistant<|im_end|><|im_start|>user<|im_sep|>Hello<|im_end|><|im_start|>assistant<|im_sep|>Hi there<|im_end|><|im_start|>user<|im_sep|>How are you?<|im_end|><|im_start|>assistant<|im_sep|>

system_info: n_threads = 30 (n_threads_batch = 30) / 32 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

input is empty
You need to add a system message:

-p "You are a helpful assistant"