DeepSeek Models (V2/V3) Hang with ROCm Backend #11141

Open
emuchogu opened this issue Jan 8, 2025 · 7 comments

Comments

@emuchogu

emuchogu commented Jan 8, 2025

Name and Version

./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 ROCm devices:
Device 0: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 1: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 2: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 3: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 4: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 5: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 6: AMD Instinct MI100, compute capability 9.0, VMM: no
Device 7: AMD Instinct MI100, compute capability 9.0, VMM: no
version: 4436 (53ff6b9)
built with Ubuntu clang version 12.0.1-19ubuntu3 for x86_64-pc-linux-gnu

Operating systems

Linux

GGML backends

HIP

Hardware

AMD Instinct MI100

Models

DeepSeek-V2
DeepSeek-V3

Problem description & steps to reproduce

Description

When attempting to run DeepSeek models (V2 or V3) using the ROCm backend, the models load successfully into VRAM but fail to generate any output. One GPU becomes pinned at 100% utilization while the others remain idle.

Commands Used

DeepSeek V2

./llama-cli -m /models/DeepSeek-V2-Chat-0628-Q4_K_M-00001-of-00004.gguf -ngl 999 --prompt '<|User|>why is the sky blue?<|Assistant|>'

DeepSeek V3

./llama-cli -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 --prompt '<|User|>why is the sky blue?<|Assistant|>'

Observed Behavior

  1. Model loads successfully and distributes across available GPUs
  2. After loading, one GPU gets stuck at 100% utilization (see the rocm-smi sketch after this list)
  3. No text generation occurs
  4. Other GPUs remain idle with only VRAM usage showing
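
The pinned GPU can be watched live while the model hangs. A minimal sketch, assuming rocm-smi from the ROCm install is on PATH (flag names can vary between ROCm releases):

# refresh per-GPU utilization and VRAM usage once per second
watch -n 1 rocm-smi --showuse --showmemuse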

Steps to Reproduce

  1. Load DeepSeek model (V2 or V3) using llama.cpp with ROCm backend (a build sketch follows this list)
  2. Set appropriate number of layers for GPU offload (-ngl parameter)
    • For V2: use -ngl 999 for automatic layer distribution
    • For V3: use -ngl 48 for specific layer allocation
  3. Attempt text generation with any prompt using the commands shown above
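
For reference, a hedged build sketch for the HIP/ROCm backend targeting the MI100 (gfx908); recent llama.cpp trees enable it with -DGGML_HIP=ON, while older ones used -DGGML_HIPBLAS=ON, so adjust to the checkout in use:

# point cmake at the ROCm clang and enable the HIP backend for gfx908 (MI100)
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx908 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j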

Additional Notes

  • Both models exhibit similar behavior despite different quantization methods
  • Model loading and VRAM distribution appear normal
  • Issue occurs consistently across multiple attempts
  • The same behavior happens when running deepseek-v2:16b-lite-chat-q4_K_M in Ollama

First Bad Commit

No response

Relevant log output

root@dd8e6159288b:/app/build/bin# ./llama-cli -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 --prompt '<|User|>why is the sky blue?<|Assistant|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 ROCm devices:
  Device 0: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 1: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 2: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 3: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 4: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 5: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 6: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 7: AMD Instinct MI100, compute capability 9.0, VMM: no
build: 4436 (53ff6b9b) with Ubuntu clang version 12.0.1-19ubuntu3 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file: using device ROCm0 (AMD Instinct MI100) - 32180 MiB free
llama_model_load_from_file: using device ROCm1 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm2 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm3 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm4 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm5 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm6 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm7 (AMD Instinct MI100) - 32714 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 46 key-value pairs and 1025 tensors from /deepseek-v3/deepseek-v3-unsloght/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 BF16
llama_model_loader: - kv   3:                         general.size_label str              = 256x20B
llama_model_loader: - kv   4:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   5:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   6:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   7:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv   8:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   9:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  10:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  13:                          general.file_type u32              = 10
llama_model_loader: - kv  14:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  15:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  16:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  17:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  18:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  19:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  20:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  21:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  22:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  23:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  24:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  25:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  26:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  27:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  29: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  30: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  31:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  32:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  34:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  35:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  40:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  41:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  42:               general.quantization_version u32              = 2
llama_model_loader: - kv  43:                                   split.no u16              = 0
llama_model_loader: - kv  44:                                split.count u16              = 5
llama_model_loader: - kv  45:                        split.tensors.count i32              = 1025
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q2_K:  482 tensors
llama_model_loader: - type q3_K:  180 tensors
llama_model_loader: - type q4_K:    1 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 671.03 B
llm_load_print_meta: model size       = 227.47 GiB (2.91 BPW) 
llm_load_print_meta: general.name     = DeepSeek V3 BF16
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: offloaded 48/62 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size = 41684.45 MiB
llm_load_tensors:        ROCm0 model buffer size = 23905.15 MiB
llm_load_tensors:        ROCm1 model buffer size = 23905.15 MiB
llm_load_tensors:        ROCm2 model buffer size = 23905.15 MiB
llm_load_tensors:        ROCm3 model buffer size = 23905.15 MiB
llm_load_tensors:        ROCm4 model buffer size = 23905.15 MiB
llm_load_tensors:        ROCm5 model buffer size = 23905.15 MiB
llm_load_tensors:        ROCm6 model buffer size = 23905.15 MiB
llm_load_tensors:        ROCm7 model buffer size = 23905.15 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:        CPU KV buffer size =  4160.00 MiB
llama_kv_cache_init:      ROCm0 KV buffer size =  1920.00 MiB
llama_kv_cache_init:      ROCm1 KV buffer size =  1920.00 MiB
llama_kv_cache_init:      ROCm2 KV buffer size =  1920.00 MiB
llama_kv_cache_init:      ROCm3 KV buffer size =  1920.00 MiB
llama_kv_cache_init:      ROCm4 KV buffer size =  1920.00 MiB
llama_kv_cache_init:      ROCm5 KV buffer size =  1920.00 MiB
llama_kv_cache_init:      ROCm6 KV buffer size =  1920.00 MiB
llama_kv_cache_init:      ROCm7 KV buffer size =  1920.00 MiB
llama_new_context_with_model: KV self size  = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =  2790.00 MiB
llama_new_context_with_model:      ROCm1 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      ROCm2 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      ROCm3 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      ROCm4 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      ROCm5 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      ROCm6 compute buffer size =  1186.00 MiB
llama_new_context_with_model:      ROCm7 compute buffer size =  1186.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    88.01 MiB
llama_new_context_with_model: graph nodes  = 5025
llama_new_context_with_model: graph splits = 243 (with bs=512), 10 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 20

system_info: n_threads = 20 (n_threads_batch = 20) / 20 | ROCm : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

sampler seed: 4089827234
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

why is the sky blue?


========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK     MCLK     Fan  Perf  PwrCap  VRAM%  GPU%
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)
====================================================================================================================
0       1     0x738c,   16733  54.0°C  97.0W  N/A, N/A, 0         1502Mhz  1200Mhz  0%   auto  290.0W  89%    100%
1       2     0x738c,   57681  41.0°C  35.0W  N/A, N/A, 0         300Mhz   1200Mhz  0%   auto  290.0W  84%    0%
2       3     0x738c,   33109  42.0°C  39.0W  N/A, N/A, 0         300Mhz   1200Mhz  0%   auto  290.0W  84%    0%
3       4     0x738c,   8559   42.0°C  39.0W  N/A, N/A, 0         300Mhz   1200Mhz  0%   auto  290.0W  84%    0%
4       5     0x738c,   57703  41.0°C  34.0W  N/A, N/A, 0         300Mhz   1200Mhz  0%   auto  290.0W  84%    0%
5       6     0x738c,   33123  39.0°C  34.0W  N/A, N/A, 0         300Mhz   1200Mhz  0%   auto  290.0W  84%    0%
6       7     0x738c,   57724  41.0°C  39.0W  N/A, N/A, 0         300Mhz   1200Mhz  0%   auto  290.0W  84%    0%
@ggerganov
Owner

Does it run with CPU-only on this system?

Attach the output of:

GGML_SCHED_DEBUG=2 ./bin/llama-eval-callback -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 -n 1 -lv 1

@emuchogu
Author

emuchogu commented Jan 8, 2025

Here is the output for the ./bin/llama-eval-callback command:
logs.txt

Command used:
Note: while the command is running, all GPUs remain at 0% activity.

GGML_SCHED_DEBUG=2 ./llama-eval-callback -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 -n 1 -lv 1

Does it run with CPU-only on this system?
Yes. It runs CPU-only as expected; see the logs below:

./llama-cli -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf --prompt '<|User|>why is the sky blue?<|Assistant|>'

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 ROCm devices:
  Device 0: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 1: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 2: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 3: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 4: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 5: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 6: AMD Instinct MI100, compute capability 9.0, VMM: no
  Device 7: AMD Instinct MI100, compute capability 9.0, VMM: no
build: 4436 (53ff6b9b) with Ubuntu clang version 12.0.1-19ubuntu3 for x86_64-pc-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file: using device ROCm0 (AMD Instinct MI100) - 32180 MiB free
llama_model_load_from_file: using device ROCm1 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm2 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm3 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm4 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm5 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm6 (AMD Instinct MI100) - 32714 MiB free
llama_model_load_from_file: using device ROCm7 (AMD Instinct MI100) - 32714 MiB free
llama_model_loader: additional 4 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 46 key-value pairs and 1025 tensors from /deepseek-v3/deepseek-v3-unsloght/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V3 BF16
llama_model_loader: - kv   3:                         general.size_label str              = 256x20B
llama_model_loader: - kv   4:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   5:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   6:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   7:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv   8:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   9:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  10:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  13:                          general.file_type u32              = 10
llama_model_loader: - kv  14:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  15:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  16:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  17:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  18:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  19:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  20:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  21:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  22:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  23:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  24:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  25:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  26:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  27:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  29: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  30: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  31:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  32:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  34:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  35:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  40:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  41:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  42:               general.quantization_version u32              = 2
llama_model_loader: - kv  43:                                   split.no u16              = 0
llama_model_loader: - kv  44:                                split.count u16              = 5
llama_model_loader: - kv  45:                        split.tensors.count i32              = 1025
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q2_K:  482 tensors
llama_model_loader: - type q3_K:  180 tensors
llama_model_loader: - type q4_K:    1 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q2_K - Medium
llm_load_print_meta: model params     = 671.03 B
llm_load_print_meta: model size       = 227.47 GiB (2.91 BPW) 
llm_load_print_meta: general.name     = DeepSeek V3 BF16
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/62 layers to GPU
llm_load_tensors:   CPU_Mapped model buffer size = 47284.62 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 46264.28 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 46559.74 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 46622.68 MiB
llm_load_tensors:   CPU_Mapped model buffer size = 46194.32 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 4096
llama_new_context_with_model: n_ctx_per_seq = 4096
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 0.025
llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init:        CPU KV buffer size = 19520.00 MiB
llama_new_context_with_model: KV self size  = 19520.00 MiB, K (f16): 11712.00 MiB, V (f16): 7808.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =  2790.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    88.01 MiB
llama_new_context_with_model: graph nodes  = 5025
llama_new_context_with_model: graph splits = 1148 (with bs=512), 1 (with bs=1)
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 20

system_info: n_threads = 20 (n_threads_batch = 20) / 20 | ROCm : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

sampler seed: 2671311658
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

why is the sky blue?The sky appears blue due to a phenomenon called **Rayleigh scattering**. Here's how it works:

1. **Sunlight and Earth's Atmosphere**: Sunlight, which appears white, is made up of different colors of light, each with its own wavelength. When sunlight reaches Earth, it interacts with the molecules and particles in the atmosphere.

2. **Scattering of Light**: Shorter wavelengths of light, such as blue and violet, are scattered in all directions by the gases and particles in the atmosphere (primarily nitrogen and oxygen). This process is called **Rayleigh scattering**.

3. **Why Blue?**: Our eyes are more sensitive to blue light than violet, and blue light is scattered more effectively than other colors because of its shorter wavelength. As a result, the sky appears blue to us.

4. **Sunset and Sunrise**: During sunrise and sunset, sunlight has to travel through more of the Earth's atmosphere to reach us. This causes even more scattering, leaving mostly red and orange wavelengths to reach our eyes, which is why the sky appears reddish during these times.

In summary, the sky is blue because blue light is scattered in all directions by the atmosphere, and our eyes perceive it as blue. [end of text]


llama_perf_sampler_print:    sampling time =      25.40 ms /   256 runs   (    0.10 ms per token, 10079.53 tokens per second)
llama_perf_context_print:        load time =  255587.05 ms
llama_perf_context_print: prompt eval time =    5525.97 ms /     9 tokens (  614.00 ms per token,     1.63 tokens per second)
llama_perf_context_print:        eval time =  107345.77 ms /   246 runs   (  436.36 ms per token,     2.29 tokens per second)
llama_perf_context_print:       total time =  113201.88 ms /   255 tokens

@ggerganov
Owner

And also the output of:

GGML_SCHED_DEBUG=2 ./llama-eval-callback -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 -n 1 -lv 1 --prompt '<|User|>why is the sky blue?<|Assistant|>'

This command will likely hang, but please still attach the logs obtained up to that point.
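
If it helps, one way to capture the partial log from a run that never finishes is to cap it with a timeout and tee the output to a file (a sketch; the 10-minute cap and the log file name are arbitrary):

# kill the run after 10 minutes, keeping everything printed up to that point
timeout 600 env GGML_SCHED_DEBUG=2 ./llama-eval-callback \
    -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 -n 1 -lv 1 \
    --prompt '<|User|>why is the sky blue?<|Assistant|>' 2>&1 | tee eval-callback.log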

@dranger003
Contributor

dranger003 commented Jan 8, 2025

I'm seeing the same behavior with CUDA; it works fine on commit c792dcf (right before the ggml sync). I'll see if I can run llama-eval-callback as requested.

EDIT: Not sure what's going on; I went back to the latest commit and now it's working again.
EDIT2: The issue seems to occur only after a fresh reboot.

I really don't know what's happening: running llama-eval-callback seems to "reset" things just like running an older commit does, and then it works, but the issue returns after a fresh reboot. Not sure whether this is related at this point.

EDIT3: I decided to reboot once more and run it again, but this time I let it sit, and it eventually started to respond; the timings are below. It took almost 8 minutes before the first token appeared.

llama_perf_sampler_print:    sampling time =       2.32 ms /   106 runs   (    0.02 ms per token, 45630.65 tokens per second)
llama_perf_context_print:        load time =   59935.88 ms
llama_perf_context_print: prompt eval time =  461990.54 ms /    72 tokens ( 6416.54 ms per token,     0.16 tokens per second)
llama_perf_context_print:        eval time =    7333.56 ms /    33 runs   (  222.23 ms per token,     4.50 tokens per second)
llama_perf_context_print:       total time =  469676.92 ms /   105 tokens

Every subsequent run appears faster (same CLI command):

llama_perf_sampler_print:    sampling time =       3.55 ms /   128 runs   (    0.03 ms per token, 36076.66 tokens per second)
llama_perf_context_print:        load time =   39898.45 ms
llama_perf_context_print: prompt eval time =   99694.29 ms /    72 tokens ( 1384.64 ms per token,     0.72 tokens per second)
llama_perf_context_print:        eval time =   12351.07 ms /    55 runs   (  224.56 ms per token,     4.45 tokens per second)
llama_perf_context_print:       total time =  112453.91 ms /   127 tokens
llama_perf_sampler_print:    sampling time =       5.79 ms /   157 runs   (    0.04 ms per token, 27092.32 tokens per second)
llama_perf_context_print:        load time =   37186.03 ms
llama_perf_context_print: prompt eval time =   35917.89 ms /    72 tokens (  498.86 ms per token,     2.00 tokens per second)
llama_perf_context_print:        eval time =   19323.49 ms /    84 runs   (  230.04 ms per token,     4.35 tokens per second)
llama_perf_context_print:       total time =   55530.53 ms /   156 tokens

@emuchogu
Author

emuchogu commented Jan 9, 2025

Here is the output for the ./llama-eval-callback command:
llama-eval-callback_logs_.txt

The command exhibits the same behavior, hanging with one of the GPUs pegged at 100%.

Command used:

GGML_SCHED_DEBUG=2 ./llama-eval-callback -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf -ngl 48 -n 1 -lv 1 --prompt '<|User|>why is the sky blue?<|Assistant|>' 

@ggerganov
Owner

@emuchogu If you let it run for some minutes like @dranger003 did, does it eventually continue? This might be an extreme case of the issue discussed in #11005.
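
A quick way to tell a very slow warm-up apart from a true hang (a sketch that reuses the model path and -ngl value from the commands above) is to time a single-token run with the warm-up pass disabled:

# -n 1 generates a single token; --no-warmup skips the empty warm-up run
time ./llama-cli -m /models/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf \
    -ngl 48 -n 1 --no-warmup \
    --prompt '<|User|>why is the sky blue?<|Assistant|>'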

@emuchogu
Author

emuchogu commented Jan 9, 2025

I ran the command for 30 minutes and observed the same behavior: one GPU remained pinned at 100% the entire time, and there was no output.

Attached is the log for the 30-minute run.
llama-eval-callback_logs_30min.txt
