
OpenCL: add tiled mul_mat_f16_f32 #14535


Open · wants to merge 2 commits into master

Conversation

@rmatif (Collaborator) commented Jul 4, 2025

This PR introduces a new mul_mat_f16_f32 kernel that leverages tiling and vectorization. I believe this will serve as a strong baseline for future improvements.
In a future PR, I may explore using image2d_t to utilize the L1 cache for mul_mat and conv2d operations. This is a bit tricky, as it requires some data preprocessing on the host side.
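
For readers new to the technique, here is a minimal illustrative sketch of a tiled f16 x f32 matmul in OpenCL C (not the kernel from this PR; tile size, vectorization, and work distribution are simplified):

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

#define TILE 16

// Illustrative sketch only -- not the kernel from this PR. Computes C = A * B
// with A in f16 (M x K, row-major) and B, C in f32. Each work-group stages a
// TILE x TILE block of A and of B in local memory, so every input element is
// loaded from global memory once per tile instead of once per output element.
__kernel void mul_mat_f16_f32_tiled(
        const int M, const int N, const int K,
        __global const half  *A,
        __global const float *B,
        __global       float *C) {
    __local half  tA[TILE][TILE];
    __local float tB[TILE][TILE];

    const int lc  = get_local_id(0);   // column within the tile
    const int lr  = get_local_id(1);   // row within the tile
    const int col = get_global_id(0);  // output column in N
    const int row = get_global_id(1);  // output row in M

    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        // cooperative loads; out-of-range lanes stage zeros
        tA[lr][lc] = (row < M && t + lc < K) ? A[row * K + t + lc] : (half) 0.0f;
        tB[lr][lc] = (t + lr < K && col < N) ? B[(t + lr) * N + col] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < TILE; k++) {
            acc += convert_float(tA[lr][k]) * tB[k][lc];
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (row < M && col < N) {
        C[row * N + col] = acc;
    }
}
```

This is launched with a (TILE, TILE) local work size; production kernels typically also vectorize the loads (e.g. half4/float4) and compute several output elements per work-item.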

Results on Adreno 830:

Master:

| model        |     size |  params | backend | ngl |  test |          t/s |
| ------------ | -------: | ------: | ------- | --: | ----: | -----------: |
| llama 1B F16 | 2.30 GiB |  1.24 B | OpenCL  |  99 | pp512 | 19.24 ± 0.88 |
| llama 1B F16 | 2.30 GiB |  1.24 B | OpenCL  |  99 | tg128 | 18.87 ± 4.37 |

This PR:

| model        |     size |  params | backend | ngl |  test |           t/s |
| ------------ | -------: | ------: | ------- | --: | ----: | ------------: |
| llama 1B F16 | 2.30 GiB |  1.24 B | OpenCL  |  99 | pp512 | 168.17 ± 0.41 |
| llama 1B F16 | 2.30 GiB |  1.24 B | OpenCL  |  99 | tg128 |  22.61 ± 0.02 |

@lhez @max-krasnyansky

@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (issues specific to the OpenCL backend) labels on Jul 4, 2025
@ggerganov ggerganov requested a review from max-krasnyansky July 4, 2025 17:57
@lhez (Collaborator) commented Jul 4, 2025

@rmatif thank you for the PR. I will play with it and the direct convolution PR in the next few days.

For matmul, using image1d_buffer is probably the easiest way to utilize the L1 cache - it wraps around a normal cl buffer and uses read_image for access, so the indexing stays the same as for a cl buffer. The Q4_0 matmul is already doing this. It is also possible to use a normal cl buffer for one matrix input and an image1d_buffer for the other, to use both load paths.
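
A minimal host-side sketch of this wrapping, with illustrative names (`weights_buf` is assumed to be an existing `cl_mem` holding f16 data, `n_f16_elements` its length):

```c
/* Sketch with assumed names: wrap an existing cl buffer in an
 * image1d_buffer_t view so kernel-side loads can go through the image/L1
 * path. Requires OpenCL 1.2+; width is limited by
 * CL_DEVICE_IMAGE_MAX_BUFFER_SIZE. */
cl_image_format fmt  = { CL_RGBA, CL_HALF_FLOAT };  /* one texel = 4 x f16  */
cl_image_desc   desc = {0};
desc.image_type  = CL_MEM_OBJECT_IMAGE1D_BUFFER;
desc.image_width = n_f16_elements / 4;              /* width is in texels   */
desc.buffer      = weights_buf;                     /* a view, not a copy   */

cl_int err;
cl_mem weights_view = clCreateImage(context, CL_MEM_READ_ONLY,
                                    &fmt, &desc, NULL, &err);

/* Kernel side, the linear index is unchanged -- texel i covers halfs
 * 4*i .. 4*i+3:
 *
 *   __kernel void k(__read_only image1d_buffer_t A, ...) {
 *       half4 a = read_imageh(A, i);   // fetch through the image path
 *       ...
 *   }
 */
```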

@zhouwg (Contributor) commented Jul 5, 2025

@rmatif, sorry to bother you, and congratulations on another excellent PR for ggml-opencl.

  1. Could you provide a script that sets up a local dev environment automatically and builds ggml-opencl for an Android phone? This would be very useful for other developers who also want to play with this excellent PR and reproduce the impressive benchmark data on an 8 Elite based phone. FYI, here is a similar script that does the same thing to simplify the development-and-test workflow for Qualcomm's Hexagon NPU backend on Android.
  2. Obviously, you are familiar with Android development and very good at some areas of hardcore AI tech.
  3. It seems that the Adreno 830 is the GPU component of Snapdragon 8 Elite phones. Could you help review my PR for the Hexagon NPU on Android, or reproduce the benchmark data in my forked llama.cpp project, ggml-hexagon, if you have time?
  4. Following other existing backends, I think we can split the Hexagon NPU PR into steps:
    • Verify the code on the host side, then merge to the master branch
    • Verify GGML_OP_ADD on the cDSP side, then merge to the master branch
    • Verify fp32 mulmat on the cDSP side, then merge to the master branch
    • You and other AI experts can add other ops accordingly .....

What do you think of this plan? Looking forward to your reply/advice, and thanks.

@rmatif (Collaborator, Author) commented Jul 5, 2025

> @rmatif thank you for the PR. I will play with it and the direct convolution PR in the next few days.
>
> For matmul, using image1d_buffer is probably the easiest way to utilize the L1 cache - it wraps around a normal cl buffer and uses read_image for access, so the indexing stays the same as for a cl buffer. The Q4_0 matmul is already doing this. It is also possible to use a normal cl buffer for one matrix input and an image1d_buffer for the other, to use both load paths.

@lhez You're right, using image1d_buffer is indeed a much simpler way to leverage the L1 cache. It avoids the need to manually handle row_pitch and the complexity of converting data into a 2D-tiled memory format, since it essentially acts as a "view" of an existing cl_buffer. I may begin by looking into that first as an incremental step.
However, I believe image2d_t is ultimately the best path forward, especially on Adreno, because its L1 cache is highly optimized for 2D spatial locality. MNN uses this technique extensively for its matmul ops.
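
A hypothetical kernel fragment showing the image2d_t access pattern (illustrative only, not code from this PR or from MNN):

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Hypothetical fragment: a matrix stored as an image2d_t of half4 texels,
// fetched by 2D coordinates so that neighbouring work-items read spatially
// adjacent texels through the 2D-optimized L1. The host must first upload
// the matrix in this image layout.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST;

__kernel void mat_vec_from_image(
        __read_only image2d_t  A,       // ncols4 x nrows texels, half4 each
        __global const float4 *x,       // input vector, packed 4-wide
        __global       float  *y,       // output vector, one float per row
        const int              ncols4) {
    const int row = get_global_id(0);
    float acc = 0.0f;
    for (int i = 0; i < ncols4; i++) {
        half4 a = read_imageh(A, smp, (int2)(i, row)); // texel = 4 f16 weights
        acc += dot(convert_float4(a), x[i]);
    }
    y[row] = acc;
}
```

The trade-off is the host-side preprocessing mentioned above: the matrix must first be copied into the image layout, whereas an image1d_buffer_t view reuses the existing buffer in place.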

> What do you think of this plan? Looking forward to your reply/advice, and thanks.

@zhouwg Please reach out to me via email, and I'll send you the build scripts and discuss further, as this seems off-topic here.
In short, my current take is that our time and effort would be better spent optimizing OpenCL; there's still significant room for improvement there. To me, it's not clear that we can achieve good enough performance on Hexagon at the moment.

@zhouwg (Contributor) commented Jul 6, 2025

@rmatif, thanks so much for your help. I'm so excited; this is my first time running the ggml-opencl backend on my Snapdragon 8 Elite based phone.

llama-bench with qwen1_5-1_8b-chat-q4_0.gguf on master:



zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamabench
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 17.6 MB/s (6962256 bytes in 0.378s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 17.6 MB/s (3407440 bytes in 0.185s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 17.4 MB/s (1476880 bytes in 0.081s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 17.0 MB/s (1764248 bytes in 0.099s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 18.5 MB/s (24163448 bytes in 1.245s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 17.7 MB/s (4863024 bytes in 0.261s)
6 files pushed. 18.1 MB/s (42637296 bytes in 2.252s)
./out/ggmlopencl-android/bin/llama-bench: 1 file pushed. 17.7 MB/s (4770920 bytes in 0.258s)
-rwxrwxrwx 1 shell shell 6962256 2025-07-06 10:04 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell 3407440 2025-07-06 10:04 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell 5849280 2025-07-05 08:04 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell 1476880 2025-07-06 10:04 /data/local/tmp/libggml-opencl.so
adb shell "cd /data/local/tmp                && export LD_LIBRARY_PATH=/data/local/tmp                && /data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/qwen1_5-1_8b-chat-q4_0.gguf"
/data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/qwen1_5-1_8b-chat-q4_0.gguf
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| qwen2 1B Q4_0                  |   1.04 GiB |     1.84 B | OpenCL     |  99 |       4 |           pp512 |        329.52 ± 0.47 |
| qwen2 1B Q4_0                  |   1.04 GiB |     1.84 B | OpenCL     |  99 |       4 |           tg256 |         29.77 ± 0.06 |

build: a4701c4be (6025)
running time:2025-07-06,10:21:10


llama-cli with qwen1_5-1_8b-chat-q4_0.gguf on master:


zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamacli
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 17.5 MB/s (6962256 bytes in 0.380s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 19.2 MB/s (3407440 bytes in 0.169s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 26.5 MB/s (1476880 bytes in 0.053s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 21.0 MB/s (1764248 bytes in 0.080s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 18.7 MB/s (24163448 bytes in 1.234s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 17.7 MB/s (4863024 bytes in 0.262s)
6 files pushed. 18.6 MB/s (42637296 bytes in 2.184s)
./out/ggmlopencl-android/bin/llama-cli: 1 file pushed. 18.6 MB/s (27712544 bytes in 1.422s)
-rwxrwxrwx 1 shell shell 6962256 2025-07-06 10:04 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell 3407440 2025-07-06 10:04 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell 5849280 2025-07-05 08:04 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell 1476880 2025-07-06 10:04 /data/local/tmp/libggml-opencl.so
/data/local/tmp/llama-cli  -ngl 99 -t 4 -n 256 --no-warmup  -no-cnv -m /sdcard/gemma-3n-E2B-it-Q8_0.gguf -p ""
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
build: 6025 (a4701c4be) with Android (12896553, +pgo, +bolt, +lto, +mlgo, based on r530567c) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device GPUOpenCL (QUALCOMM Adreno(TM) 830) - 0 MiB free
llama_model_loader: loaded meta data with 42 key-value pairs and 727 tensors from /sdcard/gemma-3n-E2B-it-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3n
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 4.5B
llama_model_loader: - kv   3:                            general.license str              = gemma
llama_model_loader: - kv   4:                   general.base_model.count u32              = 1
llama_model_loader: - kv   5:                  general.base_model.0.name str              = Gemma 3n E4b It
llama_model_loader: - kv   6:          general.base_model.0.organization str              = Google
llama_model_loader: - kv   7:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv   8:                               general.tags arr[str,5]       = ["automatic-speech-recognition", "aut...
llama_model_loader: - kv   9:                     gemma3n.context_length u32              = 32768
llama_model_loader: - kv  10:                   gemma3n.embedding_length u32              = 2048
llama_model_loader: - kv  11:                        gemma3n.block_count u32              = 30
llama_model_loader: - kv  12:                gemma3n.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:               gemma3n.attention.head_count u32              = 8
llama_model_loader: - kv  14:   gemma3n.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:               gemma3n.attention.key_length u32              = 256
llama_model_loader: - kv  16:             gemma3n.attention.value_length u32              = 256
llama_model_loader: - kv  17:                     gemma3n.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  18:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  19:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  21:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  22:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  23:         gemma3n.attention.shared_kv_layers f32              = 10.000000
llama_model_loader: - kv  24:          gemma3n.activation_sparsity_scale arr[f32,30]      = [1.644853, 1.644853, 1.644853, 1.6448...
llama_model_loader: - kv  25:   gemma3n.attention.sliding_window_pattern arr[bool,30]     = [true, true, true, true, false, true,...
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  37:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  38:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  39:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  40:               general.quantization_version u32              = 2
llama_model_loader: - kv  41:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  362 tensors
llama_model_loader: - type  f16:   93 tensors
llama_model_loader: - type q8_0:  272 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 4.45 GiB (8.59 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3n
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 30
print_info: n_head           = 8
print_info: n_head_kv        = 2
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 1.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = E2B
print_info: model params     = 4.46 B
print_info: general.name     = n/a
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 31/31 layers to GPU
load_tensors:   CPU_Mapped model buffer size =  4482.04 MiB
load_tensors:       OpenCL model buffer size =    95.42 MiB
..........................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified:     OpenCL KV buffer size =    32.00 MiB
llama_kv_cache_unified: size =   32.00 MiB (  4096 cells,   4 layers,  1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache_unified:     OpenCL KV buffer size =    32.00 MiB
llama_kv_cache_unified: size =   32.00 MiB (  1024 cells,  16 layers,  1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:     OpenCL compute buffer size =   147.00 MiB
llama_context:        CPU compute buffer size =   516.00 MiB
llama_context: graph nodes  = 2881
llama_context: graph splits = 341
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

sampler seed: 1805276991
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 256, n_keep = 1



llama_perf_sampler_print:    sampling time =      67.92 ms /   257 runs   (    0.26 ms per token,  3784.03 tokens per second)
llama_perf_context_print:        load time =    3001.45 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =   41861.04 ms /   256 runs   (  163.52 ms per token,     6.12 tokens per second)
llama_perf_context_print:       total time =   44826.33 ms /   257 tokens

llama-bench with Llama-3.2-1B-Instruct-f16.gguf on this PR:


zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamabench
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 29.0 MB/s (6962256 bytes in 0.229s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 29.6 MB/s (3407440 bytes in 0.110s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 27.0 MB/s (1487528 bytes in 0.053s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 27.7 MB/s (1764248 bytes in 0.061s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 28.9 MB/s (24163448 bytes in 0.798s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 28.7 MB/s (4863024 bytes in 0.161s)
6 files pushed. 28.7 MB/s (42647944 bytes in 1.415s)
./out/ggmlopencl-android/bin/llama-bench: 1 file pushed. 29.0 MB/s (4770920 bytes in 0.157s)
-rwxrwxrwx 1 shell shell 6962256 2025-07-06 12:36 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell 3407440 2025-07-06 12:36 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell 5848736 2025-07-06 10:56 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell 1487528 2025-07-06 12:36 /data/local/tmp/libggml-opencl.so
adb shell "cd /data/local/tmp                && export LD_LIBRARY_PATH=/data/local/tmp                && /data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf"
/data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels.....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           pp512 |        155.24 ± 0.36 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           tg256 |         20.11 ± 0.04 |

build: 0de83f97c (6027)
running time:2025-07-06,12:38:05

llama-bench with Llama-3.2-1B-Instruct-f16.gguf on master:

zhouwg:$ ./scripts/build-run-ggmlopencl-android.sh run_llamabench
current working path:/home/zhouwg/kantvai/llama.cpp

/usr/bin/wget

/usr/bin/git

/usr/bin/ninja

/bin/ls
Android NDK already exist:   /home/zhouwg/kantvai/llama.cpp/prebuilts/android-ndk-r28 

OpenCL SDK already exist:    /home/zhouwg/kantvai/llama.cpp/prebuilts/OpenCL_SDK 

/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
/sdcard/qwen1_5-1_8b-chat-q4_0.gguf
the prebuild LLM model qwen1_5-1_8b-chat-q4_0.gguf already exist on Android phone
/sdcard/gemma-3n-E2B-it-Q8_0.gguf
the prebuild LLM model gemma-3n-E2B-it-Q8_0.gguf already exist on Android phone
./out/ggmlopencl-android/bin/libggml-base.so: 1 file pushed. 17.5 MB/s (6962256 bytes in 0.379s)
./out/ggmlopencl-android/bin/libggml-cpu.so: 1 file pushed. 17.3 MB/s (3407440 bytes in 0.188s)
./out/ggmlopencl-android/bin/libggml-opencl.so: 1 file pushed. 16.7 MB/s (1476880 bytes in 0.084s)
./out/ggmlopencl-android/bin/libggml.so: 1 file pushed. 17.2 MB/s (1764248 bytes in 0.098s)
./out/ggmlopencl-android/bin/libllama.so: 1 file pushed. 18.9 MB/s (24163448 bytes in 1.219s)
./out/ggmlopencl-android/bin/libmtmd.so: 1 file pushed. 17.2 MB/s (4863024 bytes in 0.269s)
6 files pushed. 18.2 MB/s (42637296 bytes in 2.239s)
./out/ggmlopencl-android/bin/llama-bench: 1 file pushed. 17.8 MB/s (4770920 bytes in 0.256s)
-rwxrwxrwx 1 shell shell 6962256 2025-07-06 12:42 /data/local/tmp/libggml-base.so
-rwxrwxrwx 1 shell shell 3407440 2025-07-06 12:42 /data/local/tmp/libggml-cpu.so
-rwxrwxrwx 1 shell shell 5848736 2025-07-06 10:56 /data/local/tmp/libggml-hexagon.so
-rwxrwxrwx 1 shell shell 1476880 2025-07-06 12:42 /data/local/tmp/libggml-opencl.so
adb shell "cd /data/local/tmp                && export LD_LIBRARY_PATH=/data/local/tmp                && /data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf"
/data/local/tmp/llama-bench  -ngl 99 -t 4 -n 256 --no-warmup  -m /sdcard/Llama-3.2-1B-Instruct-f16.gguf
ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'

ggml_opencl: device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.47.18.23
ggml_opencl: vector subgroup broadcast support: true
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 1024 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: loading OpenCL kernels....................................................
ggml_opencl: default device: 'QUALCOMM Adreno(TM) 830 (OpenCL 3.0 Adreno(TM) 830)'
| model                          |       size |     params | backend    | ngl | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           pp512 |         15.54 ± 1.89 |
| llama 1B F16                   |   2.30 GiB |     1.24 B | OpenCL     |  99 |       4 |           tg256 |         16.02 ± 0.03 |

build: 0de83f97c (6027)
running time:2025-07-06,12:50:12

BTW, I provide a simple build/shell script that builds the ggml-opencl backend on Linux, to simplify the workflow: https://github.com/zhouwg/ggml-hexagon/blob/self-build/scripts/build-run-ggmlopencl-android.sh

Can I add this script to this excellent PR, or submit a standalone PR, so other developers can help verify ggml-opencl related PRs or learn something about OpenCL programming on Android phones? The script is technically simple, but it might be very useful for other developers.
