
Does the QNN backend support the Llama 3.2 3B model (instead of XNNPACK)? #9311

Open
tombang opened this issue Mar 17, 2025 · 9 comments
Labels
module: qnn (Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/)
need-user-input (The issue needs more information from the reporter before moving forward)
partner: qualcomm (For backend delegation, kernels, demo, etc. from the 3rd-party partner, Qualcomm)

Comments

tombang commented Mar 17, 2025

🐛 Describe the bug

I have run the XNNPACK tutorial and the resulting .pte file runs normally. But when I follow the Llama 3 8B tutorial and change the model to Llama 3.2 3B, the model fails to load on an Android device with a Qualcomm 8 Gen 2 SoC; the failure code is 1.

Versions

QNN:2.26
SDK:r27b

cc @cccclai @winskuo-quic @shewu-quic @cbilgin

@kimishpatel added the need-user-input label Mar 17, 2025
kimishpatel (Contributor) commented:

A few follow-ups:

  1. The issue summary does not say whether you are running the QNN workflow. Can you update it with details?
  2. Please add your repro steps.

tombang (Author) commented Mar 18, 2025

I followed this tutorial (https://github.com/pytorch/executorch/blob/release/0.5/examples/demo-apps/android/LlamaDemo/docs/delegates/qualcomm_README.md) exactly, except that I replaced the Llama 3 8B model with Llama 3.2 3B during execution. The export command I used is as follows:

python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 --num_sharding 4 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name="test.pte" --soc_model SM8550

This command successfully exports the model, but when running the APK on Android, a model loading error occurs with error code -1.
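
To help isolate whether the failure is in the demo APK or in the exported program itself, one option is to run the .pte directly with the command-line Llama runner over adb. A minimal sketch is below, assuming an Android build of the llama_main runner from the ExecuTorch Llama example; the build output path, device directory, and runner flags are taken from that example rather than from this thread, so adjust them to your setup.

# Sketch only: paths and the llama_main binary name are assumptions from the ExecuTorch Llama example.
# The QNN runtime libraries (libQnnHtp*.so, libQnnSystem.so) also need to be on the device and visible
# via LD_LIBRARY_PATH / ADSP_LIBRARY_PATH for the QNN delegate to load.
adb shell mkdir -p /data/local/tmp/llama
adb push test.pte /data/local/tmp/llama/
adb push "${MODEL_DIR}/tokenizer.model" /data/local/tmp/llama/
adb push cmake-android-out/examples/models/llama/llama_main /data/local/tmp/llama/
adb shell "cd /data/local/tmp/llama && ./llama_main --model_path=test.pte --tokenizer_path=tokenizer.model --prompt='Once upon a time' --seq_len=128"

If the runner fails with the same load error, the problem is in the exported context binary or the delegate libraries rather than in the app.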

@kimishpatel added the module: qnn label Mar 19, 2025
kimishpatel (Contributor) commented:

OK, let's follow up on this.

@cccclai added the partner: qualcomm label Mar 19, 2025
cccclai (Contributor) commented Mar 19, 2025

Can you try QNN version 2.28?
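
In case it helps while switching versions, a minimal sketch of pointing the build at a different QNN SDK is below. The install path is a placeholder, and the important part is that the host-side (ahead-of-time) QNN libraries used during export and the libraries packaged for the device come from the same SDK version, otherwise the generated context binary and its loader can disagree.

# Placeholder path: point QNN_SDK_ROOT at the unpacked QNN / Qualcomm AI Engine Direct 2.28 SDK.
export QNN_SDK_ROOT=/path/to/qairt/2.28
export LD_LIBRARY_PATH=$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH
# Rebuild the Qualcomm backend so export_llama picks up the new SDK
# (assuming the backends/qualcomm/scripts/build.sh helper is still the way to do this).
./backends/qualcomm/scripts/build.sh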

tombang (Author) commented Mar 21, 2025

Hi,

I changed QNN to 2.28, and an error occurred:

[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
[INFO] [Qnn ExecuTorch]: create QNN Logger with log_level 2
[INFO] [Qnn ExecuTorch]: Initialize Qnn backend parameters for Qnn executorch backend type 2
[INFO] [Qnn ExecuTorch]: Caching: Caching is in RESTORE MODE.
[WARNING] [Qnn ExecuTorch]: Unknown QNN BinaryInfo version 3.
[ERROR] [Qnn ExecuTorch]: Failed to retrieve backend binary info from QNN context binary.
[ERROR] [Qnn ExecuTorch]: Failed to parse QNN Graph Info. The cache might be broken. Please consider to re-generate the cache.
[WARNING] [Qnn ExecuTorch]: QnnDsp Performance Estimates unsupported

[WARNING] [Qnn ExecuTorch]: QnnDsp Arch 68 set by custom config is different from arch associated with SoC 43, will overwrite it to 73

[ERROR] [Qnn ExecuTorch]: QNN context cache is invalid.
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/mnt/data/tmp/executorch/examples/models/llama/export_llama.py", line 32, in
main() # pragma: no cover
File "/mnt/data/tmp/executorch/examples/models/llama/export_llama.py", line 28, in main
export_llama(args)
File "/mnt/data/tmp/executorch/examples/models/llama/export_llama_lib.py", line 522, in export_llama
builder = _export_llama(args)
File "/mnt/data/tmp/executorch/examples/models/llama/export_llama_lib.py", line 801, in _export_llama
canonicalize_program(builder.edge_manager.exported_program())
File "/mnt/data/tmp/executorch/backends/qualcomm/utils/utils.py", line 281, in canonicalize_program
update_spill_fill_size(obj)
File "/mnt/data/tmp/executorch/backends/qualcomm/utils/utils.py", line 278, in update_spill_fill_size
update_program(*get_program_info(exported_program))
File "/mnt/data/tmp/executorch/backends/qualcomm/utils/utils.py", line 256, in get_program_info
return dispatch[type(program)](program)
File "/mnt/data/tmp/executorch/backends/qualcomm/utils/utils.py", line 233, in process_exported_program
assert qnn_mgr.Init().value == 0, "failed to load context binary"
AssertionError: failed to load context binary
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend

The export command is below:

python -m examples.models.llama.export_llama \
  -c "/home/ai/tb/LLM-Research/Llama-3___2-3B/original/consolidated.00.pth" \
  -p "/home/ai/tb/LLM-Research/Llama-3___2-3B/original/params.json" \
  -t /home/ai/tb/LLM-Research/Llama-3___2-3B/original/tokenizer.model \
  -kv --disable_dynamic_shape --qnn \
  --num_sharding 8 \
  --pt2e_quantize qnn_16a4w \
  -d fp32 \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="test.pte" \
  --soc_model SM8550 \
  --calibration_tasks wikitext \
  --calibration_limit 1 \
  --calibration_seq_length 128 \
  --calibration_data "<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are an helpful assistant. <|eot_id|><|start_header_id|>user<|end_header_id|> Name all the prime ministers of India. <|eot_id|><|start_header_id|>assistant<|end_header_id|>"

james-p-xu commented:

I am hitting the same issue as @tombang, with QNN version 2.29 and on the release/0.5 branch.

Unknown QNN BinaryInfo version 3.
Failed to retrieve backend binary info from QNN context binary.
Failed to parse QNN Graph Info. The cache might be broken. Please consider to re-generate the cache.
QNN context cache is invalid.
Fail to configure Qnn context
Fail to initialize Qnn Manager
Init failed for backend QnnBackend: 0x1

java.lang.Exception: Execution of method forward failed with status 0x1

I saw someone mention on a different thread that they got the Llama example working with 2.31, so I'm unsure what is happening. Would appreciate any pointers here.

kimishpatel (Contributor) commented:

@cccclai can you tag someone from Qualcomm?

cccclai (Contributor) commented Mar 26, 2025

> I am hitting the same issue as @tombang, with QNN version 2.29 and on the release/0.5 branch.
>
> Unknown QNN BinaryInfo version 3.
> Failed to retrieve backend binary info from QNN context binary.
> Failed to parse QNN Graph Info. The cache might be broken. Please consider to re-generate the cache.
> QNN context cache is invalid.
> Fail to configure Qnn context
> Fail to initialize Qnn Manager
> Init failed for backend QnnBackend: 0x1
>
> java.lang.Exception: Execution of method forward failed with status 0x1
>
> I saw someone mention on a different thread that they got the Llama example working with 2.31, so I'm unsure what is happening. Would appreciate any pointers here.

That seems like a different issue: the reported issue happens during export (ahead of time), while the issue on your side is a runtime error (I assume you have already generated the .pte file?).

cccclai (Contributor) commented Mar 26, 2025

@tombang Thank you for trying out the QNN + ExecuTorch solution. I've never tried the 3B model with ExecuTorch + QNN, but we have the stories model set up in CI and it has been running fine:

PYTHON_EXECUTABLE=python bash .ci/scripts/test_qnn_static_llama.sh

The CI uses QNN v2.28, and the NDK is likely r27b based on the docker setup for the CI:

ANDROID_NDK_VERSION=r27b

While we're trying to improve the out-of-the-box experience, do you mind trying out a simpler model to make sure the setup is correct?
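
For reference, a sanity check with a smaller model could look like the sketch below; the stories110M checkpoint and params paths are placeholders, and the flags mirror the QNN flags already used earlier in this thread rather than the exact configuration the CI script runs.

# Placeholder paths; flags mirror the ones used earlier in this thread, not the CI script's exact settings.
python -m examples.models.llama.export_llama \
  -c /path/to/stories110M.pt \
  -p /path/to/params.json \
  -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_8a8w -d fp32 \
  --metadata '{"get_bos_id":1, "get_eos_ids":[2]}' \
  --output_name="stories110m_qnn.pte" --soc_model SM8550
# The bos/eos ids above are the standard Llama 2 tokenizer values (assumed; check the params for your checkpoint).

If that .pte loads and runs on the device, the toolchain setup is likely fine and the problem is specific to the larger Llama 3.2 export.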
