🐛 Describe the bug
When using the Arm backend without delegation to Ethos-U, the PTE graph uses the quantization operators associated with the Cortex-M backend. Specifically, cortex_m_quantize_per_tensor_default and cortex_m_dequantize_per_tensor_default are used in the example softmax model, as shown in the delegation table below, produced via python -m examples.arm.aot_arm_compiler --model_name="softmax":
Delegation table:
╒════╤════════════════════════════════════════╤═══════════════════════════════════╤═══════════════════════════════════════╕
│ │ op_type │ occurrences_in_delegated_graphs │ occurrences_in_non_delegated_graphs │
╞════╪════════════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════════╡
│ 0 │ aten_exp_default │ 0 │ 1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│ 1 │ aten_mul_tensor │ 0 │ 1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│ 2 │ aten_reciprocal_default │ 0 │ 1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│ 3 │ aten_sum_dim_int_list │ 0 │ 1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│ 4 │ cortex_m_dequantize_per_tensor_default │ 0 │ 6 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│ 5 │ cortex_m_quantize_per_tensor_default │ 0 │ 5 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│ 6 │ Total │ 0 │ 15 │
╘════╧════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
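For reference, here is a minimal NumPy sketch of the affine per-tensor quantize/dequantize round-trip these ops are expected to implement; the scale and zero_point values are illustrative assumptions, not the parameters chosen by the quantizer for the generated PTE:

```python
# Sketch of affine per-tensor quantization. The scale/zero_point below
# are assumed values for illustration only.
import numpy as np

def quantize_per_tensor(x, scale, zero_point, qmin=-128, qmax=127):
    # q = clamp(round(x / scale) + zero_point, qmin, qmax)
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)

def dequantize_per_tensor(q, scale, zero_point):
    # x' = (q - zero_point) * scale
    return (q.astype(np.float32) - zero_point) * scale

scale, zero_point = 1.0 / 256, -128  # assumed int8 params for outputs in [0, 1)
x = np.array([0.5, 0.5, 0.5, 0.5], dtype=np.float32)
q = quantize_per_tensor(x, scale, zero_point)
print(dequantize_per_tensor(q, scale, zero_point))  # ~[0.5 0.5 0.5 0.5]
```

With sane quantization parameters, the round-trip should reproduce the float values to within one quantization step, so the large deviation shown below points at a bug in the op implementations or their parameters rather than ordinary quantization error.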
However, using these quantization operators produces incorrect results. To reproduce the error, ensure that ExecuTorch has been installed and that the Arm tools have been set up via:
./install_executorch.sh
./examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/ethos-u-scratch/setup_path.sh
Correct results can be seen when running without delegation or quantization:
./examples/arm/run.sh --model_name="softmax" --no_delegate --no_quantize
Produces:
I [executorch:arm_executor_runner.cpp:696 print_outputs()] 1 outputs:
Output[0][0]: (float) 0.500000
Output[0][1]: (float) 0.500000
Output[0][2]: (float) 0.500000
Output[0][3]: (float) 0.500000
However, running:
./examples/arm/run.sh --model_name="softmax"
Produces:
I [executorch:arm_executor_runner.cpp:696 print_outputs()] 1 outputs:
Output[0][0]: (float) 0.168896
Output[0][1]: (float) 0.168896
Output[0][2]: (float) 0.168896
Output[0][3]: (float) 0.168896
Running with quantization disabled avoids the error altogether:
./examples/arm/run.sh --model_name="softmax" --no_quantize
Produces:
I [executorch:arm_executor_runner.cpp:696 print_outputs()] 1 outputs:
Output[0][0]: (float) 0.500000
Output[0][1]: (float) 0.500000
Output[0][2]: (float) 0.500000
Output[0][3]: (float) 0.500000
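As a sanity check (a sketch assuming the example softmax model reduces to a plain softmax over pairs of equal inputs, which is consistent with the outputs above): softmax of n equal values is 1/n per element, so 0.5 is the correct answer and 0.168896 is not.

```python
import torch

# Softmax of equal values along a dim is uniform: 1/n per element.
x = torch.ones(2, 2)  # assumed input shape; pairs of equal values
print(torch.nn.functional.softmax(x, dim=-1))
# tensor([[0.5000, 0.5000],
#         [0.5000, 0.5000]])
```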
Running without delegation but with quantization produces a missing-operator error:
./examples/arm/run.sh --model_name="softmax" --no_delegate
Produces:
I [executorch:arm_executor_runner.cpp:948 main()] PTE in 0x70000000 Size: 4616 bytes
I [executorch:arm_executor_runner.cpp:473 runner_init()] PTE Model data loaded. Size: 4616 bytes.
I [executorch:arm_executor_runner.cpp:486 runner_init()] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:493 runner_init()] Running method forward
I [executorch:arm_executor_runner.cpp:504 runner_init()] Setup Method allocator pool. Size: 62914560 bytes.
I [executorch:arm_executor_runner.cpp:521 runner_init()] Setting up planned buffer 0, size 48.
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::exp.out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [2] aten::exp.out
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::sum.IntList_out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [3] aten::sum.IntList_out
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::reciprocal.out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [4] aten::reciprocal.out
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::mul.out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [5] aten::mul.out
E [executorch:method.cpp:1004 init()] There are 4 instructions don't have corresponding operator registered. See logs for details
I [executorch:arm_executor_runner.cpp:561 runner_init()] Loading of method forward failed with status 0x14
I [executorch:arm_executor_runner.cpp:569 runner_init()] Method 'forward' loaded.
I [executorch:arm_executor_runner.cpp:571 runner_init()] Preparing inputs...
F [executorch:result.h:170 CheckOk()] In function CheckOk(), assert failed: hasValue_
Hard fault. irq=-13, pc=0x34000000, lr=0x06800000, xpsr=0x10007c55, sp=0x3007fca0
cfsr=0x00010000 bfar=0x00000000 mmfar=0x00000000
Application exit code: 1.
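The missing-operator failures indicate the bare-metal runner was built without kernels for the float ATen ops left in the non-delegated graph (aten::exp.out, aten::sum.IntList_out, aten::reciprocal.out, aten::mul.out). As a hedged sketch of how to see which ops a graph needs before lowering (using a stand-in model, not the exact example flow):

```python
import torch

class Softmax(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.softmax(x, dim=-1)

ep = torch.export.export(Softmax(), (torch.ones(2, 2),))
# List the ATen ops in the exported graph; each needs a registered
# kernel in the runner build. Depending on the decompositions applied
# during lowering, softmax may appear as a single op or be split into
# exp/sum/reciprocal/mul, as in the delegation table above.
for node in ep.graph.nodes:
    if node.op == "call_function":
        print(node.target)
```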
Versions
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
Collecting environment information...
PyTorch version: 2.9.0.dev20250725+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39
Python version: 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 10
On-line CPU(s) list: 0-9
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) Ultra 7 165U
CPU family: 6
Model: 170
Thread(s) per core: 2
Core(s) per socket: 5
Socket(s): 1
Stepping: 4
BogoMIPS: 5375.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni vnmi umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 240 KiB (5 instances)
L1i cache: 320 KiB (5 instances)
L2 cache: 10 MiB (5 instances)
L3 cache: 12 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-9
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] executorch==0.8.0a0+0247f45
[pip3] flake8==6.1.0
[pip3] flake8-breakpoint==1.1.0
[pip3] flake8-bugbear==24.4.26
[pip3] flake8-comprehensions==3.14.0
[pip3] flake8-plugin-utils==1.3.3
[pip3] flake8-pyi==23.5.0
[pip3] mypy==1.14.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pytorch_tokenizers==0.1.0
[pip3] torch==2.9.0.dev20250725+cpu
[pip3] torchao==0.13.0+git2eb4f9762
[pip3] torchaudio==2.8.0.dev20250725+cpu
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.24.0.dev20250725+cpu
[pip3] triton==3.3.1
[conda] executorch 0.8.0a0+0247f45 pypi_0 pypi
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.6.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.6.80 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.6.77 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.5.1.17 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.0.4 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.7.77 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.1.2 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.4.2 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.85 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.6.77 pypi_0 pypi
[conda] pytorch-tokenizers 0.1.0 pypi_0 pypi
[conda] torch 2.9.0.dev20250725+cpu pypi_0 pypi
[conda] torchao 0.13.0+git2eb4f9762 pypi_0 pypi
[conda] torchaudio 2.8.0.dev20250725+cpu pypi_0 pypi
[conda] torchdata 0.11.0 pypi_0 pypi
[conda] torchfix 0.6.0 pypi_0 pypi
[conda] torchsr 1.0.4 pypi_0 pypi
[conda] torchtune 0.6.1 pypi_0 pypi
[conda] torchvision 0.24.0.dev20250725+cpu pypi_0 pypi
[conda] triton 3.3.1 pypi_0 pypi