Cortex-M Quantization Operators Produce Incorrect Results #13399

@BujSet

Description

🐛 Describe the bug

When using the Arm backend without delegation to Ethos-U, the PTE graph uses the quantization operators associated with the Cortex-M backend. Specifically, cortex_m_dequantize_per_tensor_default and cortex_m_quantize_per_tensor_default appear in the example softmax model, as shown in the delegation table below, produced via python -m examples.arm.aot_arm_compiler --model_name="softmax" (a sketch of the arithmetic these operators are expected to implement follows the table):

Delegation table:
╒════╤════════════════════════════════════════╤═══════════════════════════════════╤═══════════════════════════════════════╕
│    │ op_type                                │   occurrences_in_delegated_graphs │   occurrences_in_non_delegated_graphs │
╞════╪════════════════════════════════════════╪═══════════════════════════════════╪═══════════════════════════════════════╡
│  0 │ aten_exp_default                       │                                 0 │                                     1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│  1 │ aten_mul_tensor                        │                                 0 │                                     1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│  2 │ aten_reciprocal_default                │                                 0 │                                     1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│  3 │ aten_sum_dim_int_list                  │                                 0 │                                     1 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│  4 │ cortex_m_dequantize_per_tensor_default │                                 0 │                                     6 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│  5 │ cortex_m_quantize_per_tensor_default   │                                 0 │                                     5 │
├────┼────────────────────────────────────────┼───────────────────────────────────┼───────────────────────────────────────┤
│  6 │ Total                                  │                                 0 │                                    15 │
╘════╧════════════════════════════════════════╧═══════════════════════════════════╧═══════════════════════════════════════╛
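For reference, these operators should implement the standard affine per-tensor quantize/dequantize round trip, sketched below in eager PyTorch. The scale and zero-point values here are illustrative only and are not the ones the Arm quantizer actually selects:

import torch

# Minimal sketch of affine per-tensor quantization, matching what a
# quantize/dequantize pair like cortex_m_quantize_per_tensor_default /
# cortex_m_dequantize_per_tensor_default is expected to compute.
# scale and zero_point are illustrative, not the quantizer's actual choices.
def quantize_per_tensor(x, scale, zero_point, qmin=-128, qmax=127):
    return torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)

def dequantize_per_tensor(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

x = torch.full((4,), 0.5)
scale, zero_point = 1.0 / 255, -128
q = quantize_per_tensor(x, scale, zero_point)
print(dequantize_per_tensor(q, scale, zero_point))  # ~0.5 for every element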

However, using these quantization operators produces incorrect results. To reproduce the error, ensure that ExecuTorch has been installed and the Arm tools have been set up via:

./install_executorch.sh
./examples/arm/setup.sh --i-agree-to-the-contained-eula
source examples/arm/ethos-u-scratch/setup_path.sh

Correct results can be seen when running without delegation and without quantization:

./examples/arm/run.sh --model_name="softmax" --no_delegate --no_quantize

Produces:

I [executorch:arm_executor_runner.cpp:696 print_outputs()] 1 outputs:
Output[0][0]: (float) 0.500000
Output[0][1]: (float) 0.500000
Output[0][2]: (float) 0.500000
Output[0][3]: (float) 0.500000

However, running:

./examples/arm/run.sh --model_name="softmax"

Produces:

I [executorch:arm_executor_runner.cpp:696 print_outputs()] 1 outputs:
Output[0][0]: (float) 0.168896
Output[0][1]: (float) 0.168896
Output[0][2]: (float) 0.168896
Output[0][3]: (float) 0.168896

And running the following avoids the error by skipping quantization altogether:

./examples/arm/run.sh --model_name="softmax" --no_quantize

Produces:

I [executorch:arm_executor_runner.cpp:696 print_outputs()] 1 outputs:
Output[0][0]: (float) 0.500000
Output[0][1]: (float) 0.500000
Output[0][2]: (float) 0.500000
Output[0][3]: (float) 0.500000
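For reference, the correct values can be reproduced in eager PyTorch: all-0.5 outputs are consistent with a softmax over a size-2 dimension with equal inputs. The input shape below is an assumption based on the printed outputs, not taken from the example model's source:

import torch

# Softmax of equal inputs along a size-2 dimension yields 0.5 everywhere,
# matching the non-quantized runner output above. Input shape is assumed.
x = torch.ones(2, 2)
print(torch.softmax(x, dim=0))
# tensor([[0.5000, 0.5000],
#         [0.5000, 0.5000]])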

Running without delegation but with quantization produces a missing operator error:

./examples/arm/run.sh --model_name="softmax" --no_delegate

Produces:

I [executorch:arm_executor_runner.cpp:948 main()] PTE in 0x70000000  Size: 4616 bytes
I [executorch:arm_executor_runner.cpp:473 runner_init()] PTE Model data loaded. Size: 4616 bytes.
I [executorch:arm_executor_runner.cpp:486 runner_init()] Model buffer loaded, has 1 methods
I [executorch:arm_executor_runner.cpp:493 runner_init()] Running method forward
I [executorch:arm_executor_runner.cpp:504 runner_init()] Setup Method allocator pool. Size: 62914560 bytes.
I [executorch:arm_executor_runner.cpp:521 runner_init()] Setting up planned buffer 0, size 48.
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::exp.out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [2] aten::exp.out
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::sum.IntList_out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [3] aten::sum.IntList_out
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::reciprocal.out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [4] aten::reciprocal.out
E [executorch:operator_registry.cpp:256 get_op_function_from_registry()] kernel 'aten::mul.out' not found.
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] dtype: 6 | dim order: [
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 0,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] 1,
I [executorch:operator_registry.cpp:257 get_op_function_from_registry()] ]
E [executorch:method.cpp:749 resolve_operator()] Missing operator: [5] aten::mul.out
E [executorch:method.cpp:1004 init()] There are 4 instructions don't have corresponding operator registered. See logs for details
I [executorch:arm_executor_runner.cpp:561 runner_init()] Loading of method forward failed with status 0x14
I [executorch:arm_executor_runner.cpp:569 runner_init()] Method 'forward' loaded.
I [executorch:arm_executor_runner.cpp:571 runner_init()] Preparing inputs...
F [executorch:result.h:170 CheckOk()] In function CheckOk(), assert failed: hasValue_
Hard fault. irq=-13, pc=0x34000000, lr=0x06800000, xpsr=0x10007c55, sp=0x3007fca0
            cfsr=0x00010000 bfar=0x00000000 mmfar=0x00000000
Application exit code: 1.
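The missing kernels (aten::exp.out, aten::sum.IntList_out, aten::reciprocal.out, aten::mul.out) correspond to the non-delegated aten operators in the delegation table above, which suggests the runner build does not register portable kernels for them when delegation is disabled. As a rough way to predict which kernels a runner will need, the remaining call_function targets in the lowered graph can be listed on the AoT side. A sketch, assuming an ExportedProgram as produced in the aot_arm_compiler flow:

from torch.export import ExportedProgram

def list_runtime_ops(ep: ExportedProgram) -> None:
    # Each call_function target left in the graph needs a matching kernel
    # registered in the runtime (or must be consumed by a delegate).
    for node in ep.graph_module.graph.nodes:
        if node.op == "call_function":
            print(node.target)

# Usage (hypothetical): list_runtime_ops(edge_program_manager.exported_program())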

Versions

# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Collecting environment information...
PyTorch version: 2.9.0.dev20250725+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

Python version: 3.10.0 (default, Mar  3 2022, 09:58:08) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               10
On-line CPU(s) list:                  0-9
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) Ultra 7 165U
CPU family:                           6
Model:                                170
Thread(s) per core:                   2
Core(s) per socket:                   5
Socket(s):                            1
Stepping:                             4
BogoMIPS:                             5375.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves avx_vnni vnmi umip waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize flush_l1d arch_capabilities
Virtualization:                       VT-x
Hypervisor vendor:                    Microsoft
Virtualization type:                  full
L1d cache:                            240 KiB (5 instances)
L1i cache:                            320 KiB (5 instances)
L2 cache:                             10 MiB (5 instances)
L3 cache:                             12 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-9
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] executorch==0.8.0a0+0247f45
[pip3] flake8==6.1.0
[pip3] flake8-breakpoint==1.1.0
[pip3] flake8-bugbear==24.4.26
[pip3] flake8-comprehensions==3.14.0
[pip3] flake8-plugin-utils==1.3.3
[pip3] flake8-pyi==23.5.0
[pip3] mypy==1.14.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] pytorch_tokenizers==0.1.0
[pip3] torch==2.9.0.dev20250725+cpu
[pip3] torchao==0.13.0+git2eb4f9762
[pip3] torchaudio==2.8.0.dev20250725+cpu
[pip3] torchdata==0.11.0
[pip3] torchsr==1.0.4
[pip3] torchtune==0.6.1
[pip3] torchvision==0.24.0.dev20250725+cpu
[pip3] triton==3.3.1
[conda] executorch                0.8.0a0+0247f45          pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.6.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.6.80                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.6.77                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.6.77                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.5.1.17                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.0.4                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.7.77                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.1.2                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.4.2                 pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.3                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.26.2                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.6.85                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.6.77                  pypi_0    pypi
[conda] pytorch-tokenizers        0.1.0                    pypi_0    pypi
[conda] torch                     2.9.0.dev20250725+cpu          pypi_0    pypi
[conda] torchao                   0.13.0+git2eb4f9762          pypi_0    pypi
[conda] torchaudio                2.8.0.dev20250725+cpu          pypi_0    pypi
[conda] torchdata                 0.11.0                   pypi_0    pypi
[conda] torchfix                  0.6.0                    pypi_0    pypi
[conda] torchsr                   1.0.4                    pypi_0    pypi
[conda] torchtune                 0.6.1                    pypi_0    pypi
[conda] torchvision               0.24.0.dev20250725+cpu          pypi_0    pypi
[conda] triton                    3.3.1                    pypi_0    pypi

Metadata

Labels

module: arm (Issues related to arm backend)
module: microcontrollers (For embedded MCUs like Cortex-M, or RTOS like Zephyr; does not track NPU backends like Arm Ethos)
