From 5188a28273eb1a7f40b23e439a34fc375145a8a0 Mon Sep 17 00:00:00 2001
From: Momchil Velikov
Date: Mon, 29 Jul 2024 15:09:30 +0100
Subject: [PATCH] [fixup] Another naming scheme

---
 main/acle.md                                  | 344 +++++++++---------
 neon_intrinsics/advsimd.md                    | 290 +++++++--------
 tools/intrinsic_db/advsimd.csv                | 290 +++++++--------
 tools/intrinsic_db/advsimd_classification.csv | 286 +++++++--------
 4 files changed, 607 insertions(+), 603 deletions(-)

diff --git a/main/acle.md b/main/acle.md
index 67537ef6..26d2f010 100644
--- a/main/acle.md
+++ b/main/acle.md
@@ -747,7 +747,7 @@ The predefined types are:
 * The `__bf16` type for 16-bit brain floating-point values (see
   [Half-precision brain floating-point](#half-precision-brain-floating-point)).
-* The `__fpm8` type for the modal 8-bit floating-point values (see
+* The `__mfp8` type for the modal 8-bit floating-point values (see
   [Modal 8-bit floating point types](#modal-8-bit-floating-point)).

 ### Implementation-defined type properties
@@ -1246,7 +1246,7 @@ conversions when not implemented in hardware is implementation-defined.

 ### Modal 8-bit floating-point

-ACLE defines the `__fpm8` type, which can be used for the E5M2 and E4M3
+ACLE defines the `__mfp8` type, which can be used for the E5M2 and E4M3
 8-bit floating-point formats. It is a storage and interchange only type
 with no arithmetic operations other than intrinsic calls.

@@ -5612,10 +5612,14 @@ The bits of an argument to an `fpm` parameter are interpreted as follows:

 Bit patterns other than as described above are invalid. Passing an invalid
 value as an argument to an FP8 intrinsic results in undefined behavior.

-The ACLE declares several helper types and intrisics to
+The ACLE declares several helper types and intrinsics to
 facilitate construction of `fpm` arguments. The helper intrinsics do not
 have side effects and their return value depends only on their parameters.

+The helper types and intrinsics are available after including any of
+[`<arm_neon.h>`](#arm_neon.h), [`<arm_sve.h>`](#arm_sve.h), or
+[`<arm_sme.h>`](#arm_sme.h).
+
 Note: where a helper intrinsic description refers to "updating the FP8 mode"
 it means the intrinsic only modifies the bits of the input `fpm_t` parameter
 that correspond to the new mode and returns the resulting value. No side effects

@@ -5777,7 +5781,7 @@ names are based on the types defined in `<stdint.h>`. For example,
 `int64_t`, `uint64_t`, `float16_t`, `float32_t`, `poly8_t`,
 `poly16_t`, `poly64_t`, `poly128_t`, `bfloat16_t`. The multiples are
 such that the resulting vector types are 64-bit and 128-bit. In AArch64, `float64_t`
-and `floatm8_t` are also base types.
+and `mfloat8_t` are also base types.

 Not all types can be used in all operations. Generally, the operations
 available on a type correspond to the operations available on the

@@ -5795,7 +5799,7 @@ bfloat types are only available when the `__bf16` type is defined, i.e.
 when supported by the hardware. The bfloat types are all opaque types.
 That is to say they can only be used by intrinsics.

-FP8 types are only available when the `__fpm8` type is defined, i.e.
+FP8 types are only available when the `__mfp8` type is defined, i.e.
 when supported by the hardware. The FP8 types are all opaque types.
 That is to say they can only be used by intrinsics.

@@ -5837,7 +5841,7 @@ it.

 If the `__bf16` type is defined, `bfloat16_t` is defined as an alias for
 it.

-If the `__fpm8` type is defined, `floatm8_t` is defined as an alias for it.
+If the `__mfp8` type is defined, `mfloat8_t` is defined as an alias for it.
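As a brief illustration of the opaque-type rule above (a sketch, assuming a
FEAT_FP8 target whose compiler defines `__mfp8`, and using the
`vreinterpret_u8_mf8`/`vreinterpret_mf8_u8` intrinsics introduced by this
patch; the two wrapper function names below are hypothetical):

``` c
#include <arm_neon.h>

// mfloat8x8_t has no arithmetic operators; to inspect or build values in
// portable C, reinterpret the 8-bit lanes as raw bytes. Both conversions
// compile to a NOP: only the static type changes, not the bits.
static inline uint8x8_t mf8_bits(mfloat8x8_t v) {
    return vreinterpret_u8_mf8(v);
}

// Rebuild an FP8 vector from raw bytes, for example E4M3/E5M2 encodings
// produced by scalar software conversion.
static inline mfloat8x8_t mf8_from_bits(uint8x8_t bits) {
    return vreinterpret_mf8_u8(bits);
}
```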
`poly8_t`, `poly16_t`, `poly64_t` and `poly128_t` are defined as
unsigned integer types. It is unspecified whether these are the same type as
@@ -6600,7 +6604,7 @@ In addition, the header file defines the following scalar data types:

| `float16_t` | equivalent to `__fp16` |
| `float32_t` | equivalent to `float` |
| `float64_t` | equivalent to `double` |
-| `floatm8_t` | equivalent to `__fpm8` |
+| `mfloat8_t` | equivalent to `__mfp8` |

If the feature macro `__ARM_FEATURE_BF16_SCALAR_ARITHMETIC` is defined,
[`<arm_sve.h>`](#arm_sve.h) also includes
@@ -6615,7 +6619,7 @@ single vectors:

| **Signed integer** | **Unsigned integer** | **Floating-point** | |
| -------------------- | -------------------- | -------------------- | -------------------- |
-| `svint8_t` | `svuint8_t` | | `svfloatm8_t |
+| `svint8_t` | `svuint8_t` | | `svmfloat8_t` |
| `svint16_t` | `svuint16_t` | `svfloat16_t` | `svbfloat16_t` |
| `svint32_t` | `svuint32_t` | `svfloat32_t` | |
| `svint64_t` | `svuint64_t` | `svfloat64_t` | |
@@ -6636,7 +6640,7 @@ vectors, as follows:

| **Signed integer** | **Unsigned integer** | **Floating-point** | |
| -------------------- | -------------------- | --------------------- | -------------------- |
-| `svint8x2_t` | `svuint8x2_t` | | `svfloatm8x2_t` |
+| `svint8x2_t` | `svuint8x2_t` | | `svmfloat8x2_t` |
| `svint16x2_t` | `svuint16x2_t` | `svfloat16x2_t` | `svbfloat16x2_t` |
| `svint32x2_t` | `svuint32x2_t` | `svfloat32x2_t` | |
| `svint64x2_t` | `svuint64x2_t` | `svfloat64x2_t` | |
| | | | |
| `svint32x3_t` | `svuint32x3_t` | `svfloat32x3_t` | |
| `svint64x3_t` | `svuint64x3_t` | `svfloat64x3_t` | |
| | | | |
-| `svint8x4_t` | `svuint8x4_t` | | `svfloatm8x4_t` |
+| `svint8x4_t` | `svuint8x4_t` | | `svmfloat8x4_t` |
| `svint16x4_t` | `svuint16x4_t` | `svfloat16x4_t` | `svbfloat16x4_t` |
| `svint32x4_t` | `svuint32x4_t` | `svfloat32x4_t` | |
| `svint64x4_t` | `svuint64x4_t` | `svfloat64x4_t` | |
@@ -9039,7 +9043,7 @@ Broadcast indexed element within each quadword vector segment.

``` c
 // Variants are also available for:
 // _s8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svuint8_t svdup_laneq[_u8](svuint8_t zn, uint64_t imm_idx);
```

@@ -9050,7 +9054,7 @@ Extract vector segment from each pair of quadword segments.

``` c
 // Variants are also available for:
 // _s8, _s16, _u16, _s32, _u32, _s64, _u64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svuint8_t svextq[_u8](svuint8_t zdn, svuint8_t zm, uint64_t imm);
```
#### LD1D, LD1W

@@ -9077,7 +9081,7 @@ Gather Load Quadword.

``` c
 // Variants are also available for:
 // _u8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svint8_t svld1q_gather[_u64base]_s8(svbool_t pg, svuint64_t zn);
 svint8_t svld1q_gather[_u64base]_offset_s8(svbool_t pg, svuint64_t zn, int64_t offset);
 svint8_t svld1q_gather_[u64]offset[_s8](svbool_t pg, const int8_t *base, svuint64_t offset);
@@ -9096,7 +9100,7 @@ Contiguous load two, three or four quadword structures.

``` c
 // Variants are also available for:
 // _u8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svint8x2_t svld2q[_s8](svbool_t pg, const int8_t *rn);
 svint8x2_t svld2q_vnum[_s8](svbool_t pg, const int8_t *rn, uint64_t vnum);
 svint8x3_t svld3q[_s8](svbool_t pg, const int8_t *rn);
@@ -9171,7 +9175,7 @@ Scatter store quadwords.
``` c
 // Variants are also available for:
 // _u8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 void svst1q_scatter[_u64base][_s8](svbool_t pg, svuint64_t zn, svint8_t data);
 void svst1q_scatter[_u64base]_offset[_s8](svbool_t pg, svuint64_t zn, int64_t offset, svint8_t data);
 void svst1q_scatter_[u64]offset[_s8](svbool_t pg, int8_t *base, svuint64_t offset, svint8_t data);
@@ -9189,7 +9193,7 @@ Contiguous store.

``` c
 // Variants are also available for:
 // _s8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 void svst2q[_u8](svbool_t pg, uint8_t *rn, svuint8x2_t zt);
 void svst2q_vnum[_u8](svbool_t pg, uint8_t *rn, int64_t vnum, svuint8x2_t zt);
 void svst3q[_u8](svbool_t pg, uint8_t *rn, svuint8x3_t zt);
@@ -9205,7 +9209,7 @@ Programmable table lookup within each quadword vector segment (zeroing).

``` c
 // Variants are also available for:
 // _u8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svint8_t svtblq[_s8](svint8_t zn, svuint8_t zm);
```

@@ -9216,7 +9220,7 @@ Programmable table lookup within each quadword vector segment (merging).

``` c
 // Variants are also available for:
 // _u8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svint8_t svtbxq[_s8](svint8_t fallback, svint8_t zn, svuint8_t zm);
```

@@ -9227,7 +9231,7 @@ Concatenate elements within each pair of quadword vector segments.

``` c
 // Variants are also available for:
 // _s8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svuint8_t svuzpq1[_u8](svuint8_t zn, svuint8_t zm);
 svuint8_t svuzpq2[_u8](svuint8_t zn, svuint8_t zm);
```

@@ -9239,7 +9243,7 @@ Interleave elements from halves of each pair of quadword vector segments.

``` c
 // Variants are also available for:
 // _s8, _u16, _s16, _u32, _s32, _u64, _s64
- // _bf16, _f16, _f32, _f64, _fm8
+ // _bf16, _f16, _f32, _f64, _mf8
 svuint8_t svzipq1[_u8](svuint8_t zn, svuint8_t zm);
 svuint8_t svzipq2[_u8](svuint8_t zn, svuint8_t zm);
```

@@ -9250,68 +9254,68 @@ Interleave elements from halves of each pair of quadword vector segments.

8-bit floating-point convert to BFloat16.

``` c
- svbfloat16_t svcvt1_bf16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
- svbfloat16_t svcvt2_bf16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
+ svbfloat16_t svcvt1_bf16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
+ svbfloat16_t svcvt2_bf16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
```

#### BF1CVTLT, BF2CVTLT

8-bit floating-point convert to BFloat16 (top).

``` c
- svbfloat16_t svcvtlt1_bf16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
- svbfloat16_t svcvtlt2_bf16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
+ svbfloat16_t svcvtlt1_bf16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
+ svbfloat16_t svcvtlt2_bf16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
```

#### BFCVTN

BFloat16 convert, narrow and interleave to 8-bit floating-point.

``` c
- svfloatm8_t svcvtn_fm8[_bf16_x2]_fpm(svbfloat16x2_t zn, fpm_t fpm);
+ svmfloat8_t svcvtn_mf8[_bf16_x2]_fpm(svbfloat16x2_t zn, fpm_t fpm);
```

#### F1CVT, F2CVT

8-bit floating-point convert to half-precision.
``` c
- svfloat16_t svcvt1_f16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
- svfloat16_t svcvt2_f16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
+ svfloat16_t svcvt1_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
+ svfloat16_t svcvt2_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
```

#### F1CVTLT, F2CVTLT

8-bit floating-point convert to half-precision (top).

``` c
- svfloat16_t svcvtlt1_f16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
- svfloat16_t svcvtlt2_f16[_fm8]_fpm(svfloatm8_t zn, fpm_t fpm);
+ svfloat16_t svcvtlt1_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
+ svfloat16_t svcvtlt2_f16[_mf8]_fpm(svmfloat8_t zn, fpm_t fpm);
```

#### FCVTN

Half-precision convert, narrow and interleave to 8-bit floating-point.

``` c
- svfloatm8_t svcvtn_fm8[_f16_x2]_fpm(svfloat16x2_t zn, fpm_t fpm);
+ svmfloat8_t svcvtn_mf8[_f16_x2]_fpm(svfloat16x2_t zn, fpm_t fpm);
```

#### FCVTNT, FCVTNB

Single-precision convert, narrow and interleave to 8-bit floating-point (top and bottom).

``` c
- svfloatm8_t svcvtnt_fm8[_f32_x2]_fpm(svfloatm8_t zd, svfloat32x2_t zn, fpm_t fpm);
- svfloatm8_t svcvtnb_fm8[_f32_x2]_fpm(svfloatm8_t zd, svfloat32x2_t zn, fpm_t fpm);
+ svmfloat8_t svcvtnt_mf8[_f32_x2]_fpm(svmfloat8_t zd, svfloat32x2_t zn, fpm_t fpm);
+ svmfloat8_t svcvtnb_mf8[_f32_x2]_fpm(svmfloat8_t zd, svfloat32x2_t zn, fpm_t fpm);
```

#### FDOT (4-way, vectors)

8-bit floating-point dot product to single-precision.

``` c
- svfloat32_t svdot[_f32_fm8]_fpm(svfloat32_t zda ,svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm);
+ svfloat32_t svdot[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
```

#### FDOT (4-way, indexed)

8-bit floating-point indexed dot product to single-precision.

``` c
- svfloat32_t svdot_lane[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm,
+ svfloat32_t svdot_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                       uint64_t imm0_3, fpm_t fpm);
```

@@ -9319,14 +9323,14 @@ Single-precision convert, narrow and interleave to 8-bit floating-point (top and

8-bit floating-point dot product to half-precision.

``` c
- svfloat16_t svdot[_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm);
+ svfloat16_t svdot[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
```

#### FDOT (2-way, indexed, FP8 to FP16)

8-bit floating-point indexed dot product to half-precision.

``` c
- svfloat16_t svdot_lane[_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, svfloatm8_t zm,
+ svfloat16_t svdot_lane[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm,
                                       uint64_t imm0_7, fpm_t fpm);
```

@@ -9334,15 +9338,15 @@ Single-precision convert, narrow and interleave to 8-bit floating-point (top and

8-bit floating-point multiply-add long to half-precision (bottom).

``` c
- svfloat16_t svmlalb[_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm);
- svfloat16_t svmlalb[_n_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, floatm8_t zm, fpm_t fpm);
+ svfloat16_t svmlalb[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm);
+ svfloat16_t svmlalb[_n_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm);
```

#### FMLALB (indexed, FP8 to FP16)

8-bit floating-point multiply-add long to half-precision (bottom, indexed).
``` c - svfloat16_t svmlalb_lane[_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, svfloatm8_t zm, + svfloat16_t svmlalb_lane[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, uint64_t imm0_15, fpm_t fpm); ``` @@ -9350,15 +9354,15 @@ Single-precision convert, narrow and interleave to 8-bit floating-point (top and 8-bit floating-point multiply-add long long to single-precision (bottom bottom). ``` c - svfloat32_t svmlallbb[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm); - svfloat32_t svmlallbb[_n_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, floatm8_t zm, fpm_t fpm); + svfloat32_t svmlallbb[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm); + svfloat32_t svmlallbb[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm); ``` #### FMLALLBB (indexed) 8-bit floating-point multiply-add long long to single-precision (bottom bottom, indexed). ``` c - svfloat32_t svmlallbb_lane[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, + svfloat32_t svmlallbb_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, uint64_t imm0_15, fpm_t fpm); ``` @@ -9366,15 +9370,15 @@ Single-precision convert, narrow and interleave to 8-bit floating-point (top and 8-bit floating-point multiply-add long long to single-precision (bottom top). ``` c - svfloat32_t svmlallbt[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm); - svfloat32_t svmlallbt[_n_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, floatm8_t zm, fpm_t fpm); + svfloat32_t svmlallbt[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm); + svfloat32_t svmlallbt[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm); ``` #### FMLALLBT (indexed) 8-bit floating-point multiply-add long long to single-precision (bottom top, indexed). ``` c - svfloat32_t svmlallbt_lane[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, + svfloat32_t svmlallbt_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, uint64_t imm0_15, fpm_t fpm); ``` @@ -9382,15 +9386,15 @@ Single-precision convert, narrow and interleave to 8-bit floating-point (top and 8-bit floating-point multiply-add long long to single-precision (top bottom). ``` c - svfloat32_t svmlalltb[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm); - svfloat32_t svmlalltb[_n_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, floatm8_t zm, fpm_t fpm); + svfloat32_t svmlalltb[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm); + svfloat32_t svmlalltb[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm); ``` #### FMLALLTB (indexed) 8-bit floating-point multiply-add long long to single-precision (top bottom, indexed). ``` c - svfloat32_t svmlalltb_lane[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, + svfloat32_t svmlalltb_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, uint64_t imm0_15, fpm_t fpm); ``` @@ -9398,15 +9402,15 @@ Single-precision convert, narrow and interleave to 8-bit floating-point (top and 8-bit floating-point multiply-add long long to single-precision (top top). 
``` c - svfloat32_t svmlalltt[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm); - svfloat32_t svmlalltt[_n_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, floatm8_t zm, fpm_t fpm); + svfloat32_t svmlalltt[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm); + svfloat32_t svmlalltt[_n_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm); ``` #### FMLALLTT (indexed) 8-bit floating-point multiply-add long long to single-precision (top top, indexed). ``` c - svfloat32_t svmlalltt_lane[_f32_fm8]_fpm(svfloat32_t zda, svfloatm8_t zn, svfloatm8_t zm, + svfloat32_t svmlalltt_lane[_f32_mf8]_fpm(svfloat32_t zda, svmfloat8_t zn, svmfloat8_t zm, uint64_t imm0_15, fpm_t fpm); ``` @@ -9414,15 +9418,15 @@ Single-precision convert, narrow and interleave to 8-bit floating-point (top and 8-bit floating-point multiply-add long to half-precision (top). ```c - svfloat16_t svmlalt[_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm); - svfloat16_t svmlalt[_n_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, floatm8_t zm, fpm_t fpm); + svfloat16_t svmlalt[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm); + svfloat16_t svmlalt[_n_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, mfloat8_t zm, fpm_t fpm); ``` #### FMLALT (indexed, FP8 to FP16) 8-bit floating-point multiply-add long to half-precision (top, indexed). ```c - svfloat16_t svmlalt_lane[_f16_fm8]_fpm(svfloat16_t zda, svfloatm8_t zn, svfloatm8_t zm, + svfloat16_t svmlalt_lane[_f16_mf8]_fpm(svfloat16_t zda, svmfloat8_t zn, svmfloat8_t zm, uint64_t imm0_15, fpm_t fpm); ``` @@ -11841,33 +11845,33 @@ Zero ZT0 Lookup table read with 2-bit and 4-bit indexes ``` c - // Variants are also available for _zt_u8, _zt_fm8, _zt_s16, _zt_u16, _zt_f16, + // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8_t svluti2_lane_zt_s8(uint64_t zt, svuint8_t zn, uint64_t imm_idx) __arm_streaming __arm_in("zt0"); - // Variants are also available for _zt_u8, _zt_fm8, _zt_s16, _zt_u16, _zt_f16, + // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8x2_t svluti2_lane_zt_s8_x2(uint64_t zt, svuint8_t zn, uint64_t imm_idx) __arm_streaming __arm_in("zt0"); - // Variants are also available for _zt_u8, _zt_fm8, _zt_s16, _zt_u16, _zt_f16, + // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8x4_t svluti2_lane_zt_s8_x4(uint64_t zt, svuint8_t zn, uint64_t imm_idx) __arm_streaming __arm_in("zt0"); - // Variants are also available for _zt_u8, _zt_fm8, _zt_s16, _zt_u16, _zt_f16, + // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8_t svluti4_lane_zt_s8(uint64_t zt, svuint8_t zn, uint64_t imm_idx) __arm_streaming __arm_in("zt0"); - // Variants are also available for _zt_u8, _zt_fm8, _zt_s16, _zt_u16, _zt_f16, + // Variants are also available for _zt_u8, _zt_mf8, _zt_s16, _zt_u16, _zt_f16, // _zt_bf16, _zt_s32, _zt_u32 and _zt_f32 svint8x2_t svluti4_lane_zt_s8_x2(uint64_t zt, svuint8_t zn, uint64_t imm_idx) @@ -11886,84 +11890,84 @@ Lookup table read with 2-bit and 4-bit indexes Move multi-vectors to/from ZA ``` c - // Variants are also available for _za8_u8, _za8_fm8, _za16_s16, _za16_u16, + // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, 
_za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x2_t svread_hor_za8_s8_vg2(uint64_t tile, uint32_t slice) __arm_streaming __arm_in("za"); - // Variants are also available for _za8_u8, _za8_fm8, _za16_s16, _za16_u16, + // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x4_t svread_hor_za8_s8_vg4(uint64_t tile, uint32_t slice) __arm_streaming __arm_in("za"); - // Variants are also available for _za8_u8, _za8_fm8, _za16_s16, _za16_u16, + // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x2_t svread_ver_za8_s8_vg2(uint64_t tile, uint32_t slice) __arm_streaming __arm_in("za"); - // Variants are also available for _za8_u8, _za8_fm8, _za16_s16, _za16_u16, + // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x4_t svread_ver_za8_s8_vg4(uint64_t tile, uint32_t slice) __arm_streaming __arm_in("za"); - // Variants are also available for _za8_u8, _za8_fm8, _za16_s16, _za16_u16, + // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x2_t svread_za8_s8_vg1x2(uint32_t slice) __arm_streaming __arm_in("za"); - // Variants are also available for _za8_u8, _za8_fm8, _za16_s16, _za16_u16, + // Variants are also available for _za8_u8, _za8_mf8, _za16_s16, _za16_u16, // _za16_f16, _za16_bf16, _za32_s32, _za32_u32, _za32_f32, // _za64_s64, _za64_u64 and _za64_f64 svint8x4_t svread_za8_s8_vg1x4(uint32_t slice) __arm_streaming __arm_in("za"); - // Variants are also available for _za8[_u8], _za8[_fm8], _za16[_s16], _za16[_u16], + // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_hor_za8[_s8]_vg2(uint64_t tile, uint32_t slice, svint8x2_t zn) __arm_streaming __arm_inout("za"); - // Variants are also available for _za8[_u8], _za8[_fm8], _za16[_s16], _za16[_u16], + // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_hor_za8[_s8]_vg4(uint64_t tile, uint32_t slice, svint8x4_t zn) __arm_streaming __arm_inout("za"); - // Variants are also available for _za8[_u8], _za8[_fm8], _za16[_s16], _za16[_u16], + // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_ver_za8[_s8]_vg2(uint64_t tile, uint32_t slice, svint8x2_t zn) __arm_streaming __arm_inout("za"); - // Variants are also available for _za8[_u8], _za8[_fm8], _za16[_s16], _za16[_u16], + // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16], // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32], // _za64[_s64], _za64[_u64] and _za64[_f64] void svwrite_ver_za8[_s8]_vg4(uint64_t tile, uint32_t slice, svint8x4_t zn) __arm_streaming __arm_inout("za"); - // Variants are also available for _za8[_u8], _za8[_fm8], _za16[_s16], _za16[_u16], + // Variants are 
also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
 // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
 // _za64[_s64], _za64[_u64] and _za64[_f64]
 void svwrite_za8[_s8]_vg1x2(uint32_t slice, svint8x2_t zn)
   __arm_streaming __arm_inout("za");

- // Variants are also available for _za8[_u8], za8[_fm8], _za16[_s16], _za16[_u16],
+ // Variants are also available for _za8[_u8], _za8[_mf8], _za16[_s16], _za16[_u16],
 // _za16[_f16], _za16[_bf16], _za32[_s32], _za32[_u32], _za32[_f32],
 // _za64[_s64], _za64[_u64] and _za64[_f64]
 void svwrite_za8[_s8]_vg1x4(uint32_t slice, svint8x4_t zn)
@@ -11997,13 +12001,13 @@ Multi-vector clamp to minimum/maximum vector

Multi-vector conditionally select elements from two vectors

``` c
- // Variants are also available for _s8_x2, _fm8_x2, _u16_x2, _s16_x2, _f16_x2,
+ // Variants are also available for _s8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
 // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
 svuint8x2_t svsel[_u8_x2](svcount_t png, svuint8x2_t zn, svuint8x2_t zm)
   __arm_streaming;

- // Variants are also available for _s8_x4, _fm8_x4, _u16_x4, _s16_x4, _f16_x4,
+ // Variants are also available for _s8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
 // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
 svuint8x4_t svsel[_u8_x4](svcount_t png, svuint8x4_t zn, svuint8x4_t zm)
   __arm_streaming;
@@ -12153,12 +12157,12 @@ Multi-vector pack/unpack

Multi-vector zip.

``` c
- // Variants are also available for _u8_x2, _fm8_x2, _u16_x2, _s16_x2, _f16_x2,
+ // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
 // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
 svint8x2_t svzip[_s8_x2](svint8x2_t zn) __arm_streaming;

- // Variants are also available for _u8_x4, _fm8_x4, _u16_x4, _s16_x4, _f16_x4,
+ // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
 // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
 svint8x4_t svzip[_s8_x4](svint8x4_t zn) __arm_streaming;
```
@@ -12168,12 +12172,12 @@ element types.

``` c
- // Variants are also available for _u8_x2, _fm8_x2, _u16_x2, _s16_x2, _f16_x2,
+ // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
 // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
 svint8x2_t svzipq[_s8_x2](svint8x2_t zn) __arm_streaming;

- // Variants are also available for _u8_x4, _fm8_x4, _u16_x4, _s16_x4, _f16_x4,
+ // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
 // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
 svint8x4_t svzipq[_s8_x4](svint8x4_t zn) __arm_streaming;
```
@@ -12183,12 +12187,12 @@ element types.

Multi-vector unzip.

``` c
- // Variants are also available for _u8_x2, _fm8_x2, _u16_x2, _s16_x2, _f16_x2,
+ // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2,
 // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2
 svint8x2_t svuzp[_s8_x2](svint8x2_t zn) __arm_streaming;

- // Variants are also available for _u8_x4, _fm8_x4, _u16_x4, _s16_x4, _f16_x4,
+ // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4,
 // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4
 svint8x4_t svuzp[_s8_x4](svint8x4_t zn) __arm_streaming;
```
@@ -12197,12 +12201,12 @@ The `svuzpq` intrinsics operate on quad-words, but for convenience accept all
element types.
``` c - // Variants are also available for _u8_x2, _fm8_x2, _u16_x2, _s16_x2, _f16_x2, + // Variants are also available for _u8_x2, _mf8_x2, _u16_x2, _s16_x2, _f16_x2, // _bf16_x2, _u32_x2, _s32_x2, _f32_x2, _u64_x2, _s64_x2 and _f64_x2 svint8x2_t svuzpq[_s8_x2](svint8x2_t zn) __arm_streaming; - // Variants are also available for _u8_x4, _fm8_x4, _u16_x4, _s16_x4, _f16_x4, + // Variants are also available for _u8_x4, _mf8_x4, _u16_x4, _s16_x4, _f16_x4, // _bf16_x4, _u32_x4, _s32_x4, _f32_x4, _u64_x4, _s64_x4 and _f64_x4 svint8x4_t svuzpq[_s8_x4](svint8x4_t zn) __arm_streaming; ``` @@ -12292,20 +12296,20 @@ Multi-vector dot-product (2-way) Contiguous load to multi-vector ``` c - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x2_t svld1[_u8]_x2(svcount_t png, const uint8_t *rn); - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x4_t svld1[_u8]_x4(svcount_t png, const uint8_t *rn); - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x2_t svld1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn, int64_t vnum); - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x4_t svld1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn, int64_t vnum); @@ -12369,20 +12373,20 @@ Contiguous load to multi-vector Contiguous non-temporal load to multi-vector ``` c - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x2_t svldnt1[_u8]_x2(svcount_t png, const uint8_t *rn); - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x4_t svldnt1[_u8]_x4(svcount_t png, const uint8_t *rn); - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x2_t svldnt1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn, int64_t vnum); - // Variants are also available for _s8, _fm8 + // Variants are also available for _s8, _mf8 svuint8x4_t svldnt1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn, int64_t vnum); @@ -12506,19 +12510,19 @@ Reverse doublewords in elements. 
// All the intrinsics below are [SME] // Variants are available for: // _s8, _s16, _u16, _s32, _u32, _s64, _u64 - // _bf16, _f16, _f32, _f64, _fm8 + // _bf16, _f16, _f32, _f64, _mf8 svuint8_t svrevd[_u8]_m(svuint8_t zd, svbool_t pg, svuint8_t zn); // Variants are available for: // _s8, _s16, _u16, _s32, _u32, _s64, _u64 - // _bf16, _f16, _f32, _f64, _fm8 + // _bf16, _f16, _f32, _f64, _mf8 svuint8_t svrevd[_u8]_z(svbool_t pg, svuint8_t zn); // Variants are available for: // _s8, _s16, _u16, _s32, _u32, _s64, _u64 - // _bf16, _f16, _f32, _f64, _fm8 + // _bf16, _f16, _f32, _f64, _mf8 svuint8_t svrevd[_u8]_x(svbool_t pg, svuint8_t zn); ``` @@ -12553,20 +12557,20 @@ Multi-vector saturating rounding shift right unsigned narrow and interleave Contiguous store of multi-vector operand ``` c - // Variants are also available for _s8_x2, _fm8_x2 + // Variants are also available for _s8_x2, _mf8_x2 void svst1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt); - // Variants are also available for _s8_x4, _fm8_x4 + // Variants are also available for _s8_x4, _mf8_x4 void svst1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt); - // Variants are also available for _s8_x2, _fm8_x2 + // Variants are also available for _s8_x2, _mf8_x2 void svst1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum, svuint8x2_t zt); - // Variants are also available for _s8_x4, _fm8_x4 + // Variants are also available for _s8_x4, _mf8_x4 void svst1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum, svuint8x4_t zt); @@ -12630,20 +12634,20 @@ Contiguous store of multi-vector operand Contiguous non-temporal store of multi-vector operand ``` c - // Variants are also available for _s8_x2, _fm8_x2 + // Variants are also available for _s8_x2, _mf8_x2 void svstnt1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt); - // Variants are also available for _s8_x4, _fm8_x4 + // Variants are also available for _s8_x4, _mf8_x4 void svstnt1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt); - // Variants are also available for _s8_x2, _fm8_x2 + // Variants are also available for _s8_x2, _mf8_x2 void svstnt1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum, svuint8x2_t zt); - // Variants are also available for _s8_x4, _fm8_x4 + // Variants are also available for _s8_x4, _mf8_x4 void svstnt1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum, svuint8x4_t zt); @@ -12760,33 +12764,33 @@ While (resulting in predicate tuple) 8-bit floating-point convert to half-precision or BFloat16. ``` c - // Variant is also available for: _bf16[_fm8]_x2 - svfloat16x2_t svcvt1_f16[_fm8]_x2_fpm(svfloatm8_t zn, fpm_t fpm) __arm_streaming; - svfloat16x2_t svcvt2_f16[_fm8]_x2_fpm(svfloatm8_t zn, fpm_t fpm) __arm_streaming; + // Variant is also available for: _bf16[_mf8]_x2 + svfloat16x2_t svcvt1_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming; + svfloat16x2_t svcvt2_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming; ``` #### F1CVTL, F2CVTL 8-bit floating-point convert to deinterleaved half-precision or BFloat16. ``` c - // Variant is also available for: _bf16[_fm8]_x2 - svfloat16x2_t svcvtl1_f16[_fm8]_x2_fpm(svfloatm8_t zn, fpm_t fpm) __arm_streaming; - svfloat16x2_t svcvtl2_f16[_fm8]_x2_fpm(svfloatm8_t zn, fpm_t fpm) __arm_streaming; + // Variant is also available for: _bf16[_mf8]_x2 + svfloat16x2_t svcvtl1_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming; + svfloat16x2_t svcvtl2_f16[_mf8]_x2_fpm(svmfloat8_t zn, fpm_t fpm) __arm_streaming; ``` #### FCVT Convert to packed 8-bit floating-point format. 
``` c - // Variants are also available for: _fm8[_bf16_x2] and _fm8[_f32_x4] - svfloatm8_t svcvt_fm8[_f16_x2]_fpm(svfloat16x2_t zn, fpm_t fpm) __arm_streaming; + // Variants are also available for: _mf8[_bf16_x2] and _mf8[_f32_x4] + svmfloat8_t svcvt_mf8[_f16_x2]_fpm(svfloat16x2_t zn, fpm_t fpm) __arm_streaming; ``` #### FCVTN Convert to interleaved 8-bit floating-point format. ``` c - svfloatm8_t svcvtn_fm8[_f32_x4]_fpm(svfloat32x4_t zn, fpm_t fpm) __arm_streaming; + svmfloat8_t svcvtn_mf8[_f32_x4]_fpm(svfloat32x4_t zn, fpm_t fpm) __arm_streaming; ``` #### FSCALE @@ -12807,8 +12811,8 @@ Convert to interleaved 8-bit floating-point format. Multi-vector 8-bit floating-point vertical dot-product by indexed element to half-precision. ``` c - void svvdot_lane_za16[_fm8]_vg1x2_fpm(uint32_t slice, svfloatm8x2_t zn, - svfloatm8_t zm, uint64_t imm_idx, + void svvdot_lane_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn, + svmfloat8_t zm, uint64_t imm_idx, fpm_t fpm) __arm_streaming __arm_inout("za"); ``` @@ -12816,26 +12820,26 @@ half-precision. Multi-vector 8-bit floating-point dot-product. ``` c - void svdot_lane_za16[_fm8]_vg1x2_fpm(uint32_t slice, svfloatm8x2_t zn, - svfloatm8_t zm, uint64_t imm_idx, + void svdot_lane_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn, + svmfloat8_t zm, uint64_t imm_idx, fpm_t fpm) __arm_streaming __arm_inout("za"); - void svdot_lane_za16[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x4_t zn, - svfloatm8_t zm, uint64_t imm_idx, + void svdot_lane_za16[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn, + svmfloat8_t zm, uint64_t imm_idx, fpm_t fpm) __arm_streaming __arm_inout("za"); - void svdot[_single]_za16[_fm8]_vg1x2_fpm(uint32_t slice, svfloatm8x2_t zn, - svfloatm8_t zm, fpm_t fpm) + void svdot[_single]_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn, + svmfloat8_t zm, fpm_t fpm) __arm_streaming __arm_inout("za"); - void svdot[_single]_za16[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x4_t zn, - svfloatm8_t zm, fpm_t fpm) + void svdot[_single]_za16[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn, + svmfloat8_t zm, fpm_t fpm) __arm_streaming __arm_inout("za"); - void svdot_za16[_fm8]_vg1x2_fpm(uint32_t slice, svfloatm8x2_t zn, svfloatm8x2_t zm, + void svdot_za16[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm, fpm_t fpm) __arm_streaming __arm_inout("za"); - void svdot_za16[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x4_t zn, svfloatm8x4_t zm, + void svdot_za16[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm, fpm_t fpm) __arm_streaming __arm_inout("za"); ``` @@ -12843,34 +12847,34 @@ Multi-vector 8-bit floating-point dot-product. Multi-vector 8-bit floating-point multiply-add long. 
``` c
- void svmla_lane_za16[_fm8]_vg2x1_fpm(uint32_t slice, svfloatm8_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx,
+ void svmla_lane_za16[_mf8]_vg2x1_fpm(uint32_t slice, svmfloat8_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla_lane_za16[_fm8]_vg2x2_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx,
+ void svmla_lane_za16[_mf8]_vg2x2_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla_lane_za16[_fm8]_vg2x4_fpm(uint32_t slice, svfloatm8x4_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx
+ void svmla_lane_za16[_mf8]_vg2x4_fpm(uint32_t slice, svmfloat8x4_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla[_single]_za16[_fm8]_vg2x1_fpm(uint32_t slice, svfloatm8_t zn,
-                                          svfloatm8_t zm, fpm_t fpm)
+ void svmla[_single]_za16[_mf8]_vg2x1_fpm(uint32_t slice, svmfloat8_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svmla[_single]_za16[_fm8]_vg2x2_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                          svfloatm8_t zm, fpm_t fpm)
+ void svmla[_single]_za16[_mf8]_vg2x2_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svmla[_single]_za16[_fm8]_vg2x4_fpm(uint32_t slice, svfloatm8x4_t zn,
-                                          svfloatm8_t zm, fpm_t fpm)
+ void svmla[_single]_za16[_mf8]_vg2x4_fpm(uint32_t slice, svmfloat8x4_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svmla_za16[_fm8]_vg2x2_fpm(uint32_t slice, svfloatm8x2_t zn, svfloatm8x2_t zm,
+ void svmla_za16[_mf8]_vg2x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla_za16[_fm8]_vg2x4_fpm(uint32_t slice, svfloatm8x4_t zn, svfloatm8x4_t zm,
+ void svmla_za16[_mf8]_vg2x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");
```

@@ -12878,34 +12882,34 @@

8-bit floating-point sum of outer products and accumulate.

``` c
- void svmopa_za16[_fm8]_m_fpm(uint64_t tile, svbool_t pn, svbool_t pm,
-                              svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm) __arm_streaming __arm_inout("za");
+ void svmopa_za16[_mf8]_m_fpm(uint64_t tile, svbool_t pn, svbool_t pm,
+                              svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm) __arm_streaming __arm_inout("za");
```

#### FDOT

Multi-vector 8-bit floating-point dot-product.
``` c
- void svdot_lane_za32[_fm8]_vg1x2_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx,
+ void svdot_lane_za32[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svdot_lane_za32[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x4_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx,
+ void svdot_lane_za32[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svdot[_single]_za32[_fm8]_vg1x2_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                          svfloatm8_t zm, int64_t fpmr)
+ void svdot[_single]_za32[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svdot[_single]_za32[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x4_t zn,
-                                          svfloatm8_t zm, int64_t fpmr)
+ void svdot[_single]_za32[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svdot_za32[_fm8]_vg1x2_fpm(uint32_t slice, svfloatm8x2_t zn, svfloatm8x2_t zm,
+ void svdot_za32[_mf8]_vg1x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

- void svdot_za32[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x4_t zn, svfloatm8x4_t zm,
+ void svdot_za32[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");
```

@@ -12913,12 +12917,12 @@

Multi-vector 8-bit floating-point vertical dot-product.

``` c
- void svvdott_lane_za32[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                        svfloatm8_t zm, uint64_t imm_idx,
+ void svvdott_lane_za32[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                        svmfloat8_t zm, uint64_t imm_idx,
                                         fpm_t fpm) __arm_streaming __arm_inout("za");

- void svvdotb_lane_za32[_fm8]_vg1x4_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                        svfloatm8_t zm, uint64_t imm_idx,
+ void svvdotb_lane_za32[_mf8]_vg1x4_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                        svmfloat8_t zm, uint64_t imm_idx,
                                         fpm_t fpm) __arm_streaming __arm_inout("za");
```

@@ -12926,34 +12930,34 @@

Multi-vector 8-bit floating-point multiply-add long.
``` c
- void svmla_lane_za32[_fm8]_vg4x1_fpm(uint32_t slice, svfloatm8_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx,
+ void svmla_lane_za32[_mf8]_vg4x1_fpm(uint32_t slice, svmfloat8_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla_lane_za32[_fm8]_vg4x2_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx,
+ void svmla_lane_za32[_mf8]_vg4x2_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla_lane_za32[_fm8]_vg4x4_fpm(uint32_t slice, svfloatm8x4_t zn,
-                                      svfloatm8_t zm, uint64_t imm_idx,
+ void svmla_lane_za32[_mf8]_vg4x4_fpm(uint32_t slice, svmfloat8x4_t zn,
+                                      svmfloat8_t zm, uint64_t imm_idx,
                                       fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla[_single]_za32[_fm8]_vg4x1_fpm(uint32_t slice, svfloatm8_t zn,
-                                          svfloatm8_t zm, fpm_t fpm)
+ void svmla[_single]_za32[_mf8]_vg4x1_fpm(uint32_t slice, svmfloat8_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svmla[_single]_za32[_fm8]_vg4x2_fpm(uint32_t slice, svfloatm8x2_t zn,
-                                          svfloatm8_t zm, fpm_t fpm)
+ void svmla[_single]_za32[_mf8]_vg4x2_fpm(uint32_t slice, svmfloat8x2_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svmla[_single]_za32[_fm8]_vg4x4_fpm(uint32_t slice, svfloatm8x4_t zn,
-                                          svfloatm8_t zm, fpm_t fpm)
+ void svmla[_single]_za32[_mf8]_vg4x4_fpm(uint32_t slice, svmfloat8x4_t zn,
+                                          svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");

- void svmla_za32[_fm8]_vg4x2_fpm(uint32_t slice, svfloatm8x2_t zn, svfloatm8x2_t zm,
+ void svmla_za32[_mf8]_vg4x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");

- void svmla_za32[_fm8]_vg4x4_fpm(uint32_t slice, svfloatm8x4_t zn, svfloatm8x4_t zm,
+ void svmla_za32[_mf8]_vg4x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                  fpm_t fpm) __arm_streaming __arm_inout("za");
```

@@ -12961,8 +12965,8 @@

8-bit floating-point sum of outer products and accumulate.

``` c
- void svmopa_za32[_fm8]_m_fpm(uint64_t tile, svbool_t pn, svbool_t pm,
-                              svfloatm8_t zn, svfloatm8_t zm, fpm_t fpm)
+ void svmopa_za32[_mf8]_m_fpm(uint64_t tile, svbool_t pn, svbool_t pm,
+                              svmfloat8_t zn, svmfloat8_t zm, fpm_t fpm)
   __arm_streaming __arm_inout("za");
```

@@ -13448,7 +13452,7 @@ additional instructions.

| `svfloat32_t svset_neonq[_f32](svfloat32_t vec, float32x4_t subvec)` |
| `svfloat64_t svset_neonq[_f64](svfloat64_t vec, float64x2_t subvec)` |
| `svbfloat16_t svset_neonq[_bf16](svbfloat16_t vec, bfloat16x8_t subvec)` |
-| `svfloatm8_t svset_neonq[_fm8](svfloatm8_t vec, floatm8x16_t subvec)` |
+| `svmfloat8_t svset_neonq[_mf8](svmfloat8_t vec, mfloat8x16_t subvec)` |

### `svget_neonq`

@@ -13469,7 +13473,7 @@ NEON vector.

| `float32x4_t svget_neonq[_f32](svfloat32_t vec)` |
| `float64x2_t svget_neonq[_f64](svfloat64_t vec)` |
| `bfloat16x8_t svget_neonq[_bf16](svbfloat16_t vec)` |
-| `floatm8x16_t svget_neonq[_fm8](svfloatm8_t vec)` |
+| `mfloat8x16_t svget_neonq[_mf8](svmfloat8_t vec)` |

### `svdup_neonq`

@@ -13490,7 +13494,7 @@ duplicated NEON vector `vec`.
| `svfloat32_t svdup_neonq[_f32](float32x4_t vec)` | | `svfloat64_t svdup_neonq[_f64](float64x2_t vec)` | | `svbfloat16_t svdup_neonq[_bf16](bfloat16x8_t vec)` | -| `svfloatm8_t svdup_neonq[_fm8](floatm8x16_t vec)` | +| `svmfloat8_t svdup_neonq[_mf8](mfloat8x16_t vec)` | # Future directions diff --git a/neon_intrinsics/advsimd.md b/neon_intrinsics/advsimd.md index 891a0abf..9ea728a7 100644 --- a/neon_intrinsics/advsimd.md +++ b/neon_intrinsics/advsimd.md @@ -2501,10 +2501,10 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | int64x2_t vreinterpretq_s64_p128(poly128_t a) | `a -> Vd.1Q` | `NOP` | `Vd.2D -> result` | `A32/A64` | | float64x2_t vreinterpretq_f64_p128(poly128_t a) | `a -> Vd.1Q` | `NOP` | `Vd.2D -> result` | `A64` | | float16x8_t vreinterpretq_f16_p128(poly128_t a) | `a -> Vd.1Q` | `NOP` | `Vd.8H -> result` | `A32/A64` | -| floatm8x8_t vreinterpret_fm8_u8(uint8x8_t a) | `a -> Vd.8B` | `NOP` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vreinterpretq_fm8_u8(uint8x16_t a) | `a -> Vd.16B` | `NOP` | `Vd.16B -> result` | `A64` | -| uint8x8_t vreinterpret_u8_fm8(floatm8x8_t a) | `a -> Vd.8B` | `NOP` | `Vd.8B -> result` | `A64` | -| uint8x16_t vreinterpretq_u8_fm8(floatm8x16_t a) | `a -> Vd.16B` | `NOP` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vreinterpret_mf8_u8(uint8x8_t a) | `a -> Vd.8B` | `NOP` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vreinterpretq_mf8_u8(uint8x16_t a) | `a -> Vd.16B` | `NOP` | `Vd.16B -> result` | `A64` | +| uint8x8_t vreinterpret_u8_mf8(mfloat8x8_t a) | `a -> Vd.8B` | `NOP` | `Vd.8B -> result` | `A64` | +| uint8x16_t vreinterpretq_u8_mf8(mfloat8x16_t a) | `a -> Vd.16B` | `NOP` | `Vd.16B -> result` | `A64` | ### Move @@ -3057,8 +3057,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vcopyq_lane_p8(
     poly8x16_t a,
     const int lane1,
     poly8x8_t b,
     const int lane2)
| `a -> Vd.16B`
`0 <= lane1 <= 15`
`b -> Vn.8B`
`0 <= lane2 <= 7` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.16B -> result` | `A64` | | poly16x4_t vcopy_lane_p16(
     poly16x4_t a,
     const int lane1,
     poly16x4_t b,
     const int lane2)
| `a -> Vd.4H`
`0 <= lane1 <= 3`
`b -> Vn.4H`
`0 <= lane2 <= 3` | `INS Vd.H[lane1],Vn.H[lane2]` | `Vd.4H -> result` | `A64` | | poly16x8_t vcopyq_lane_p16(
     poly16x8_t a,
     const int lane1,
     poly16x4_t b,
     const int lane2)
| `a -> Vd.8H`
`0 <= lane1 <= 7`
`b -> Vn.4H`
`0 <= lane2 <= 3` | `INS Vd.H[lane1],Vn.H[lane2]` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vcopy_lane_fm8(
     floatm8x8_t a,
     const int lane1,
     floatm8x8_t b,
     const int lane2)
| `a -> Vd.8B`
`0 <= lane1 <= 7`
`b -> Vn.8B`
`0 <= lane2 <= 7` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vcopyq_lane_fm8(
     floatm8x16_t a,
     const int lane1,
     floatm8x8_t b,
     const int lane2)
| `a -> Vd.16B`
`0 <= lane1 <= 15`
`b -> Vn.8B`
`0 <= lane2 <= 7` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vcopy_lane_mf8(
     mfloat8x8_t a,
     const int lane1,
     mfloat8x8_t b,
     const int lane2)
| `a -> Vd.8B`
`0 <= lane1 <= 7`
`b -> Vn.8B`
`0 <= lane2 <= 7` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vcopyq_lane_mf8(
     mfloat8x16_t a,
     const int lane1,
     mfloat8x8_t b,
     const int lane2)
| `a -> Vd.16B`
`0 <= lane1 <= 15`
`b -> Vn.8B`
`0 <= lane2 <= 7` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.16B -> result` | `A64` | | int8x8_t vcopy_laneq_s8(
     int8x8_t a,
     const int lane1,
     int8x16_t b,
     const int lane2)
| `a -> Vd.8B`
`0 <= lane1 <= 7`
`b -> Vn.16B`
`0 <= lane2 <= 15` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.8B -> result` | `A64` | | int8x16_t vcopyq_laneq_s8(
     int8x16_t a,
     const int lane1,
     int8x16_t b,
     const int lane2)
| `a -> Vd.16B`
`0 <= lane1 <= 15`
`b -> Vn.16B`
`0 <= lane2 <= 15` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.16B -> result` | `A64` | | int16x4_t vcopy_laneq_s16(
     int16x4_t a,
     const int lane1,
     int16x8_t b,
     const int lane2)
| `a -> Vd.4H`
`0 <= lane1 <= 3`
`b -> Vn.8H`
`0 <= lane2 <= 7` | `INS Vd.H[lane1],Vn.H[lane2]` | `Vd.4H -> result` | `A64` | @@ -3085,8 +3085,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vcopyq_laneq_p8(
     poly8x16_t a,
     const int lane1,
     poly8x16_t b,
     const int lane2)
| `a -> Vd.16B`
`0 <= lane1 <= 15`
`b -> Vn.16B`
`0 <= lane2 <= 15` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.16B -> result` | `A64` | | poly16x4_t vcopy_laneq_p16(
     poly16x4_t a,
     const int lane1,
     poly16x8_t b,
     const int lane2)
| `a -> Vd.4H`
`0 <= lane1 <= 3`
`b -> Vn.8H`
`0 <= lane2 <= 7` | `INS Vd.H[lane1],Vn.H[lane2]` | `Vd.4H -> result` | `A64` | | poly16x8_t vcopyq_laneq_p16(
     poly16x8_t a,
     const int lane1,
     poly16x8_t b,
     const int lane2)
| `a -> Vd.8H`
`0 <= lane1 <= 7`
`b -> Vn.8H`
`0 <= lane2 <= 7` | `INS Vd.H[lane1],Vn.H[lane2]` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vcopy_laneq_fm8(
     floatm8x8_t a,
     const int lane1,
     floatm8x16_t b,
     const int lane2)
| `a -> Vd.8B`
`0 <= lane1 <= 7`
`b -> Vn.16B`
`0 <= lane2 <= 15` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vcopyq_laneq_fm8(
     floatm8x16_t a,
     const int lane1,
     floatm8x16_t b,
     const int lane2)
| `a -> Vd.16B`
`0 <= lane1 <= 15`
`b -> Vn.16B`
`0 <= lane2 <= 15` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vcopy_laneq_mf8(
     mfloat8x8_t a,
     const int lane1,
     mfloat8x16_t b,
     const int lane2)
| `a -> Vd.8B`
`0 <= lane1 <= 7`
`b -> Vn.16B`
`0 <= lane2 <= 15` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vcopyq_laneq_mf8(
     mfloat8x16_t a,
     const int lane1,
     mfloat8x16_t b,
     const int lane2)
| `a -> Vd.16B`
`0 <= lane1 <= 15`
`b -> Vn.16B`
`0 <= lane2 <= 15` | `INS Vd.B[lane1],Vn.B[lane2]` | `Vd.16B -> result` | `A64` |

#### Reverse bits within elements

@@ -3117,7 +3117,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.
| poly8x8_t vcreate_p8(uint64_t a) | `a -> Xn` | `INS Vd.D[0],Xn` | `Vd.8B -> result` | `v7/A32/A64` |
| poly16x4_t vcreate_p16(uint64_t a) | `a -> Xn` | `INS Vd.D[0],Xn` | `Vd.4H -> result` | `v7/A32/A64` |
| float64x1_t vcreate_f64(uint64_t a) | `a -> Xn` | `INS Vd.D[0],Xn` | `Vd.1D -> result` | `A64` |
-| floatm8x8_t vcreate_fm8(uint64_t a) | `a -> Xn` | `INS Vd.D[0],Xn` | `Vd.8B -> result` | `v7/A32/A64` |
+| mfloat8x8_t vcreate_mf8(uint64_t a) | `a -> Xn` | `INS Vd.D[0],Xn` | `Vd.8B -> result` | `A64` |

#### Set all lanes to the same value

@@ -3149,8 +3149,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.
| poly16x8_t vdupq_n_p16(poly16_t value) | `value -> rn` | `DUP Vd.8H,rn` | `Vd.8H -> result` | `v7/A32/A64` |
| float64x1_t vdup_n_f64(float64_t value) | `value -> rn` | `INS Dd.D[0],xn` | `Vd.1D -> result` | `A64` |
| float64x2_t vdupq_n_f64(float64_t value) | `value -> rn` | `DUP Vd.2D,rn` | `Vd.2D -> result` | `A64` |
-| floatm8x8_t vdup_n_fm8(floatm8_t value) | `value -> rn` | `DUP Vd.8B,rn` | `Vd.8B -> result` | `A64` |
-| floatm8x16_t vdupq_n_fm8(floatm8_t value) | `value -> rn` | `DUP Vd.16B,rn` | `Vd.16B -> result` | `A64` |
+| mfloat8x8_t vdup_n_mf8(mfloat8_t value) | `value -> rn` | `DUP Vd.8B,rn` | `Vd.8B -> result` | `A64` |
+| mfloat8x16_t vdupq_n_mf8(mfloat8_t value) | `value -> rn` | `DUP Vd.16B,rn` | `Vd.16B -> result` | `A64` |
| int8x8_t vmov_n_s8(int8_t value) | `value -> rn` | `DUP Vd.8B,rn` | `Vd.8B -> result` | `v7/A32/A64` |
| int8x16_t vmovq_n_s8(int8_t value) | `value -> rn` | `DUP Vd.16B,rn` | `Vd.16B -> result` | `v7/A32/A64` |
| int16x4_t vmov_n_s16(int16_t value) | `value -> rn` | `DUP Vd.4H,rn` | `Vd.4H -> result` | `v7/A32/A64` |
@@ -3175,8 +3175,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.
| poly16x8_t vmovq_n_p16(poly16_t value) | `value -> rn` | `DUP Vd.8H,rn` | `Vd.8H -> result` | `v7/A32/A64` |
| float64x1_t vmov_n_f64(float64_t value) | `value -> rn` | `DUP Vd.1D,rn` | `Vd.1D -> result` | `A64` |
| float64x2_t vmovq_n_f64(float64_t value) | `value -> rn` | `DUP Vd.2D,rn` | `Vd.2D -> result` | `A64` |
-| floatm8x8_t vmov_n_fm8(floatm8_t value) | `value -> rn` | `DUP Vd.8B,rn` | `Vd.8B -> result` | `A64` |
-| floatm8x16_t vmovq_n_fm8(floatm8_t value) | `value -> rn` | `DUP Vd.16B,rn` | `Vd.16B -> result` | `A64` |
+| mfloat8x8_t vmov_n_mf8(mfloat8_t value) | `value -> rn` | `DUP Vd.8B,rn` | `Vd.8B -> result` | `A64` |
+| mfloat8x16_t vmovq_n_mf8(mfloat8_t value) | `value -> rn` | `DUP Vd.16B,rn` | `Vd.16B -> result` | `A64` |
| int8x8_t vdup_lane_s8(<br>
     int8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Vd.8B,Vn.B[lane]` | `Vd.8B -> result` | `v7/A32/A64` | | int8x16_t vdupq_lane_s8(
     int8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Vd.16B,Vn.B[lane]` | `Vd.16B -> result` | `v7/A32/A64` | | int16x4_t vdup_lane_s16(
     int16x4_t vec,
     const int lane)
| `vec -> Vn.4H`
`0 <= lane <= 3` | `DUP Vd.4H,Vn.H[lane]` | `Vd.4H -> result` | `v7/A32/A64` | @@ -3203,8 +3203,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly16x8_t vdupq_lane_p16(
     poly16x4_t vec,
     const int lane)
| `vec -> Vn.4H`
`0 <= lane <= 3` | `DUP Vd.8H,Vn.H[lane]` | `Vd.8H -> result` | `v7/A32/A64` | | float64x1_t vdup_lane_f64(
     float64x1_t vec,
     const int lane)
| `vec -> Vn.1D`
`0 <= lane <= 0` | `DUP Dd,Vn.D[lane]` | `Dd -> result` | `A64` | | float64x2_t vdupq_lane_f64(
     float64x1_t vec,
     const int lane)
| `vec -> Vn.1D`
`0 <= lane <= 0` | `DUP Vd.2D,Vn.D[lane]` | `Vd.2D -> result` | `A64` | -| floatm8x8_t vdup_lane_fm8(
     floatm8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Vd.8B,Vn.B[lane]` | `Vd.8B -> result` | `/A64` | -| floatm8x16_t vdupq_lane_fm8(
     floatm8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Vd.16B,Vn.B[lane]` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vdup_lane_mf8(
     mfloat8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Vd.8B,Vn.B[lane]` | `Vd.8B -> result` | `A64` |
+| mfloat8x16_t vdupq_lane_mf8(<br>
     mfloat8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Vd.16B,Vn.B[lane]` | `Vd.16B -> result` | `A64` | | int8x8_t vdup_laneq_s8(
     int8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Vd.8B,Vn.B[lane]` | `Vd.8B -> result` | `A64` | | int8x16_t vdupq_laneq_s8(
     int8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Vd.16B,Vn.B[lane]` | `Vd.16B -> result` | `A64` | | int16x4_t vdup_laneq_s16(
     int16x8_t vec,
     const int lane)
| `vec -> Vn.8H`
`0 <= lane <= 7` | `DUP Vd.4H,Vn.H[lane]` | `Vd.4H -> result` | `A64` | @@ -3231,8 +3231,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly16x8_t vdupq_laneq_p16(
     poly16x8_t vec,
     const int lane)
| `vec -> Vn.8H`
`0 <= lane <= 7` | `DUP Vd.8H,Vn.H[lane]` | `Vd.8H -> result` | `A64` | | float64x1_t vdup_laneq_f64(
     float64x2_t vec,
     const int lane)
| `vec -> Vn.2D`
`0 <= lane <= 1` | `DUP Dd,Vn.D[lane]` | `Dd -> result` | `A64` | | float64x2_t vdupq_laneq_f64(
     float64x2_t vec,
     const int lane)
| `vec -> Vn.2D`
`0 <= lane <= 1` | `DUP Vd.2D,Vn.D[lane]` | `Vd.2D -> result` | `A64` | -| floatm8x8_t vdup_laneq_fm8(
     floatm8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Vd.8B,Vn.B[lane]` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vdupq_laneq_fm8(
     floatm8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Vd.16B,Vn.B[lane]` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vdup_laneq_mf8(
     mfloat8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Vd.8B,Vn.B[lane]` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vdupq_laneq_mf8(
     mfloat8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Vd.16B,Vn.B[lane]` | `Vd.16B -> result` | `A64` | #### Combine vectors @@ -3252,7 +3252,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vcombine_p8(
     poly8x8_t low,
     poly8x8_t high)
| `low -> Vn.8B`
`high -> Vm.8B` | `DUP Vd.1D,Vn.D[0]`
`INS Vd.D[1],Vm.D[0]` | `Vd.16B -> result` | `v7/A32/A64` | | poly16x8_t vcombine_p16(
     poly16x4_t low,
     poly16x4_t high)
| `low -> Vn.4H`
`high -> Vm.4H` | `DUP Vd.1D,Vn.D[0]`
`INS Vd.D[1],Vm.D[0]` | `Vd.8H -> result` | `v7/A32/A64` | | float64x2_t vcombine_f64(
     float64x1_t low,
     float64x1_t high)
| `low -> Vn.1D`
`high -> Vm.1D` | `DUP Vd.1D,Vn.D[0]`
`INS Vd.D[1],Vm.D[0]` | `Vd.2D -> result` | `A64` | -| floatm8x16_t vcombine_fm8(
     floatm8x8_t low,
     floatm8x8_t high)
| `low -> Vn.8B`
`high -> Vm.8B` | `DUP Vd.1D,Vn.D[0]`
`INS Vd.D[1],Vm.D[0]` | `Vd.16B -> result` | `A64` | +| mfloat8x16_t vcombine_mf8(
     mfloat8x8_t low,
     mfloat8x8_t high)
| `low -> Vn.8B`
`high -> Vm.8B` | `DUP Vd.1D,Vn.D[0]`
`INS Vd.D[1],Vm.D[0]` | `Vd.16B -> result` | `A64` | #### Split vectors @@ -3272,7 +3272,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x8_t vget_high_p8(poly8x16_t a) | `a -> Vn.16B` | `DUP Vd.1D,Vn.D[1]` | `Vd.8B -> result` | `v7/A32/A64` | | poly16x4_t vget_high_p16(poly16x8_t a) | `a -> Vn.8H` | `DUP Vd.1D,Vn.D[1]` | `Vd.4H -> result` | `v7/A32/A64` | | float64x1_t vget_high_f64(float64x2_t a) | `a -> Vn.2D` | `DUP Vd.1D,Vn.D[1]` | `Vd.1D -> result` | `A64` | -| floatm8x8_t vget_high_fm8(floatm8x16_t a) | `a -> Vn.16B` | `DUP Vd.1D,Vn.D[1]` | `Vd.8B -> result` | `A64` | +| mfloat8x8_t vget_high_mf8(mfloat8x16_t a) | `a -> Vn.16B` | `DUP Vd.1D,Vn.D[1]` | `Vd.8B -> result` | `A64` | | int8x8_t vget_low_s8(int8x16_t a) | `a -> Vn.16B` | `DUP Vd.1D,Vn.D[0]` | `Vd.8B -> result` | `v7/A32/A64` | | int16x4_t vget_low_s16(int16x8_t a) | `a -> Vn.8H` | `DUP Vd.1D,Vn.D[0]` | `Vd.4H -> result` | `v7/A32/A64` | | int32x2_t vget_low_s32(int32x4_t a) | `a -> Vn.4S` | `DUP Vd.1D,Vn.D[0]` | `Vd.2S -> result` | `v7/A32/A64` | @@ -3287,7 +3287,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x8_t vget_low_p8(poly8x16_t a) | `a -> Vn.16B` | `DUP Vd.1D,Vn.D[0]` | `Vd.8B -> result` | `v7/A32/A64` | | poly16x4_t vget_low_p16(poly16x8_t a) | `a -> Vn.8H` | `DUP Vd.1D,Vn.D[0]` | `Vd.4H -> result` | `v7/A32/A64` | | float64x1_t vget_low_f64(float64x2_t a) | `a -> Vn.2D` | `DUP Vd.1D,Vn.D[0]` | `Vd.1D -> result` | `A64` | -| floatm8x8_t vget_low_fm8(floatm8x16_t a) | `a -> Vn.16B` | `DUP Vd.1D,Vn.D[0]` | `Vd.8B -> result` | `A64` | +| mfloat8x8_t vget_low_mf8(mfloat8x16_t a) | `a -> Vn.16B` | `DUP Vd.1D,Vn.D[0]` | `Vd.8B -> result` | `A64` | #### Extract one element from vector @@ -3305,7 +3305,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | float64_t vdupd_lane_f64(
     float64x1_t vec,
     const int lane)
| `vec -> Vn.1D`
`0 <= lane <= 0` | `DUP Dd,Vn.D[lane]` | `Dd -> result` | `A64` | | poly8_t vdupb_lane_p8(
     poly8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Bd,Vn.B[lane]` | `Bd -> result` | `A64` | | poly16_t vduph_lane_p16(
     poly16x4_t vec,
     const int lane)
| `vec -> Vn.4H`
`0 <= lane <= 3` | `DUP Hd,Vn.H[lane]` | `Hd -> result` | `A64` | -| floatm8_t vdupb_lane_fm8(
     floatm8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Bd,Vn.B[lane]` | `Bd -> result` | `A64` | +| mfloat8_t vdupb_lane_mf8(
     mfloat8x8_t vec,
     const int lane)
| `vec -> Vn.8B`
`0 <= lane <= 7` | `DUP Bd,Vn.B[lane]` | `Bd -> result` | `A64` | | int8_t vdupb_laneq_s8(
     int8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Bd,Vn.B[lane]` | `Bd -> result` | `A64` | | int16_t vduph_laneq_s16(
     int16x8_t vec,
     const int lane)
| `vec -> Vn.8H`
`0 <= lane <= 7` | `DUP Hd,Vn.H[lane]` | `Hd -> result` | `A64` | | int32_t vdups_laneq_s32(
     int32x4_t vec,
     const int lane)
| `vec -> Vn.4S`
`0 <= lane <= 3` | `DUP Sd,Vn.S[lane]` | `Sd -> result` | `A64` | @@ -3318,7 +3318,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | float64_t vdupd_laneq_f64(
     float64x2_t vec,
     const int lane)
| `vec -> Vn.2D`
`0 <= lane <= 1` | `DUP Dd,Vn.D[lane]` | `Dd -> result` | `A64` | | poly8_t vdupb_laneq_p8(
     poly8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Bd,Vn.B[lane]` | `Bd -> result` | `A64` | | poly16_t vduph_laneq_p16(
     poly16x8_t vec,
     const int lane)
| `vec -> Vn.8H`
`0 <= lane <= 7` | `DUP Hd,Vn.H[lane]` | `Hd -> result` | `A64` | -| floatm8_t vdupb_laneq_fm8(
     floatm8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Bd,Vn.B[lane]` | `Bd -> result` | `A64` | +| mfloat8_t vdupb_laneq_mf8(
     mfloat8x16_t vec,
     const int lane)
| `vec -> Vn.16B`
`0 <= lane <= 15` | `DUP Bd,Vn.B[lane]` | `Bd -> result` | `A64` | | uint8_t vget_lane_u8(
     uint8x8_t v,
     const int lane)
| `0<=lane<=7`
`v -> Vn.8B` | `UMOV Rd,Vn.B[lane]` | `Rd -> result` | `v7/A32/A64` | | uint16_t vget_lane_u16(
     uint16x4_t v,
     const int lane)
| `0<=lane<=3`
`v -> Vn.4H` | `UMOV Rd,Vn.H[lane]` | `Rd -> result` | `v7/A32/A64` | | uint32_t vget_lane_u32(
     uint32x2_t v,
     const int lane)
| `0<=lane<=1`
`v -> Vn.2S` | `UMOV Rd,Vn.S[lane]` | `Rd -> result` | `v7/A32/A64` |
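A short sketch of pulling a single FP8 element out to a scalar with the `_mf8` form above, again assuming FP8 support is enabled:

``` c
#include <arm_neon.h>

// Read element 7 of a 128-bit FP8 vector as a scalar.
// The lane index must be a compile-time constant in [0, 15].
mfloat8_t element7(mfloat8x16_t v) {
    return vdupb_laneq_mf8(v, 7);    // DUP Bd,Vn.B[7]
}
```

@@ -3378,8 +3378,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.

| poly8x16_t vextq_p8(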
     poly8x16_t a,
     poly8x16_t b,
     const int n)
| `a -> Vn.16B`
`b -> Vm.16B`
`0 <= n <= 15` | `EXT Vd.16B,Vn.16B,Vm.16B,#n` | `Vd.16B -> result` | `v7/A32/A64` | | poly16x4_t vext_p16(
     poly16x4_t a,
     poly16x4_t b,
     const int n)
| `a -> Vn.8B`
`b -> Vm.8B`
`0 <= n <= 3` | `EXT Vd.8B,Vn.8B,Vm.8B,#(n<<1)` | `Vd.8B -> result` | `v7/A32/A64` | | poly16x8_t vextq_p16(
     poly16x8_t a,
     poly16x8_t b,
     const int n)
| `a -> Vn.16B`
`b -> Vm.16B`
`0 <= n <= 7` | `EXT Vd.16B,Vn.16B,Vm.16B,#(n<<1)` | `Vd.16B -> result` | `v7/A32/A64` | -| floatm8x8_t vext_fm8(
     floatm8x8_t a,
     floatm8x8_t b,
     const int n)
| `a -> Vn.8B`
`b -> Vm.8B`
`0 <= n <= 7` | `EXT Vd.8B,Vn.8B,Vm.8B,#n` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vextq_fm8(
     floatm8x16_t a,
     floatm8x16_t b,
     const int n)
| `a -> Vn.16B`
`b -> Vm.16B`
`0 <= n <= 15` | `EXT Vd.16B,Vn.16B,Vm.16B,#n` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vext_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b,
     const int n)
| `a -> Vn.8B`
`b -> Vm.8B`
`0 <= n <= 7` | `EXT Vd.8B,Vn.8B,Vm.8B,#n` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vextq_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b,
     const int n)
| `a -> Vn.16B`
`b -> Vm.16B`
`0 <= n <= 15` | `EXT Vd.16B,Vn.16B,Vm.16B,#n` | `Vd.16B -> result` | `A64` | #### Reverse elements @@ -3403,8 +3403,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vrev64q_p8(poly8x16_t vec) | `vec -> Vn.16B` | `REV64 Vd.16B,Vn.16B` | `Vd.16B -> result` | `v7/A32/A64` | | poly16x4_t vrev64_p16(poly16x4_t vec) | `vec -> Vn.4H` | `REV64 Vd.4H,Vn.4H` | `Vd.4H -> result` | `v7/A32/A64` | | poly16x8_t vrev64q_p16(poly16x8_t vec) | `vec -> Vn.8H` | `REV64 Vd.8H,Vn.8H` | `Vd.8H -> result` | `v7/A32/A64` | -| floatm8x8_t vrev64_fm8(floatm8x8_t vec) | `vec -> Vn.8B` | `REV64 Vd.8B,Vn.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vrev64q_fm8(floatm8x16_t vec) | `vec -> Vn.16B` | `REV64 Vd.16B,Vn.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vrev64_mf8(mfloat8x8_t vec) | `vec -> Vn.8B` | `REV64 Vd.8B,Vn.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vrev64q_mf8(mfloat8x16_t vec) | `vec -> Vn.16B` | `REV64 Vd.16B,Vn.16B` | `Vd.16B -> result` | `A64` | | int8x8_t vrev32_s8(int8x8_t vec) | `vec -> Vn.8B` | `REV32 Vd.8B,Vn.8B` | `Vd.8B -> result` | `v7/A32/A64` | | int8x16_t vrev32q_s8(int8x16_t vec) | `vec -> Vn.16B` | `REV32 Vd.16B,Vn.16B` | `Vd.16B -> result` | `v7/A32/A64` | | int16x4_t vrev32_s16(int16x4_t vec) | `vec -> Vn.4H` | `REV32 Vd.4H,Vn.4H` | `Vd.4H -> result` | `v7/A32/A64` | @@ -3417,16 +3417,16 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vrev32q_p8(poly8x16_t vec) | `vec -> Vn.16B` | `REV32 Vd.16B,Vn.16B` | `Vd.16B -> result` | `v7/A32/A64` | | poly16x4_t vrev32_p16(poly16x4_t vec) | `vec -> Vn.4H` | `REV32 Vd.4H,Vn.4H` | `Vd.4H -> result` | `v7/A32/A64` | | poly16x8_t vrev32q_p16(poly16x8_t vec) | `vec -> Vn.8H` | `REV32 Vd.8H,Vn.8H` | `Vd.8H -> result` | `v7/A32/A64` | -| floatm8x8_t vrev32_fm8(floatm8x8_t vec) | `vec -> Vn.8B` | `REV32 Vd.8B,Vn.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vrev32q_fm8(floatm8x16_t vec) | `vec -> Vn.16B` | `REV32 Vd.16B,Vn.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vrev32_mf8(mfloat8x8_t vec) | `vec -> Vn.8B` | `REV32 Vd.8B,Vn.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vrev32q_mf8(mfloat8x16_t vec) | `vec -> Vn.16B` | `REV32 Vd.16B,Vn.16B` | `Vd.16B -> result` | `A64` | | int8x8_t vrev16_s8(int8x8_t vec) | `vec -> Vn.8B` | `REV16 Vd.8B,Vn.8B` | `Vd.8B -> result` | `v7/A32/A64` | | int8x16_t vrev16q_s8(int8x16_t vec) | `vec -> Vn.16B` | `REV16 Vd.16B,Vn.16B` | `Vd.16B -> result` | `v7/A32/A64` | | uint8x8_t vrev16_u8(uint8x8_t vec) | `vec -> Vn.8B` | `REV16 Vd.8B,Vn.8B` | `Vd.8B -> result` | `v7/A32/A64` | | uint8x16_t vrev16q_u8(uint8x16_t vec) | `vec -> Vn.16B` | `REV16 Vd.16B,Vn.16B` | `Vd.16B -> result` | `v7/A32/A64` | | poly8x8_t vrev16_p8(poly8x8_t vec) | `vec -> Vn.8B` | `REV16 Vd.8B,Vn.8B` | `Vd.8B -> result` | `v7/A32/A64` | | poly8x16_t vrev16q_p8(poly8x16_t vec) | `vec -> Vn.16B` | `REV16 Vd.16B,Vn.16B` | `Vd.16B -> result` | `v7/A32/A64` | -| floatm8x8_t vrev16_fm8(floatm8x8_t vec) | `vec -> Vn.8B` | `REV16 Vd.8B,Vn.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vrev16q_fm8(floatm8x16_t vec) | `vec -> Vn.16B` | `REV16 Vd.16B,Vn.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vrev16_mf8(mfloat8x8_t vec) | `vec -> Vn.8B` | `REV16 Vd.8B,Vn.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vrev16q_mf8(mfloat8x16_t vec) | `vec -> Vn.16B` | `REV16 Vd.16B,Vn.16B` | `Vd.16B -> result` | `A64` | #### Zip elements @@ -3454,8 +3454,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vzip1q_p8(
     poly8x16_t a,
     poly8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `ZIP1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | poly16x4_t vzip1_p16(
     poly16x4_t a,
     poly16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `ZIP1 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | | poly16x8_t vzip1q_p16(
     poly16x8_t a,
     poly16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `ZIP1 Vd.8H,Vn.8H,Vm.8H` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vzip1_fm8(
     floatm8x8_t a,
     floatm8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `ZIP1 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vzip1q_fm8(
     floatm8x16_t a,
     floatm8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `ZIP1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vzip1_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `ZIP1 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vzip1q_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `ZIP1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int8x8_t vzip2_s8(
     int8x8_t a,
     int8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `ZIP2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | | int8x16_t vzip2q_s8(
     int8x16_t a,
     int8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `ZIP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int16x4_t vzip2_s16(
     int16x4_t a,
     int16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `ZIP2 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | @@ -3478,8 +3478,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vzip2q_p8(
     poly8x16_t a,
     poly8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `ZIP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | poly16x4_t vzip2_p16(
     poly16x4_t a,
     poly16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `ZIP2 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | | poly16x8_t vzip2q_p16(
     poly16x8_t a,
     poly16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `ZIP2 Vd.8H,Vn.8H,Vm.8H` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vzip2_fm8(
     floatm8x8_t a,
     floatm8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `ZIP2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vzip2q_fm8(
     floatm8x16_t a,
     floatm8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `ZIP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vzip2_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `ZIP2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vzip2q_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `ZIP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int8x8x2_t vzip_s8(
     int8x8_t a,
     int8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `ZIP1 Vd1.8B,Vn.8B,Vm.8B`
`ZIP2 Vd2.8B,Vn.8B,Vm.8B` | `Vd1.8B -> result.val[0]`
`Vd2.8B -> result.val[1]` | `v7/A32/A64` | | int16x4x2_t vzip_s16(
     int16x4_t a,
     int16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `ZIP1 Vd1.4H,Vn.4H,Vm.4H`
`ZIP2 Vd2.4H,Vn.4H,Vm.4H` | `Vd1.4H -> result.val[0]`
`Vd2.4H -> result.val[1]` | `v7/A32/A64` | | uint8x8x2_t vzip_u8(
     uint8x8_t a,
     uint8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `ZIP1 Vd1.8B,Vn.8B,Vm.8B`
`ZIP2 Vd2.8B,Vn.8B,Vm.8B` | `Vd1.8B -> result.val[0]`
`Vd2.8B -> result.val[1]` | `v7/A32/A64` |
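A minimal sketch of the renamed ZIP forms, assuming an FP8-enabled target:

``` c
#include <arm_neon.h>

// Interleave two FP8 vectors: ZIP1 yields the interleaved low halves,
// ZIP2 the interleaved high halves.
void interleave_mf8(mfloat8x16_t a, mfloat8x16_t b,
                    mfloat8x16_t *lo, mfloat8x16_t *hi) {
    *lo = vzip1q_mf8(a, b);          // ZIP1 Vd.16B,Vn.16B,Vm.16B
    *hi = vzip2q_mf8(a, b);          // ZIP2 Vd.16B,Vn.16B,Vm.16B
}
```

@@ -3525,8 +3525,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.

| poly8x16_t vuzp1q_p8(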
     poly8x16_t a,
     poly8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `UZP1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | poly16x4_t vuzp1_p16(
     poly16x4_t a,
     poly16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `UZP1 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | | poly16x8_t vuzp1q_p16(
     poly16x8_t a,
     poly16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `UZP1 Vd.8H,Vn.8H,Vm.8H` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vuzp1_fm8(
     floatm8x8_t a,
     floatm8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `UZP1 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vuzp1q_fm8(
     floatm8x16_t a,
     floatm8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `UZP1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vuzp1_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `UZP1 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vuzp1q_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `UZP1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int8x8_t vuzp2_s8(
     int8x8_t a,
     int8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `UZP2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | | int8x16_t vuzp2q_s8(
     int8x16_t a,
     int8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `UZP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int16x4_t vuzp2_s16(
     int16x4_t a,
     int16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `UZP2 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | @@ -3549,8 +3549,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vuzp2q_p8(
     poly8x16_t a,
     poly8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `UZP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | poly16x4_t vuzp2_p16(
     poly16x4_t a,
     poly16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `UZP2 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | | poly16x8_t vuzp2q_p16(
     poly16x8_t a,
     poly16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `UZP2 Vd.8H,Vn.8H,Vm.8H` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vuzp2_fm8(
     floatm8x8_t a,
     floatm8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `UZP2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vuzp2q_fm8(
     floatm8x16_t a,
     floatm8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `UZP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vuzp2_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `UZP2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vuzp2q_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `UZP2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int8x8x2_t vuzp_s8(
     int8x8_t a,
     int8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `UZP1 Vd1.8B,Vn.8B,Vm.8B`
`UZP2 Vd2.8B,Vn.8B,Vm.8B` | `Vd1.8B -> result.val[0]`
`Vd2.8B -> result.val[1]` | `v7/A32/A64` | | int16x4x2_t vuzp_s16(
     int16x4_t a,
     int16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `UZP1 Vd1.4H,Vn.4H,Vm.4H`
`UZP2 Vd2.4H,Vn.4H,Vm.4H` | `Vd1.4H -> result.val[0]`
`Vd2.4H -> result.val[1]` | `v7/A32/A64` | | int32x2x2_t vuzp_s32(
     int32x2_t a,
     int32x2_t b)
| `a -> Vn.2S`
`b -> Vm.2S` | `UZP1 Vd1.2S,Vn.2S,Vm.2S`
`UZP2 Vd2.2S,Vn.2S,Vm.2S` | `Vd1.2S -> result.val[0]`
`Vd2.2S -> result.val[1]` | `v7/A32/A64` |
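The UZP forms invert the ZIP sketch above; again a minimal example, assuming an FP8-enabled target:

``` c
#include <arm_neon.h>

// De-interleave: UZP1 gathers the even-indexed elements of the
// concatenation of a and b, UZP2 the odd-indexed ones.
void deinterleave_mf8(mfloat8x16_t a, mfloat8x16_t b,
                      mfloat8x16_t *even, mfloat8x16_t *odd) {
    *even = vuzp1q_mf8(a, b);        // UZP1 Vd.16B,Vn.16B,Vm.16B
    *odd  = vuzp2q_mf8(a, b);        // UZP2 Vd.16B,Vn.16B,Vm.16B
}
```

@@ -3596,8 +3596,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.

| poly8x16_t vtrn1q_p8(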
     poly8x16_t a,
     poly8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | poly16x4_t vtrn1_p16(
     poly16x4_t a,
     poly16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `TRN1 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | | poly16x8_t vtrn1q_p16(
     poly16x8_t a,
     poly16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `TRN1 Vd.8H,Vn.8H,Vm.8H` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vtrn1_fm8(
     floatm8x8_t a,
     floatm8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN1 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vtrn1q_fm8(
     floatm8x16_t a,
     floatm8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vtrn1_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN1 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vtrn1q_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN1 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int8x8_t vtrn2_s8(
     int8x8_t a,
     int8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | | int8x16_t vtrn2q_s8(
     int8x16_t a,
     int8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int16x4_t vtrn2_s16(
     int16x4_t a,
     int16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `TRN2 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | @@ -3620,8 +3620,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly8x16_t vtrn2q_p8(
     poly8x16_t a,
     poly8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | poly16x4_t vtrn2_p16(
     poly16x4_t a,
     poly16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `TRN2 Vd.4H,Vn.4H,Vm.4H` | `Vd.4H -> result` | `A64` | | poly16x8_t vtrn2q_p16(
     poly16x8_t a,
     poly16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `TRN2 Vd.8H,Vn.8H,Vm.8H` | `Vd.8H -> result` | `A64` | -| floatm8x8_t vtrn2_fm8(
     floatm8x8_t a,
     floatm8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vtrn2q_fm8(
     floatm8x16_t a,
     floatm8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vtrn2_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN2 Vd.8B,Vn.8B,Vm.8B` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vtrn2q_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN2 Vd.16B,Vn.16B,Vm.16B` | `Vd.16B -> result` | `A64` | | int8x8x2_t vtrn_s8(
     int8x8_t a,
     int8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN1 Vd1.8B,Vn.8B,Vm.8B`
`TRN2 Vd2.8B,Vn.8B,Vm.8B` | `Vd1.8B -> result.val[0]`
`Vd2.8B -> result.val[1]` | `v7/A32/A64` | | int16x4x2_t vtrn_s16(
     int16x4_t a,
     int16x4_t b)
| `a -> Vn.4H`
`b -> Vm.4H` | `TRN1 Vd1.4H,Vn.4H,Vm.4H`
`TRN2 Vd2.4H,Vn.4H,Vm.4H` | `Vd1.4H -> result.val[0]`
`Vd2.4H -> result.val[1]` | `v7/A32/A64` | | uint8x8x2_t vtrn_u8(
     uint8x8_t a,
     uint8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN1 Vd1.8B,Vn.8B,Vm.8B`
`TRN2 Vd2.8B,Vn.8B,Vm.8B` | `Vd1.8B -> result.val[0]`
`Vd2.8B -> result.val[1]` | `v7/A32/A64` | @@ -3631,7 +3631,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | int32x2x2_t vtrn_s32(
     int32x2_t a,
     int32x2_t b)
| `a -> Vn.2S`
`b -> Vm.2S` | `TRN1 Vd1.2S,Vn.2S,Vm.2S`
`TRN2 Vd2.2S,Vn.2S,Vm.2S` | `Vd1.2S -> result.val[0]`
`Vd2.2S -> result.val[1]` | `v7/A32/A64` | | float32x2x2_t vtrn_f32(
     float32x2_t a,
     float32x2_t b)
| `a -> Vn.2S`
`b -> Vm.2S` | `TRN1 Vd1.2S,Vn.2S,Vm.2S`
`TRN2 Vd2.2S,Vn.2S,Vm.2S` | `Vd1.2S -> result.val[0]`
`Vd2.2S -> result.val[1]` | `v7/A32/A64` | | uint32x2x2_t vtrn_u32(
     uint32x2_t a,
     uint32x2_t b)
| `a -> Vn.2S`
`b -> Vm.2S` | `TRN1 Vd1.2S,Vn.2S,Vm.2S`
`TRN2 Vd2.2S,Vn.2S,Vm.2S` | `Vd1.2S -> result.val[0]`
`Vd2.2S -> result.val[1]` | `v7/A32/A64` | -| floatm8x8x2_t vtrn_fm8(
     floatm8x8_t a,
     floatm8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN1 Vd1.8B,Vn.8B,Vm.8B`
`TRN2 Vd2.8B,Vn.8B,Vm.8B` | `Vd1.8B -> result.val[0]`
`Vd2.8B -> result.val[1]` | `A64` | +| mfloat8x8x2_t vtrn_mf8(
     mfloat8x8_t a,
     mfloat8x8_t b)
| `a -> Vn.8B`
`b -> Vm.8B` | `TRN1 Vd1.8B,Vn.8B,Vm.8B`
`TRN2 Vd2.8B,Vn.8B,Vm.8B` | `Vd1.8B -> result.val[0]`
`Vd2.8B -> result.val[1]` | `A64` | | int8x16x2_t vtrnq_s8(
     int8x16_t a,
     int8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN1 Vd1.16B,Vn.16B,Vm.16B`
`TRN2 Vd2.16B,Vn.16B,Vm.16B` | `Vd1.16B -> result.val[0]`
`Vd2.16B -> result.val[1]` | `v7/A32/A64` | | int16x8x2_t vtrnq_s16(
     int16x8_t a,
     int16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `TRN1 Vd1.8H,Vn.8H,Vm.8H`
`TRN2 Vd2.8H,Vn.8H,Vm.8H` | `Vd1.8H -> result.val[0]`
`Vd2.8H -> result.val[1]` | `v7/A32/A64` | | int32x4x2_t vtrnq_s32(
     int32x4_t a,
     int32x4_t b)
| `a -> Vn.4S`
`b -> Vm.4S` | `TRN1 Vd1.4S,Vn.4S,Vm.4S`
`TRN2 Vd2.4S,Vn.4S,Vm.4S` | `Vd1.4S -> result.val[0]`
`Vd2.4S -> result.val[1]` | `v7/A32/A64` | @@ -3641,7 +3641,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | uint32x4x2_t vtrnq_u32(
     uint32x4_t a,
     uint32x4_t b)
| `a -> Vn.4S`
`b -> Vm.4S` | `TRN1 Vd1.4S,Vn.4S,Vm.4S`
`TRN2 Vd2.4S,Vn.4S,Vm.4S` | `Vd1.4S -> result.val[0]`
`Vd2.4S -> result.val[1]` | `v7/A32/A64` | | poly8x16x2_t vtrnq_p8(
     poly8x16_t a,
     poly8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN1 Vd1.16B,Vn.16B,Vm.16B`
`TRN2 Vd2.16B,Vn.16B,Vm.16B` | `Vd1.16B -> result.val[0]`
`Vd2.16B -> result.val[1]` | `v7/A32/A64` | | poly16x8x2_t vtrnq_p16(
     poly16x8_t a,
     poly16x8_t b)
| `a -> Vn.8H`
`b -> Vm.8H` | `TRN1 Vd1.8H,Vn.8H,Vm.8H`
`TRN2 Vd2.8H,Vn.8H,Vm.8H` | `Vd1.8H -> result.val[0]`
`Vd2.8H -> result.val[1]` | `v7/A32/A64` | -| floatm8x16x2_t vtrnq_fm8(
     floatm8x16_t a,
     floatm8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN1 Vd1.16B,Vn.16B,Vm.16B`
`TRN2 Vd2.16B,Vn.16B,Vm.16B` | `Vd1.16B -> result.val[0]`
`Vd2.16B -> result.val[1]` | `A64` | +| mfloat8x16x2_t vtrnq_mf8(
     mfloat8x16_t a,
     mfloat8x16_t b)
| `a -> Vn.16B`
`b -> Vm.16B` | `TRN1 Vd1.16B,Vn.16B,Vm.16B`
`TRN2 Vd2.16B,Vn.16B,Vm.16B` | `Vd1.16B -> result.val[0]`
`Vd2.16B -> result.val[1]` | `A64` |
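A sketch of the combined TRN form, assuming an FP8-enabled target:

``` c
#include <arm_neon.h>

// Transpose 2x2 blocks of FP8 elements; the x2 result wraps the
// TRN1/TRN2 pair in a single call.
mfloat8x16x2_t transpose_mf8(mfloat8x16_t a, mfloat8x16_t b) {
    return vtrnq_mf8(a, b);          // TRN1 Vd1 + TRN2 Vd2
}
```

#### Set vector lane

@@ -3662,7 +3662,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.

| float16x8_t vsetq_lane_f16(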
     float16_t a,
     float16x8_t v,
     const int lane)
| `0<=lane<=7`
`a -> VnH`
`v -> Vd.8H` | `MOV Vd.H[lane],Vn.H[0]` | `Vd.8H -> result` | `v7/A32/A64` | | float32x2_t vset_lane_f32(
     float32_t a,
     float32x2_t v,
     const int lane)
| `0<=lane<=1`
`a -> Rn`
`v -> Vd.2S` | `MOV Vd.S[lane],Rn` | `Vd.2S -> result` | `v7/A32/A64` | | float64x1_t vset_lane_f64(
     float64_t a,
     float64x1_t v,
     const int lane)
| `lane==0`
`a -> Rn`
`v -> Vd.1D` | `MOV Vd.D[lane],Rn` | `Vd.1D -> result` | `A64` | -| floatm8x8_t vset_lane_fm8(
     floatm8_t a,
     floatm8x8_t v,
     const int lane)
| `0<=lane<=7`
`a -> Rn`
`v -> Vd.8B` | `MOV Vd.B[lane],Rn` | `Vd.8B -> result` | `A64` | +| mfloat8x8_t vset_lane_mf8(
     mfloat8_t a,
     mfloat8x8_t v,
     const int lane)
| `0<=lane<=7`
`a -> Rn`
`v -> Vd.8B` | `MOV Vd.B[lane],Rn` | `Vd.8B -> result` | `A64` | | uint8x16_t vsetq_lane_u8(
     uint8_t a,
     uint8x16_t v,
     const int lane)
| `0<=lane<=15`
`a -> Rn`
`v -> Vd.16B` | `MOV Vd.B[lane],Rn` | `Vd.16B -> result` | `v7/A32/A64` | | uint16x8_t vsetq_lane_u16(
     uint16_t a,
     uint16x8_t v,
     const int lane)
| `0<=lane<=7`
`a -> Rn`
`v -> Vd.8H` | `MOV Vd.H[lane],Rn` | `Vd.8H -> result` | `v7/A32/A64` | | uint32x4_t vsetq_lane_u32(
     uint32_t a,
     uint32x4_t v,
     const int lane)
| `0<=lane<=3`
`a -> Rn`
`v -> Vd.4S` | `MOV Vd.S[lane],Rn` | `Vd.4S -> result` | `v7/A32/A64` | @@ -3676,7 +3676,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly16x8_t vsetq_lane_p16(
     poly16_t a,
     poly16x8_t v,
     const int lane)
| `0<=lane<=7`
`a -> Rn`
`v -> Vd.8H` | `MOV Vd.H[lane],Rn` | `Vd.8H -> result` | `v7/A32/A64` | | float32x4_t vsetq_lane_f32(
     float32_t a,
     float32x4_t v,
     const int lane)
| `0<=lane<=3`
`a -> Rn`
`v -> Vd.4S` | `MOV Vd.S[lane],Rn` | `Vd.4S -> result` | `v7/A32/A64` | | float64x2_t vsetq_lane_f64(
     float64_t a,
     float64x2_t v,
     const int lane)
| `0<=lane<=1`
`a -> Rn`
`v -> Vd.2D` | `MOV Vd.D[lane],Rn` | `Vd.2D -> result` | `A64` | -| floatm8x16_t vsetq_lane_fm8(
     floatm8_t a,
     floatm8x16_t v,
     const int lane)
| `0<=lane<=15`
`a -> Rn`
`v -> Vd.16B` | `MOV Vd.B[lane],Rn` | `Vd.16B -> result` | `A64` | +| mfloat8x16_t vsetq_lane_mf8(
     mfloat8_t a,
     mfloat8x16_t v,
     const int lane)
| `0<=lane<=15`
`a -> Rn`
`v -> Vd.16B` | `MOV Vd.B[lane],Rn` | `Vd.16B -> result` | `A64` |
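A minimal sketch of the `_mf8` set-lane form, assuming an FP8-enabled target (the lane index must be a compile-time constant):

``` c
#include <arm_neon.h>

// Overwrite lane 0 of an FP8 vector with a scalar.
mfloat8x8_t set_first_mf8(mfloat8_t a, mfloat8x8_t v) {
    return vset_lane_mf8(a, v, 0);   // MOV Vd.B[0],Rn
}
```

### Load

@@ -3712,8 +3712,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.

| poly16x8_t vld1q_p16(poly16_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8H},[Xn]` | `Vt.8H -> result` | `v7/A32/A64` |
| float64x1_t vld1_f64(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D},[Xn]` | `Vt.1D -> result` | `A64` |
| float64x2_t vld1q_f64(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.2D},[Xn]` | `Vt.2D -> result` | `A64` |
-| floatm8x8_t vld1_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B},[Xn]` | `Vt.8B -> result` | `A64` |
-| floatm8x16_t vld1q_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B},[Xn]` | `Vt.16B -> result` | `A64` |
+| mfloat8x8_t vld1_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B},[Xn]` | `Vt.8B -> result` | `A64` |
+| mfloat8x16_t vld1q_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B},[Xn]` | `Vt.16B -> result` | `A64` |
| int8x8_t vld1_lane_s8(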
     int8_t const *ptr,
     int8x8_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.8B`
`0 <= lane <= 7` | `LD1 {Vt.b}[lane],[Xn]` | `Vt.8B -> result` | `v7/A32/A64` | | int8x16_t vld1q_lane_s8(
     int8_t const *ptr,
     int8x16_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.16B`
`0 <= lane <= 15` | `LD1 {Vt.b}[lane],[Xn]` | `Vt.16B -> result` | `v7/A32/A64` | | int16x4_t vld1_lane_s16(
     int16_t const *ptr,
     int16x4_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.4H`
`0 <= lane <= 3` | `LD1 {Vt.H}[lane],[Xn]` | `Vt.4H -> result` | `v7/A32/A64` | @@ -3742,8 +3742,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly16x8_t vld1q_lane_p16(
     poly16_t const *ptr,
     poly16x8_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.8H`
`0 <= lane <= 7` | `LD1 {Vt.H}[lane],[Xn]` | `Vt.8H -> result` | `v7/A32/A64` | | float64x1_t vld1_lane_f64(
     float64_t const *ptr,
     float64x1_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.1D`
`0 <= lane <= 0` | `LD1 {Vt.D}[lane],[Xn]` | `Vt.1D -> result` | `A64` | | float64x2_t vld1q_lane_f64(
     float64_t const *ptr,
     float64x2_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.2D`
`0 <= lane <= 1` | `LD1 {Vt.D}[lane],[Xn]` | `Vt.2D -> result` | `A64` | -| floatm8x8_t vld1_lane_fm8(
     floatm8_t const *ptr,
     floatm8x8_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.8B`
`0 <= lane <= 7` | `LD1 {Vt.b}[lane],[Xn]` | `Vt.8B -> result` | `A64` | -| floatm8x16_t vld1q_lane_fm8(
     floatm8_t const *ptr,
     floatm8x16_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.16B`
`0 <= lane <= 15` | `LD1 {Vt.b}[lane],[Xn]` | `Vt.16B -> result` | `A64` | +| mfloat8x8_t vld1_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x8_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.8B`
`0 <= lane <= 7` | `LD1 {Vt.b}[lane],[Xn]` | `Vt.8B -> result` | `A64` | +| mfloat8x16_t vld1q_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x16_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.16B`
`0 <= lane <= 15` | `LD1 {Vt.b}[lane],[Xn]` | `Vt.16B -> result` | `A64` | | uint64x1_t vldap1_lane_u64(
     uint64_t const *ptr,
     uint64x1_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.1D`
`0 <= lane <= 0` | `LDAP1 {Vt.D}[lane],[Xn]` | `Vt.1D -> result` | `A64` | | uint64x2_t vldap1q_lane_u64(
     uint64_t const *ptr,
     uint64x2_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.2D`
`0 <= lane <= 1` | `LDAP1 {Vt.D}[lane],[Xn]` | `Vt.2D -> result` | `A64` | | int64x1_t vldap1_lane_s64(
     int64_t const *ptr,
     int64x1_t src,
     const int lane)
| `ptr -> Xn`
`src -> Vt.1D`
`0 <= lane <= 0` | `LDAP1 {Vt.D}[lane],[Xn]` | `Vt.1D -> result` | `A64` | @@ -3780,8 +3780,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly16x8_t vld1q_dup_p16(poly16_t const *ptr) | `ptr -> Xn` | `LD1R {Vt.8H},[Xn]` | `Vt.8H -> result` | `v7/A32/A64` | | float64x1_t vld1_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D},[Xn]` | `Vt.1D -> result` | `A64` | | float64x2_t vld1q_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD1R {Vt.2D},[Xn]` | `Vt.2D -> result` | `A64` | -| floatm8x8_t vld1_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD1R {Vt.8B},[Xn]` | `Vt.8B -> result` | `A64` | -| floatm8x16_t vld1q_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD1R {Vt.16B},[Xn]` | `Vt.16B -> result` | `A64` | +| mfloat8x8_t vld1_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1R {Vt.8B},[Xn]` | `Vt.8B -> result` | `A64` | +| mfloat8x16_t vld1q_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1R {Vt.16B},[Xn]` | `Vt.16B -> result` | `A64` | | void vstl1_lane_u64(
     uint64_t *ptr,
     uint64x1_t val,
     const int lane)
| `val -> Vt.1D`
`ptr -> Xn`
`0 <= lane <= 0` | `STL1 {Vt.d}[lane],[Xn]` | | `A64` | | void vstl1q_lane_u64(
     uint64_t *ptr,
     uint64x2_t val,
     const int lane)
| `val -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `STL1 {Vt.d}[lane],[Xn]` | | `A64` | | void vstl1_lane_s64(
     int64_t *ptr,
     int64x1_t val,
     const int lane)
| `val -> Vt.1D`
`ptr -> Xn`
`0 <= lane <= 0` | `STL1 {Vt.d}[lane],[Xn]` | | `A64` | @@ -3818,8 +3818,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x2_t vld2q_p64(poly64_t const *ptr) | `ptr -> Xn` | `LD2 {Vt.2D - Vt2.2D},[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x2_t vld2_f64(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D - Vt2.1D},[Xn]` | `Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x2_t vld2q_f64(float64_t const *ptr) | `ptr -> Xn` | `LD2 {Vt.2D - Vt2.2D},[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x2_t vld2_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD2 {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x2_t vld2q_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD2 {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x2_t vld2_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD2 {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x2_t vld2q_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD2 {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x3_t vld3_s8(int8_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x3_t vld3q_s8(int8_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x3_t vld3_s16(int16_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.4H - Vt3.4H},[Xn]` | `Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -3848,8 +3848,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x3_t vld3q_p64(poly64_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.2D - Vt3.2D},[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x3_t vld3_f64(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D - Vt3.1D},[Xn]` | `Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x3_t vld3q_f64(float64_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.2D - Vt3.2D},[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x3_t vld3_fm8(int8_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x3_t vld3q_fm8(int8_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` |
+| mfloat8x8x3_t vld3_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`

`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` |
+| mfloat8x16x3_t vld3q_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD3 {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`

`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x4_t vld4_s8(int8_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x4_t vld4q_s8(int8_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x4_t vld4_s16(int16_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.4H - Vt4.4H},[Xn]` | `Vt4.4H -> result.val[3]`
`Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -3878,8 +3878,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x4_t vld4q_p64(poly64_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.2D - Vt4.2D},[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x4_t vld4_f64(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D - Vt4.1D},[Xn]` | `Vt4.1D -> result.val[3]`
`Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x4_t vld4q_f64(float64_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.2D - Vt4.2D},[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x4_t vld4_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x4_t vld4q_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x4_t vld4_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x4_t vld4q_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD4 {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x2_t vld2_dup_s8(int8_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x2_t vld2q_dup_s8(int8_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x2_t vld2_dup_s16(int16_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.4H - Vt2.4H},[Xn]` | `Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -3908,8 +3908,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x2_t vld2q_dup_p64(poly64_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.2D - Vt2.2D},[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x2_t vld2_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.1D - Vt2.1D},[Xn]` | `Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x2_t vld2q_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.2D - Vt2.2D},[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x2_t vld2_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x2_t vld2q_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x2_t vld2_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x2_t vld2q_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD2R {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x3_t vld3_dup_s8(int8_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x3_t vld3q_dup_s8(int8_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x3_t vld3_dup_s16(int16_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.4H - Vt3.4H},[Xn]` | `Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -3938,8 +3938,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x3_t vld3q_dup_p64(poly64_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.2D - Vt3.2D},[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x3_t vld3_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.1D - Vt3.1D},[Xn]` | `Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x3_t vld3q_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.2D - Vt3.2D},[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x3_t vld3_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x3_t vld3q_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x3_t vld3_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x3_t vld3q_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD3R {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x4_t vld4_dup_s8(int8_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x4_t vld4q_dup_s8(int8_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x4_t vld4_dup_s16(int16_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.4H - Vt4.4H},[Xn]` | `Vt4.4H -> result.val[3]`
`Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -3968,8 +3968,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x4_t vld4q_dup_p64(poly64_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.2D - Vt4.2D},[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x4_t vld4_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.1D - Vt4.1D},[Xn]` | `Vt4.1D -> result.val[3]`
`Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x4_t vld4q_dup_f64(float64_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.2D - Vt4.2D},[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x4_t vld4_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x4_t vld4q_dup_fm8(floatm8_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x4_t vld4_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x4_t vld4q_dup_mf8(mfloat8_t const *ptr) | `ptr -> Xn` | `LD4R {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int16x4x2_t vld2_lane_s16(
     int16_t const *ptr,
     int16x4x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.4H`
`src.val[0] -> Vt.4H`
`0 <= lane <= 3` | `LD2 {Vt.h - Vt2.h}[lane],[Xn]` | `Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | | int16x8x2_t vld2q_lane_s16(
     int16_t const *ptr,
     int16x8x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.8H`
`src.val[0] -> Vt.8H`
`0 <= lane <= 7` | `LD2 {Vt.h - Vt2.h}[lane],[Xn]` | `Vt2.8H -> result.val[1]`
`Vt.8H -> result.val[0]` | `v7/A32/A64` | | int32x2x2_t vld2_lane_s32(
     int32_t const *ptr,
     int32x2x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.2S`
`src.val[0] -> Vt.2S`
`0 <= lane <= 1` | `LD2 {Vt.s - Vt2.s}[lane],[Xn]` | `Vt2.2S -> result.val[1]`
`Vt.2S -> result.val[0]` | `v7/A32/A64` |
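A short sketch of a de-interleaving FP8 structure load using the `_mf8` forms documented above, assuming an FP8-enabled target:

``` c
#include <arm_neon.h>

// LD2 splits alternating elements of the stream into the two result
// vectors: pairs.val[0] gets elements 0,2,4,..., pairs.val[1] gets 1,3,5,...
mfloat8x16x2_t load_pairs_mf8(const mfloat8_t *p) {
    return vld2q_mf8(p);             // LD2 {Vt.16B - Vt2.16B},[Xn]
}
```

@@ -3998,8 +3998,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.

| poly64x2x2_t vld2q_lane_p64(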
     poly64_t const *ptr,
     poly64x2x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.2D`
`src.val[0] -> Vt.2D`
`0 <= lane <= 1` | `LD2 {Vt.d - Vt2.d}[lane],[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x2_t vld2_lane_f64(
     float64_t const *ptr,
     float64x1x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.1D`
`src.val[0] -> Vt.1D`
`0 <= lane <= 0` | `LD2 {Vt.d - Vt2.d}[lane],[Xn]` | `Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x2_t vld2q_lane_f64(
     float64_t const *ptr,
     float64x2x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.2D`
`src.val[0] -> Vt.2D`
`0 <= lane <= 1` | `LD2 {Vt.d - Vt2.d}[lane],[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x2_t vld2_lane_fm8(
     floatm8_t const *ptr,
     floatm8x8x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.8B`
`src.val[0] -> Vt.8B`
`0 <= lane <= 7` | `LD2 {Vt.b - Vt2.b}[lane],[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x2_t vld2q_lane_fm8(
     floatm8_t const *ptr,
     floatm8x16x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.16B`
`src.val[0] -> Vt.16B`
`0 <= lane <= 15` | `LD2 {Vt.b - Vt2.b}[lane],[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x2_t vld2_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x8x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.8B`
`src.val[0] -> Vt.8B`
`0 <= lane <= 7` | `LD2 {Vt.b - Vt2.b}[lane],[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x2_t vld2q_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x16x2_t src,
     const int lane)
| `ptr -> Xn`
`src.val[1] -> Vt2.16B`
`src.val[0] -> Vt.16B`
`0 <= lane <= 15` | `LD2 {Vt.b - Vt2.b}[lane],[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int16x4x3_t vld3_lane_s16(
     int16_t const *ptr,
     int16x4x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.4H`
`src.val[1] -> Vt2.4H`
`src.val[0] -> Vt.4H`
`0 <= lane <= 3` | `LD3 {Vt.h - Vt3.h}[lane],[Xn]` | `Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | | int16x8x3_t vld3q_lane_s16(
     int16_t const *ptr,
     int16x8x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.8H`
`src.val[1] -> Vt2.8H`
`src.val[0] -> Vt.8H`
`0 <= lane <= 7` | `LD3 {Vt.h - Vt3.h}[lane],[Xn]` | `Vt3.8H -> result.val[2]`
`Vt2.8H -> result.val[1]`
`Vt.8H -> result.val[0]` | `v7/A32/A64` | | int32x2x3_t vld3_lane_s32(
     int32_t const *ptr,
     int32x2x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.2S`
`src.val[1] -> Vt2.2S`
`src.val[0] -> Vt.2S`
`0 <= lane <= 1` | `LD3 {Vt.s - Vt3.s}[lane],[Xn]` | `Vt3.2S -> result.val[2]`
`Vt2.2S -> result.val[1]`
`Vt.2S -> result.val[0]` | `v7/A32/A64` | @@ -4028,8 +4028,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x3_t vld3q_lane_p64(
     poly64_t const *ptr,
     poly64x2x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.2D`
`src.val[1] -> Vt2.2D`
`src.val[0] -> Vt.2D`
`0 <= lane <= 1` | `LD3 {Vt.d - Vt3.d}[lane],[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x3_t vld3_lane_f64(
     float64_t const *ptr,
     float64x1x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.1D`
`src.val[1] -> Vt2.1D`
`src.val[0] -> Vt.1D`
`0 <= lane <= 0` | `LD3 {Vt.d - Vt3.d}[lane],[Xn]` | `Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x3_t vld3q_lane_f64(
     float64_t const *ptr,
     float64x2x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.2D`
`src.val[1] -> Vt2.2D`
`src.val[0] -> Vt.2D`
`0 <= lane <= 1` | `LD3 {Vt.d - Vt3.d}[lane],[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x3_t vld3_lane_fm8(
     floatm8_t const *ptr,
     floatm8x8x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.8B`
`src.val[1] -> Vt2.8B`
`src.val[0] -> Vt.8B`
`0 <= lane <= 7` | `LD3 {Vt.b - Vt3.b}[lane],[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x3_t vld3q_lane_fm8(
     floatm8_t const *ptr,
     floatm8x16x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.16B`
`src.val[1] -> Vt2.16B`
`src.val[0] -> Vt.16B`
`0 <= lane <= 15` | `LD3 {Vt.b - Vt3.b}[lane],[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x3_t vld3_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x8x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.8B`
`src.val[1] -> Vt2.8B`
`src.val[0] -> Vt.8B`
`0 <= lane <= 7` | `LD3 {Vt.b - Vt3.b}[lane],[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x3_t vld3q_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x16x3_t src,
     const int lane)
| `ptr -> Xn`
`src.val[2] -> Vt3.16B`
`src.val[1] -> Vt2.16B`
`src.val[0] -> Vt.16B`
`0 <= lane <= 15` | `LD3 {Vt.b - Vt3.b}[lane],[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int16x4x4_t vld4_lane_s16(
     int16_t const *ptr,
     int16x4x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.4H`
`src.val[2] -> Vt3.4H`
`src.val[1] -> Vt2.4H`
`src.val[0] -> Vt.4H`
`0 <= lane <= 3` | `LD4 {Vt.h - Vt4.h}[lane],[Xn]` | `Vt4.4H -> result.val[3]`
`Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | | int16x8x4_t vld4q_lane_s16(
     int16_t const *ptr,
     int16x8x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.8H`
`src.val[2] -> Vt3.8H`
`src.val[1] -> Vt2.8H`
`src.val[0] -> Vt.8H`
`0 <= lane <= 7` | `LD4 {Vt.h - Vt4.h}[lane],[Xn]` | `Vt4.8H -> result.val[3]`
`Vt3.8H -> result.val[2]`
`Vt2.8H -> result.val[1]`
`Vt.8H -> result.val[0]` | `v7/A32/A64` | | int32x2x4_t vld4_lane_s32(
     int32_t const *ptr,
     int32x2x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.2S`
`src.val[2] -> Vt3.2S`
`src.val[1] -> Vt2.2S`
`src.val[0] -> Vt.2S`
`0 <= lane <= 1` | `LD4 {Vt.s - Vt4.s}[lane],[Xn]` | `Vt4.2S -> result.val[3]`
`Vt3.2S -> result.val[2]`
`Vt2.2S -> result.val[1]`
`Vt.2S -> result.val[0]` | `v7/A32/A64` | @@ -4058,8 +4058,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x4_t vld4q_lane_p64(
     poly64_t const *ptr,
     poly64x2x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.2D`
`src.val[2] -> Vt3.2D`
`src.val[1] -> Vt2.2D`
`src.val[0] -> Vt.2D`
`0 <= lane <= 1` | `LD4 {Vt.d - Vt4.d}[lane],[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | | float64x1x4_t vld4_lane_f64(
     float64_t const *ptr,
     float64x1x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.1D`
`src.val[2] -> Vt3.1D`
`src.val[1] -> Vt2.1D`
`src.val[0] -> Vt.1D`
`0 <= lane <= 0` | `LD4 {Vt.d - Vt4.d}[lane],[Xn]` | `Vt4.1D -> result.val[3]`
`Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x4_t vld4q_lane_f64(
     float64_t const *ptr,
     float64x2x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.2D`
`src.val[2] -> Vt3.2D`
`src.val[1] -> Vt2.2D`
`src.val[0] -> Vt.2D`
`0 <= lane <= 1` | `LD4 {Vt.d - Vt4.d}[lane],[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x4_t vld4_lane_fm8(
     floatm8_t const *ptr,
     floatm8x8x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.8B`
`src.val[2] -> Vt3.8B`
`src.val[1] -> Vt2.8B`
`src.val[0] -> Vt.8B`
`0 <= lane <= 7` | `LD4 {Vt.b - Vt4.b}[lane],[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x4_t vld4q_lane_fm8(
     floatm8_t const *ptr,
     floatm8x16x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.16B`
`src.val[2] -> Vt3.16B`
`src.val[1] -> Vt2.16B`
`src.val[0] -> Vt.16B`
`0 <= lane <= 15` | `LD4 {Vt.b - Vt4.b}[lane],[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x4_t vld4_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x8x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.8B`
`src.val[2] -> Vt3.8B`
`src.val[1] -> Vt2.8B`
`src.val[0] -> Vt.8B`
`0 <= lane <= 7` | `LD4 {Vt.b - Vt4.b}[lane],[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x4_t vld4q_lane_mf8(
     mfloat8_t const *ptr,
     mfloat8x16x4_t src,
     const int lane)
| `ptr -> Xn`
`src.val[3] -> Vt4.16B`
`src.val[2] -> Vt3.16B`
`src.val[1] -> Vt2.16B`
`src.val[0] -> Vt.16B`
`0 <= lane <= 15` | `LD4 {Vt.b - Vt4.b}[lane],[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x2_t vld1_s8_x2(int8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x2_t vld1q_s8_x2(int8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x2_t vld1_s16_x2(int16_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.4H - Vt2.4H},[Xn]` | `Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -4088,8 +4088,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x2_t vld1q_p64_x2(poly64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.2D - Vt2.2D},[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A32/A64` | | float64x1x2_t vld1_f64_x2(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D - Vt2.1D},[Xn]` | `Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x2_t vld1q_f64_x2(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.2D - Vt2.2D},[Xn]` | `Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x2_t vld1_fm8_x2(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x2_t vld1q_fm8_x2(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x2_t vld1_mf8_x2(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt2.8B},[Xn]` | `Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x2_t vld1q_mf8_x2(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt2.16B},[Xn]` | `Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x3_t vld1_s8_x3(int8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x3_t vld1q_s8_x3(int8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x3_t vld1_s16_x3(int16_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.4H - Vt3.4H},[Xn]` | `Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -4118,8 +4118,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x3_t vld1q_p64_x3(poly64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.2D - Vt3.2D},[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A32/A64` | | float64x1x3_t vld1_f64_x3(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D - Vt3.1D},[Xn]` | `Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x3_t vld1q_f64_x3(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.2D - Vt3.2D},[Xn]` | `Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x3_t vld1_fm8_x3(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x3_t vld1q_fm8_x3(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x3_t vld1_mf8_x3(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt3.8B},[Xn]` | `Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x3_t vld1q_mf8_x3(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt3.16B},[Xn]` | `Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | | int8x8x4_t vld1_s8_x4(int8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `v7/A32/A64` | | int8x16x4_t vld1q_s8_x4(int8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `v7/A32/A64` | | int16x4x4_t vld1_s16_x4(int16_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.4H - Vt4.4H},[Xn]` | `Vt4.4H -> result.val[3]`
`Vt3.4H -> result.val[2]`
`Vt2.4H -> result.val[1]`
`Vt.4H -> result.val[0]` | `v7/A32/A64` | @@ -4148,8 +4148,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | poly64x2x4_t vld1q_p64_x4(poly64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.2D - Vt4.2D},[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A32/A64` | | float64x1x4_t vld1_f64_x4(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.1D - Vt4.1D},[Xn]` | `Vt4.1D -> result.val[3]`
`Vt3.1D -> result.val[2]`
`Vt2.1D -> result.val[1]`
`Vt.1D -> result.val[0]` | `A64` | | float64x2x4_t vld1q_f64_x4(float64_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.2D - Vt4.2D},[Xn]` | `Vt4.2D -> result.val[3]`
`Vt3.2D -> result.val[2]`
`Vt2.2D -> result.val[1]`
`Vt.2D -> result.val[0]` | `A64` | -| floatm8x8x4_t vld1_fm8_x4(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | -| floatm8x16x4_t vld1q_fm8_x4(floatm8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` | +| mfloat8x8x4_t vld1_mf8_x4(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.8B - Vt4.8B},[Xn]` | `Vt4.8B -> result.val[3]`
`Vt3.8B -> result.val[2]`
`Vt2.8B -> result.val[1]`
`Vt.8B -> result.val[0]` | `A64` | +| mfloat8x16x4_t vld1q_mf8_x4(mfloat8_t const *ptr) | `ptr -> Xn` | `LD1 {Vt.16B - Vt4.16B},[Xn]` | `Vt4.16B -> result.val[3]`
`Vt3.16B -> result.val[2]`
`Vt2.16B -> result.val[1]`
`Vt.16B -> result.val[0]` | `A64` |
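A minimal sketch combining the x4 multi-vector load above with the plain `_mf8` store from the table below, assuming an FP8-enabled target:

``` c
#include <arm_neon.h>

// Copy 64 FP8 values: one LD1 x4 load, then four ST1 stores.
void copy64_mf8(mfloat8_t *dst, const mfloat8_t *src) {
    mfloat8x16x4_t v = vld1q_mf8_x4(src);  // LD1 {Vt.16B - Vt4.16B},[Xn]
    vst1q_mf8(dst,      v.val[0]);         // ST1 {Vt.16B},[Xn]
    vst1q_mf8(dst + 16, v.val[1]);
    vst1q_mf8(dst + 32, v.val[2]);
    vst1q_mf8(dst + 48, v.val[3]);
}
```

#### Load

@@ -4191,8 +4191,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``.

| void vst1q_p16(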
     poly16_t *ptr,
     poly16x8_t val)
| `val -> Vt.8H`
`ptr -> Xn` | `ST1 {Vt.8H},[Xn]` | | `v7/A32/A64` | | void vst1_f64(
     float64_t *ptr,
     float64x1_t val)
| `val -> Vt.1D`
`ptr -> Xn` | `ST1 {Vt.1D},[Xn]` | | `A64` | | void vst1q_f64(
     float64_t *ptr,
     float64x2_t val)
| `val -> Vt.2D`
`ptr -> Xn` | `ST1 {Vt.2D},[Xn]` | | `A64` | -| void vst1_fm8(
     floatm8_t *ptr,
     floatm8x8_t val)
| `val -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B},[Xn]` | | `A64` | -| void vst1q_fm8(
     floatm8_t *ptr,
     floatm8x16_t val)
| `val -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B},[Xn]` | | `A64` | +| void vst1_mf8(
     mfloat8_t *ptr,
     mfloat8x8_t val)
| `val -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B},[Xn]` | | `A64` | +| void vst1q_mf8(
     mfloat8_t *ptr,
     mfloat8x16_t val)
| `val -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B},[Xn]` | | `A64` | | void vst1_lane_s8(
     int8_t *ptr,
     int8x8_t val,
     const int lane)
| `val -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST1 {Vt.b}[lane],[Xn]` | | `v7/A32/A64` | | void vst1q_lane_s8(
     int8_t *ptr,
     int8x16_t val,
     const int lane)
| `val -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST1 {Vt.b}[lane],[Xn]` | | `v7/A32/A64` | | void vst1_lane_s16(
     int16_t *ptr,
     int16x4_t val,
     const int lane)
| `val -> Vt.4H`
`ptr -> Xn`
`0 <= lane <= 3` | `ST1 {Vt.h}[lane],[Xn]` | | `v7/A32/A64` | @@ -4221,8 +4221,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst1q_lane_p16(
     poly16_t *ptr,
     poly16x8_t val,
     const int lane)
| `val -> Vt.8H`
`ptr -> Xn`
`0 <= lane <= 7` | `ST1 {Vt.h}[lane],[Xn]` | | `v7/A32/A64` | | void vst1_lane_f64(
     float64_t *ptr,
     float64x1_t val,
     const int lane)
| `val -> Vt.1D`
`ptr -> Xn`
`0 <= lane <= 0` | `ST1 {Vt.d}[lane],[Xn]` | | `A64` | | void vst1q_lane_f64(
     float64_t *ptr,
     float64x2_t val,
     const int lane)
| `val -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `ST1 {Vt.d}[lane],[Xn]` | | `A64` | -| void vst1_lane_fm8(
     floatm8_t *ptr,
     floatm8x8_t val,
     const int lane)
| `val -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST1 {Vt.b}[lane],[Xn]` | | `A64` | -| void vst1q_lane_fm8(
     floatm8_t *ptr,
     floatm8x16_t val,
     const int lane)
| `val -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST1 {Vt.b}[lane],[Xn]` | | `A64` | +| void vst1_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x8_t val,
     const int lane)
| `val -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST1 {Vt.b}[lane],[Xn]` | | `A64` | +| void vst1q_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x16_t val,
     const int lane)
| `val -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST1 {Vt.b}[lane],[Xn]` | | `A64` | | void vst2_s8(
     int8_t *ptr,
     int8x8x2_t val)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST2 {Vt.8B - Vt2.8B},[Xn]` | | `v7/A32/A64` | | void vst2q_s8(
     int8_t *ptr,
     int8x16x2_t val)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST2 {Vt.16B - Vt2.16B},[Xn]` | | `v7/A32/A64` | | void vst2_s16(
     int16_t *ptr,
     int16x4x2_t val)
| `val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn` | `ST2 {Vt.4H - Vt2.4H},[Xn]` | | `v7/A32/A64` | @@ -4251,8 +4251,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst2q_p64(
     poly64_t *ptr,
     poly64x2x2_t val)
| `val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST2 {Vt.2D - Vt2.2D},[Xn]` | | `A64` | | void vst2_f64(
     float64_t *ptr,
     float64x1x2_t val)
| `val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn` | `ST1 {Vt.1D - Vt2.1D},[Xn]` | | `A64` | | void vst2q_f64(
     float64_t *ptr,
     float64x2x2_t val)
| `val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST2 {Vt.2D - Vt2.2D},[Xn]` | | `A64` | -| void vst2_fm8(
     floatm8_t *ptr,
     floatm8x8x2_t val)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST2 {Vt.8B - Vt2.8B},[Xn]` | | `A64` | -| void vst2q_fm8(
     floatm8_t *ptr,
     floatm8x16x2_t val)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST2 {Vt.16B - Vt2.16B},[Xn]` | | `A64` | +| void vst2_mf8(
     mfloat8_t *ptr,
     mfloat8x8x2_t val)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST2 {Vt.8B - Vt2.8B},[Xn]` | | `A64` | +| void vst2q_mf8(
     mfloat8_t *ptr,
     mfloat8x16x2_t val)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST2 {Vt.16B - Vt2.16B},[Xn]` | | `A64` | | void vst3_s8(
     int8_t *ptr,
     int8x8x3_t val)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST3 {Vt.8B - Vt3.8B},[Xn]` | | `v7/A32/A64` | | void vst3q_s8(
     int8_t *ptr,
     int8x16x3_t val)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST3 {Vt.16B - Vt3.16B},[Xn]` | | `v7/A32/A64` | | void vst3_s16(
     int16_t *ptr,
     int16x4x3_t val)
| `val.val[2] -> Vt3.4H`
`val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn` | `ST3 {Vt.4H - Vt3.4H},[Xn]` | | `v7/A32/A64` | @@ -4281,8 +4281,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst3q_p64(
     poly64_t *ptr,
     poly64x2x3_t val)
| `val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST3 {Vt.2D - Vt3.2D},[Xn]` | | `A64` | | void vst3_f64(
     float64_t *ptr,
     float64x1x3_t val)
| `val.val[2] -> Vt3.1D`
`val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn` | `ST1 {Vt.1D - Vt3.1D},[Xn]` | | `A64` | | void vst3q_f64(
     float64_t *ptr,
     float64x2x3_t val)
| `val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST3 {Vt.2D - Vt3.2D},[Xn]` | | `A64` | -| void vst3_fm8(
     floatm8_t *ptr,
     floatm8x8x3_t val)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST3 {Vt.8B - Vt3.8B},[Xn]` | | `A64` | -| void vst3q_fm8(
     floatm8_t *ptr,
     floatm8x16x3_t val)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST3 {Vt.16B - Vt3.16B},[Xn]` | | `A64` | +| void vst3_mf8(
     mfloat8_t *ptr,
     mfloat8x8x3_t val)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST3 {Vt.8B - Vt3.8B},[Xn]` | | `A64` | +| void vst3q_mf8(
     mfloat8_t *ptr,
     mfloat8x16x3_t val)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST3 {Vt.16B - Vt3.16B},[Xn]` | | `A64` | | void vst4_s8(
     int8_t *ptr,
     int8x8x4_t val)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST4 {Vt.8B - Vt4.8B},[Xn]` | | `v7/A32/A64` | | void vst4q_s8(
     int8_t *ptr,
     int8x16x4_t val)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST4 {Vt.16B - Vt4.16B},[Xn]` | | `v7/A32/A64` | | void vst4_s16(
     int16_t *ptr,
     int16x4x4_t val)
| `val.val[3] -> Vt4.4H`
`val.val[2] -> Vt3.4H`
`val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn` | `ST4 {Vt.4H - Vt4.4H},[Xn]` | | `v7/A32/A64` | @@ -4311,8 +4311,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst4q_p64(
     poly64_t *ptr,
     poly64x2x4_t val)
| `val.val[3] -> Vt4.2D`
`val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST4 {Vt.2D - Vt4.2D},[Xn]` | | `A64` | | void vst4_f64(
     float64_t *ptr,
     float64x1x4_t val)
| `val.val[3] -> Vt4.1D`
`val.val[2] -> Vt3.1D`
`val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn` | `ST1 {Vt.1D - Vt4.1D},[Xn]` | | `A64` | | void vst4q_f64(
     float64_t *ptr,
     float64x2x4_t val)
| `val.val[3] -> Vt4.2D`
`val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST4 {Vt.2D - Vt4.2D},[Xn]` | | `A64` | -| void vst4_fm8(
     floatm8_t *ptr,
     floatm8x8x4_t val)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST4 {Vt.8B - Vt4.8B},[Xn]` | | `A64` | -| void vst4q_fm8(
     floatm8_t *ptr,
     floatm8x16x4_t val)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST4 {Vt.16B - Vt4.16B},[Xn]` | | `A64` | +| void vst4_mf8(
     mfloat8_t *ptr,
     mfloat8x8x4_t val)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST4 {Vt.8B - Vt4.8B},[Xn]` | | `A64` | +| void vst4q_mf8(
     mfloat8_t *ptr,
     mfloat8x16x4_t val)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST4 {Vt.16B - Vt4.16B},[Xn]` | | `A64` | | void vst2_lane_s8(
     int8_t *ptr,
     int8x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `v7/A32/A64` | | void vst2_lane_u8(
     uint8_t *ptr,
     uint8x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `v7/A32/A64` | | void vst2_lane_p8(
     poly8_t *ptr,
     poly8x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `v7/A32/A64` | @@ -4322,9 +4322,9 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst4_lane_s8(
     int8_t *ptr,
     int8x8x4_t val,
     const int lane)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST4 {Vt.b - Vt4.b}[lane],[Xn]` | | `v7/A32/A64` | | void vst4_lane_u8(
     uint8_t *ptr,
     uint8x8x4_t val,
     const int lane)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST4 {Vt.b - Vt4.b}[lane],[Xn]` | | `v7/A32/A64` | | void vst4_lane_p8(
     poly8_t *ptr,
     poly8x8x4_t val,
     const int lane)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST4 {Vt.b - Vt4.b}[lane],[Xn]` | | `v7/A32/A64` | -| void vst2_lane_fm8(
     floatm8_t *ptr,
     floatm8x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `A64` | -| void vst3_lane_fm8(
     floatm8_t *ptr,
     floatm8x8x3_t val,
     const int lane)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST3 {Vt.b - Vt3.b}[lane],[Xn]` | | `A64` | -| void vst4_lane_fm8(
     floatm8_t *ptr,
     floatm8x8x4_t val,
     const int lane)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST4 {Vt.b - Vt4.b}[lane],[Xn]` | | `A64` | +| void vst2_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `A64` | +| void vst3_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x8x3_t val,
     const int lane)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST3 {Vt.b - Vt3.b}[lane],[Xn]` | | `A64` | +| void vst4_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x8x4_t val,
     const int lane)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST4 {Vt.b - Vt4.b}[lane],[Xn]` | | `A64` | | void vst2_lane_s16(
     int16_t *ptr,
     int16x4x2_t val,
     const int lane)
| `val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn`
`0 <= lane <= 3` | `ST2 {Vt.h - Vt2.h}[lane],[Xn]` | | `v7/A32/A64` | | void vst2q_lane_s16(
     int16_t *ptr,
     int16x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8H`
`val.val[0] -> Vt.8H`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.h - Vt2.h}[lane],[Xn]` | | `v7/A32/A64` | | void vst2_lane_s32(
     int32_t *ptr,
     int32x2x2_t val,
     const int lane)
| `val.val[1] -> Vt2.2S`
`val.val[0] -> Vt.2S`
`ptr -> Xn`
`0 <= lane <= 1` | `ST2 {Vt.s - Vt2.s}[lane],[Xn]` | | `v7/A32/A64` | @@ -4350,8 +4350,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst2q_lane_p64(
     poly64_t *ptr,
     poly64x2x2_t val,
     const int lane)
| `val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `ST2 {Vt.d - Vt2.d}[lane],[Xn]` | | `A64` | | void vst2_lane_f64(
     float64_t *ptr,
     float64x1x2_t val,
     const int lane)
| `val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn`
`0 <= lane <= 0` | `ST2 {Vt.d - Vt2.d}[lane],[Xn]` | | `A64` | | void vst2q_lane_f64(
     float64_t *ptr,
     float64x2x2_t val,
     const int lane)
| `val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `ST2 {Vt.d - Vt2.d}[lane],[Xn]` | | `A64` | -| void vst2_lane_fm8(
     floatm8_t *ptr,
     floatm8x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `A64` | -| void vst2q_lane_fm8(
     floatm8_t *ptr,
     floatm8x16x2_t val,
     const int lane)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `A64` | +| void vst2_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x8x2_t val,
     const int lane)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn`
`0 <= lane <= 7` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `A64` | +| void vst2q_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x16x2_t val,
     const int lane)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST2 {Vt.b - Vt2.b}[lane],[Xn]` | | `A64` | | void vst3_lane_s16(
     int16_t *ptr,
     int16x4x3_t val,
     const int lane)
| `val.val[2] -> Vt3.4H`
`val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn`
`0 <= lane <= 3` | `ST3 {Vt.h - Vt3.h}[lane],[Xn]` | | `v7/A32/A64` | | void vst3q_lane_s16(
     int16_t *ptr,
     int16x8x3_t val,
     const int lane)
| `val.val[2] -> Vt3.8H`
`val.val[1] -> Vt2.8H`
`val.val[0] -> Vt.8H`
`ptr -> Xn`
`0 <= lane <= 7` | `ST3 {Vt.h - Vt3.h}[lane],[Xn]` | | `v7/A32/A64` | | void vst3_lane_s32(
     int32_t *ptr,
     int32x2x3_t val,
     const int lane)
| `val.val[2] -> Vt3.2S`
`val.val[1] -> Vt2.2S`
`val.val[0] -> Vt.2S`
`ptr -> Xn`
`0 <= lane <= 1` | `ST3 {Vt.s - Vt3.s}[lane],[Xn]` | | `v7/A32/A64` | @@ -4377,7 +4377,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst3q_lane_p64(
     poly64_t *ptr,
     poly64x2x3_t val,
     const int lane)
| `val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `ST3 {Vt.d - Vt3.d}[lane],[Xn]` | | `A64` | | void vst3_lane_f64(
     float64_t *ptr,
     float64x1x3_t val,
     const int lane)
| `val.val[2] -> Vt3.1D`
`val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn`
`0 <= lane <= 0` | `ST3 {Vt.d - Vt3.d}[lane],[Xn]` | | `A64` | | void vst3q_lane_f64(
     float64_t *ptr,
     float64x2x3_t val,
     const int lane)
| `val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `ST3 {Vt.d - Vt3.d}[lane],[Xn]` | | `A64` | -| void vst3q_lane_fm8(
     floatm8_t *ptr,
     floatm8x16x3_t val,
     const int lane)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST3 {Vt.b - Vt3.b}[lane],[Xn]` | | `A64` | +| void vst3q_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x16x3_t val,
     const int lane)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST3 {Vt.b - Vt3.b}[lane],[Xn]` | | `A64` | | void vst4_lane_s16(
     int16_t *ptr,
     int16x4x4_t val,
     const int lane)
| `val.val[3] -> Vt4.4H`
`val.val[2] -> Vt3.4H`
`val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn`
`0 <= lane <= 3` | `ST4 {Vt.h - Vt4.h}[lane],[Xn]` | | `v7/A32/A64` | | void vst4q_lane_s16(
     int16_t *ptr,
     int16x8x4_t val,
     const int lane)
| `val.val[3] -> Vt4.8H`
`val.val[2] -> Vt3.8H`
`val.val[1] -> Vt2.8H`
`val.val[0] -> Vt.8H`
`ptr -> Xn`
`0 <= lane <= 7` | `ST4 {Vt.h - Vt4.h}[lane],[Xn]` | | `v7/A32/A64` | | void vst4_lane_s32(
     int32_t *ptr,
     int32x2x4_t val,
     const int lane)
| `val.val[3] -> Vt4.2S`
`val.val[2] -> Vt3.2S`
`val.val[1] -> Vt2.2S`
`val.val[0] -> Vt.2S`
`ptr -> Xn`
`0 <= lane <= 1` | `ST4 {Vt.s - Vt4.s}[lane],[Xn]` | | `v7/A32/A64` | @@ -4403,7 +4403,7 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst4q_lane_p64(
     poly64_t *ptr,
     poly64x2x4_t val,
     const int lane)
| `val.val[3] -> Vt4.2D`
`val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `ST4 {Vt.d - Vt4.d}[lane],[Xn]` | | `A64` | | void vst4_lane_f64(
     float64_t *ptr,
     float64x1x4_t val,
     const int lane)
| `val.val[3] -> Vt4.1D`
`val.val[2] -> Vt3.1D`
`val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn`
`0 <= lane <= 0` | `ST4 {Vt.d - Vt4.d}[lane],[Xn]` | | `A64` | | void vst4q_lane_f64(
     float64_t *ptr,
     float64x2x4_t val,
     const int lane)
| `val.val[3] -> Vt4.2D`
`val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn`
`0 <= lane <= 1` | `ST4 {Vt.d - Vt4.d}[lane],[Xn]` | | `A64` | -| void vst4q_lane_fm8(
     floatm8_t *ptr,
     floatm8x16x4_t val,
     const int lane)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST4 {Vt.b - Vt4.b}[lane],[Xn]` | | `A64` | +| void vst4q_lane_mf8(
     mfloat8_t *ptr,
     mfloat8x16x4_t val,
     const int lane)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn`
`0 <= lane <= 15` | `ST4 {Vt.b - Vt4.b}[lane],[Xn]` | | `A64` | | void vst1_s8_x2(
     int8_t *ptr,
     int8x8x2_t val)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt2.8B},[Xn]` | | `v7/A32/A64` | | void vst1q_s8_x2(
     int8_t *ptr,
     int8x16x2_t val)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt2.16B},[Xn]` | | `v7/A32/A64` | | void vst1_s16_x2(
     int16_t *ptr,
     int16x4x2_t val)
| `val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn` | `ST1 {Vt.4H - Vt2.4H},[Xn]` | | `v7/A32/A64` | @@ -4432,8 +4432,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst1q_p64_x2(
     poly64_t *ptr,
     poly64x2x2_t val)
| `val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST1 {Vt.2D - Vt2.2D},[Xn]` | | `A32/A64` | | void vst1_f64_x2(
     float64_t *ptr,
     float64x1x2_t val)
| `val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn` | `ST1 {Vt.1D - Vt2.1D},[Xn]` | | `A64` | | void vst1q_f64_x2(
     float64_t *ptr,
     float64x2x2_t val)
| `val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST1 {Vt.2D - Vt2.2D},[Xn]` | | `A64` | -| void vst1_fm8_x2(
     floatm8_t *ptr,
     floatm8x8x2_t val)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt2.8B},[Xn]` | | `A64` | -| void vst1q_fm8_x2(
     floatm8_t *ptr,
     floatm8x16x2_t val)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt2.16B},[Xn]` | | `A64` | +| void vst1_mf8_x2(
     mfloat8_t *ptr,
     mfloat8x8x2_t val)
| `val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt2.8B},[Xn]` | | `A64` | +| void vst1q_mf8_x2(
     mfloat8_t *ptr,
     mfloat8x16x2_t val)
| `val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt2.16B},[Xn]` | | `A64` | | void vst1_s8_x3(
     int8_t *ptr,
     int8x8x3_t val)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt3.8B},[Xn]` | | `v7/A32/A64` | | void vst1q_s8_x3(
     int8_t *ptr,
     int8x16x3_t val)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt3.16B},[Xn]` | | `v7/A32/A64` | | void vst1_s16_x3(
     int16_t *ptr,
     int16x4x3_t val)
| `val.val[2] -> Vt3.4H`
`val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn` | `ST1 {Vt.4H - Vt3.4H},[Xn]` | | `v7/A32/A64` | @@ -4462,8 +4462,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst1q_p64_x3(
     poly64_t *ptr,
     poly64x2x3_t val)
| `val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST1 {Vt.2D - Vt3.2D},[Xn]` | | `v7/A32/A64` | | void vst1_f64_x3(
     float64_t *ptr,
     float64x1x3_t val)
| `val.val[2] -> Vt3.1D`
`val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn` | `ST1 {Vt.1D - Vt3.1D},[Xn]` | | `A64` | | void vst1q_f64_x3(
     float64_t *ptr,
     float64x2x3_t val)
| `val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST1 {Vt.2D - Vt3.2D},[Xn]` | | `A64` | -| void vst1_fm8_x3(
     floatm8_t *ptr,
     floatm8x8x3_t val)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt3.8B},[Xn]` | | `A64` | -| void vst1q_fm8_x3(
     floatm8_t *ptr,
     floatm8x16x3_t val)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt3.16B},[Xn]` | | `A64` | +| void vst1_mf8_x3(
     mfloat8_t *ptr,
     mfloat8x8x3_t val)
| `val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt3.8B},[Xn]` | | `A64` | +| void vst1q_mf8_x3(
     mfloat8_t *ptr,
     mfloat8x16x3_t val)
| `val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt3.16B},[Xn]` | | `A64` | | void vst1_s8_x4(
     int8_t *ptr,
     int8x8x4_t val)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt4.8B},[Xn]` | | `v7/A32/A64` | | void vst1q_s8_x4(
     int8_t *ptr,
     int8x16x4_t val)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt4.16B},[Xn]` | | `v7/A32/A64` | | void vst1_s16_x4(
     int16_t *ptr,
     int16x4x4_t val)
| `val.val[3] -> Vt4.4H`
`val.val[2] -> Vt3.4H`
`val.val[1] -> Vt2.4H`
`val.val[0] -> Vt.4H`
`ptr -> Xn` | `ST1 {Vt.4H - Vt4.4H},[Xn]` | | `v7/A32/A64` | @@ -4492,8 +4492,8 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | void vst1q_p64_x4(
     poly64_t *ptr,
     poly64x2x4_t val)
| `val.val[3] -> Vt4.2D`
`val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST1 {Vt.2D - Vt4.2D},[Xn]` | | `A32/A64` | | void vst1_f64_x4(
     float64_t *ptr,
     float64x1x4_t val)
| `val.val[3] -> Vt4.1D`
`val.val[2] -> Vt3.1D`
`val.val[1] -> Vt2.1D`
`val.val[0] -> Vt.1D`
`ptr -> Xn` | `ST1 {Vt.1D - Vt4.1D},[Xn]` | | `A64` | | void vst1q_f64_x4(
     float64_t *ptr,
     float64x2x4_t val)
| `val.val[3] -> Vt4.2D`
`val.val[2] -> Vt3.2D`
`val.val[1] -> Vt2.2D`
`val.val[0] -> Vt.2D`
`ptr -> Xn` | `ST1 {Vt.2D - Vt4.2D},[Xn]` | | `A64` | -| void vst1_fm8_x4(
     int8_t *ptr,
     int8x8x4_t val)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt4.8B},[Xn]` | | `v7/A32/A64` | -| void vst1q_fm8_x4(
     int8_t *ptr,
     int8x16x4_t val)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt4.16B},[Xn]` | | `v7/A32/A64` | +| void vst1_mf8_x4(
     mfloat8_t *ptr,
     mfloat8x8x4_t val)
| `val.val[3] -> Vt4.8B`
`val.val[2] -> Vt3.8B`
`val.val[1] -> Vt2.8B`
`val.val[0] -> Vt.8B`
`ptr -> Xn` | `ST1 {Vt.8B - Vt4.8B},[Xn]` | | `A64` | +| void vst1q_mf8_x4(
     mfloat8_t *ptr,
     mfloat8x16x4_t val)
| `val.val[3] -> Vt4.16B`
`val.val[2] -> Vt3.16B`
`val.val[1] -> Vt2.16B`
`val.val[0] -> Vt.16B`
`ptr -> Xn` | `ST1 {Vt.16B - Vt4.16B},[Xn]` | | `A64` | #### Store @@ -5916,52 +5916,52 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------|-------------------|---------------------------| -| bfloat16x8_t vcvt1_bf16_fm8_fpm(
     floatm8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | -| bfloat16x8_t vcvt1_low_bf16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | -| bfloat16x8_t vcvt2_bf16_fm8_fpm(
     floatm8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | -| bfloat16x8_t vcvt2_low_bf16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | -| float16x8_t vcvt1_bf16_fm8_fpm(
     floatm8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| bfloat16x8_t vcvt1_bf16_mf8_fpm(
     mfloat8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| bfloat16x8_t vcvt1_low_bf16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| bfloat16x8_t vcvt2_bf16_mf8_fpm(
     mfloat8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| bfloat16x8_t vcvt2_low_bf16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `BF2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| float16x8_t vcvt1_f16_mf8_fpm(
     mfloat8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | #### Convert to BFloat16 (vector, upper) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-------------------------|-------------------|---------------------------| -| bfloat16x8_t vcvt1_high_bf16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `BF1CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | -| bfloat16x8_t vcvt2_high_bf16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `BF2CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | +| bfloat16x8_t vcvt1_high_bf16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `BF1CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | +| bfloat16x8_t vcvt2_high_bf16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `BF2CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | #### Convert to half-precision (vector, lower) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|-----------------------|-------------------|---------------------------| -| float16x8_t vcvt1_low_f16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | -| float16x8_t vcvt2_f16_fm8_fpm(
     floatm8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | -| float16x8_t vcvt2_low_f16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| float16x8_t vcvt1_low_f16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F1CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| float16x8_t vcvt2_f16_mf8_fpm(
     mfloat8x8_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | +| float16x8_t vcvt2_low_f16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.8B` | `F2CVTL Vd.8H,Vn.8B` | `Vd.8H -> result` | `A64` | #### Convert to half-precision (vector, upper) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|------------------------|-------------------|---------------------------| -| float16x8_t vcvt1_high_f16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `F1CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | -| float16x8_t vcvt2_high_f16_fm8_fpm(
     floatm8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `F2CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | +| float16x8_t vcvt1_high_f16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `F1CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | +| float16x8_t vcvt2_high_f16_mf8_fpm(
     mfloat8x16_t vn,
     fpm_t fpm)
| `vn -> Vn.16B` | `F2CVTL2 Vd.8H,Vn.16B` | `Vd.8H -> result` | `A64` | #### Convert single-precision to floating point (vector, lower) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|-----------------------------|-------------------|---------------------------| -| floatm8x8_t vcvt_fm8_f32_fpm(
     float32x4_t vn,
     float32x4_t vm,
     fpm_t fpm)
| `vn -> Vn.4S`
`vm -> Vm.4S` | `FCVTN Vd.8B, Vn.4S, Vm.4S` | `Vd.8B -> result` | `A64` | +| mfloat8x8_t vcvt_mf8_f32_fpm(
     float32x4_t vn,
     float32x4_t vm,
     fpm_t fpm)
| `vn -> Vn.4S`
`vm -> Vm.4S` | `FCVTN Vd.8B, Vn.4S, Vm.4S` | `Vd.8B -> result` | `A64` | #### Convert single-precision to floating point (vector, upper) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|-------------------------------|--------------------|---------------------------| -| floatm8x16_t vcvt_high_f32_fpm(
     floatm8x8_t vd,
     float32x4_t vn,
     float32x4_t vm,
     fpm_t fpm)
| `vn -> Vn.4S`
`vm -> Vm.4S` | `FCVTN2 Vd.16B, Vn.4S, Vm.4S` | `Vd.16B -> result` | `A64` | +| mfloat8x16_t vcvt_high_mf8_f32_fpm(
     mfloat8x8_t vd,
     float32x4_t vn,
     float32x4_t vm,
     fpm_t fpm)
| `vn -> Vn.4S`
`vm -> Vm.4S` | `FCVTN2 Vd.16B, Vn.4S, Vm.4S` | `Vd.16B -> result` | `A64` | #### Convert half-precision to floating point | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|------------------------------|--------------------|---------------------------| -| floatm8x8_t vcvt_fm8_f16_fpm(
     float16x4_t vn,
     float16x4_t vm,
     fpm_t fpm)
| `vn -> Vn.4H`
`vm -> Vm.4H` | `FCVTN Vd.8B, Vn.4H, Vm.4H` | `Vd.8B -> result` | `A64` | -| floatm8x16_t vcvtq_fm8_f16_fpm(
     float16x8_t vn,
     float16x8_t vm,
     fpm_t fpm)
| `vn -> Vn.8H`
`vm -> Vm.8H` | `FCVTN Vd.16B, Vn.8H, Vm.8H` | `Vd.16B -> result` | `A64` | +| mfloat8x8_t vcvt_mf8_f16_fpm(
     float16x4_t vn,
     float16x4_t vm,
     fpm_t fpm)
| `vn -> Vn.4H`
`vm -> Vm.4H` | `FCVTN Vd.8B, Vn.4H, Vm.4H` | `Vd.8B -> result` | `A64` | +| mfloat8x16_t vcvtq_mf8_f16_fpm(
     float16x8_t vn,
     float16x8_t vm,
     fpm_t fpm)
| `vn -> Vn.8H`
`vm -> Vm.8H` | `FCVTN Vd.16B, Vn.8H, Vm.8H` | `Vd.16B -> result` | `A64` | ### Floating-point adjust exponent by vector @@ -5979,33 +5979,33 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|------------------------------|-------------------|---------------------------| -| float32x2_t vdot_f32_fm8_fpm(
     float32x2_t vd,
     floatm8x8_t vn,
     floatm8x8_t vm,
     fpm_t fpm)
| `vd -> Vd.2S`
`vn -> Vn.8B`
`vm -> Vm.8B` | `FDOT Vd.2S, Vn.8B, Vm.8B` | `Vd.2S -> result` | `A64` | -| float32x4_t vdotq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FDOT Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | +| float32x2_t vdot_f32_mf8_fpm(
     float32x2_t vd,
     mfloat8x8_t vn,
     mfloat8x8_t vm,
     fpm_t fpm)
| `vd -> Vd.2S`
`vn -> Vn.8B`
`vm -> Vm.8B` | `FDOT Vd.2S, Vn.8B, Vm.8B` | `Vd.2S -> result` | `A64` | +| float32x4_t vdotq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FDOT Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | #### Floating-point dot product to single-precision (vector, by element) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|----------------------------------|--------------------|---------------------------| -| float32x2_t vdot_lane_f32_fm8_fpm(
     float32x2_t vd,
     floatm8x8_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.2S`
`vn -> Vn.8B`
`vm -> Vm.4B`
`0 <= lane <= 1` | `FDOT Vd.2S, Vn.8B, Vm.4B[lane]` | `Vd.2S -> result` | `A64` | -| float32x2_t vdot_laneq_f32_fm8_fpm(
     float32x2_t vd,
     floatm8x8_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.2S`
`vn -> Vn.16B`
`vm -> Vm.4B`
`0 <= lane <= 3` | `FDOT Vd.2S, Vn.8B, Vm.4B[lane]` | `Vd.2S -> result` | `A64` | -| float32x4_t vdotq_lane_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.8B`
`vm -> Vm.4B`
`0 <= lane <= 1` | `FDOT Vd.4S, Vn.8B, Vm.4B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vdotq_laneq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16`
`vm -> Vm.4B`
`0 <= lane <= 3` | `FDOT Vd.4S, Vn.8B, Vm.4B[lane]` | `Vd.4SB -> result` | `A64` | +| float32x2_t vdot_lane_f32_mf8_fpm(
     float32x2_t vd,
     mfloat8x8_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.2S`
`vn -> Vn.8B`
`vm -> Vm.4B`
`0 <= lane <= 1` | `FDOT Vd.2S, Vn.8B, Vm.4B[lane]` | `Vd.2S -> result` | `A64` | +| float32x2_t vdot_laneq_f32_mf8_fpm(
     float32x2_t vd,
     mfloat8x8_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.2S`
`vn -> Vn.8B`
`vm -> Vm.4B`
`0 <= lane <= 3` | `FDOT Vd.2S, Vn.8B, Vm.4B[lane]` | `Vd.2S -> result` | `A64` | +| float32x4_t vdotq_lane_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.4B`
`0 <= lane <= 1` | `FDOT Vd.4S, Vn.16B, Vm.4B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vdotq_laneq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.4B`
`0 <= lane <= 3` | `FDOT Vd.4S, Vn.16B, Vm.4B[lane]` | `Vd.4S -> result` | `A64` | #### Floating-point dot product to half-precision (vector) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|------------------------------|-------------------|---------------------------| -| float16x4_t vdot_f16_fm8_fpm(
     float16x4_t vd,
     floatm8x8_t vn,
     floatm8x8_t vm,
     fpm_t fpm)
| `vd -> Vd.4H`
`vn -> Vn.8B`
`vm -> Vm.8B` | `FDOT Vd.4H, Vn.8B, Vm.8B` | `Vd.4H -> result` | `A64` | -| float16x8_t vdotq_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FDOT Vd.8H, Vn.16B, Vm.16B` | `Vd.8H -> result` | `A64` | +| float16x4_t vdot_f16_mf8_fpm(
     float16x4_t vd,
     mfloat8x8_t vn,
     mfloat8x8_t vm,
     fpm_t fpm)
| `vd -> Vd.4H`
`vn -> Vn.8B`
`vm -> Vm.8B` | `FDOT Vd.4H, Vn.8B, Vm.8B` | `Vd.4H -> result` | `A64` | +| float16x8_t vdotq_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FDOT Vd.8H, Vn.16B, Vm.16B` | `Vd.8H -> result` | `A64` | #### Floating-point dot product to half-precision (vector, by element) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|-----------------------------------|-------------------|---------------------------| -| float16x4_t vdot_lane_f16_fm8_fpm(
     float16x4_t vd,
     floatm8x8_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4H`
`vn -> Vn.8B`
`vm -> Vm.2B`
`0 <= lane <= 3` | `FDOT Vd.4H, Vn.8B, Vm.2B[lane]` | `Vd.4H -> result` | `A64` | -| float16x4_t vdot_laneq_f16_fm8_fpm(
     float16x4_t vd,
     floatm8x8_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4H`
`vn -> Vn.8B`
`vm -> Vm.2B`
`0 <= lane <= 7` | `FDOT Vd.4H, Vn.8B, Vm.2B[lane]` | `Vd.4H -> result` | `A64` | -| float16x8_t vdotq_lane_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.2B`
`0 <= lane <= 3` | `FDOT Vd.8H, Vn.16B, Vm.2B[lane]` | `Vd.8H -> result` | `A64` | -| float16x8_t vdotq_laneq_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.2B`
`0 <= lane <= 7` | `FDOT Vd.8H, Vn.16B, Vm.2B[lane]` | `Vd.8H -> result` | `A64` | +| float16x4_t vdot_lane_f16_mf8_fpm(
     float16x4_t vd,
     mfloat8x8_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4H`
`vn -> Vn.8B`
`vm -> Vm.2B`
`0 <= lane <= 3` | `FDOT Vd.4H, Vn.8B, Vm.2B[lane]` | `Vd.4H -> result` | `A64` | +| float16x4_t vdot_laneq_f16_mf8_fpm(
     float16x4_t vd,
     mfloat8x8_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4H`
`vn -> Vn.8B`
`vm -> Vm.2B`
`0 <= lane <= 7` | `FDOT Vd.4H, Vn.8B, Vm.2B[lane]` | `Vd.4H -> result` | `A64` | +| float16x8_t vdotq_lane_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.2B`
`0 <= lane <= 3` | `FDOT Vd.8H, Vn.16B, Vm.2B[lane]` | `Vd.8H -> result` | `A64` | +| float16x8_t vdotq_laneq_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.2B`
`0 <= lane <= 7` | `FDOT Vd.8H, Vn.16B, Vm.2B[lane]` | `Vd.8H -> result` | `A64` | ### Multiply-add @@ -6013,36 +6013,36 @@ The intrinsics in this section are guarded by the macro ``__ARM_NEON``. | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|--------------------------------|-------------------|---------------------------| -| float16x8_t vmlalbq_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALB Vd.8H, Vn.16B, Vm.16B` | `Vd.8H -> result` | `A64` | -| float16x8_t vmlaltq_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALT Vd.8H, Vn.16B, Vm.16B` | `Vd.8H -> result` | `A64` | +| float16x8_t vmlalbq_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALB Vd.8H, Vn.16B, Vm.16B` | `Vd.8H -> result` | `A64` | +| float16x8_t vmlaltq_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALT Vd.8H, Vn.16B, Vm.16B` | `Vd.8H -> result` | `A64` | #### Floating-point multiply-add long to half-precision (vector, by element) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|------------------------------------|-------------------|---------------------------| -| float16x8_t vmlalbq_lane_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALB Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | -| float16x8_t vmlalbq_laneq_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALB Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | -| float16x8_t vmlaltq_lane_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALT Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | -| float16x8_t vmlaltq_laneq_f16_fm8_fpm(
     float16x8_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALT Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | +| float16x8_t vmlalbq_lane_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALB Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | +| float16x8_t vmlalbq_laneq_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALB Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | +| float16x8_t vmlaltq_lane_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALT Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | +| float16x8_t vmlaltq_laneq_f16_mf8_fpm(
     float16x8_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.8H`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALT Vd.8H, Vn.16B, Vm.B[lane]` | `Vd.8H -> result` | `A64` | #### Floating-point multiply-add long-long to single-precision (vector) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------|----------------------------------|-------------------|---------------------------| -| float32x4_t vmlallbbq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLBB Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlallbtq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLBT Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlalltbq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLTB Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlallttq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLTT Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallbbq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLBB Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallbtq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLBT Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlalltbq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLTB Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallttq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.16B` | `FMLALLTT Vd.4S, Vn.16B, Vm.16B` | `Vd.4S -> result` | `A64` | #### Floating-point multiply-add long-long to single-precision (vector, by element) | Intrinsic | Argument preparation | AArch64 Instruction | Result | Supported architectures | |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|--------------------------------------|-------------------|---------------------------| -| float32x4_t vmlallbbq_lane_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlallbbq_laneq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlallbtq_lane_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlallbtq_laneq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlalltbq_lane_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlalltbq_laneq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlallttq_lane_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | -| float32x4_t vmlallttq_laneq_f32_fm8_fpm(
     float32x4_t vd,
     floatm8x16_t vn,
     floatm8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vm -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallbbq_lane_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallbbq_laneq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallbtq_lane_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLBT Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallbtq_laneq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLBT Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlalltbq_lane_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLTB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlalltbq_laneq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLTB Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallttq_lane_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x8_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 7` | `FMLALLTT Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | +| float32x4_t vmlallttq_laneq_f32_mf8_fpm(
     float32x4_t vd,
     mfloat8x16_t vn,
     mfloat8x16_t vm,
     const int lane,
     fpm_t fpm)
| `vd -> Vd.4S`
`vn -> Vn.16B`
`vm -> Vm.B`
`0 <= lane <= 15` | `FMLALLTT Vd.4S, Vn.16B, Vm.B[lane]` | `Vd.4S -> result` | `A64` | diff --git a/tools/intrinsic_db/advsimd.csv b/tools/intrinsic_db/advsimd.csv index 403f60b9..d3f1cde0 100644 --- a/tools/intrinsic_db/advsimd.csv +++ b/tools/intrinsic_db/advsimd.csv @@ -1844,8 +1844,8 @@ poly8x8_t vcopy_lane_p8(poly8x8_t a, __builtin_constant_p(lane1), poly8x8_t b, _ poly8x16_t vcopyq_lane_p8(poly8x16_t a, __builtin_constant_p(lane1), poly8x8_t b, __builtin_constant_p(lane2)) a -> Vd.16B;0 <= lane1 <= 15;b -> Vn.8B;0 <= lane2 <= 7 INS Vd.B[lane1],Vn.B[lane2] Vd.16B -> result A64 poly16x4_t vcopy_lane_p16(poly16x4_t a, __builtin_constant_p(lane1), poly16x4_t b, __builtin_constant_p(lane2)) a -> Vd.4H;0 <= lane1 <= 3;b -> Vn.4H;0 <= lane2 <= 3 INS Vd.H[lane1],Vn.H[lane2] Vd.4H -> result A64 poly16x8_t vcopyq_lane_p16(poly16x8_t a, __builtin_constant_p(lane1), poly16x4_t b, __builtin_constant_p(lane2)) a -> Vd.8H;0 <= lane1 <= 7;b -> Vn.4H;0 <= lane2 <= 3 INS Vd.H[lane1],Vn.H[lane2] Vd.8H -> result A64 -floatm8x8_t vcopy_lane_fm8(floatm8x8_t a, __builtin_constant_p(lane1), floatm8x8_t b, __builtin_constant_p(lane2)) a -> Vd.8B;0 <= lane1 <= 7;b -> Vn.8B;0 <= lane2 <= 7 INS Vd.B[lane1],Vn.B[lane2] Vd.8B -> result A64 -floatm8x16_t vcopyq_lane_fm8(floatm8x16_t a, __builtin_constant_p(lane1), floatm8x8_t b, __builtin_constant_p(lane2)) a -> Vd.16B;0 <= lane1 <= 15;b -> Vn.8B;0 <= lane2 <= 7 INS Vd.B[lane1],Vn.B[lane2] Vd.16B -> result A64 +mfloat8x8_t vcopy_lane_mf8(mfloat8x8_t a, __builtin_constant_p(lane1), mfloat8x8_t b, __builtin_constant_p(lane2)) a -> Vd.8B;0 <= lane1 <= 7;b -> Vn.8B;0 <= lane2 <= 7 INS Vd.B[lane1],Vn.B[lane2] Vd.8B -> result A64 +mfloat8x16_t vcopyq_lane_mf8(mfloat8x16_t a, __builtin_constant_p(lane1), mfloat8x8_t b, __builtin_constant_p(lane2)) a -> Vd.16B;0 <= lane1 <= 15;b -> Vn.8B;0 <= lane2 <= 7 INS Vd.B[lane1],Vn.B[lane2] Vd.16B -> result A64 int8x8_t vcopy_laneq_s8(int8x8_t a, __builtin_constant_p(lane1), int8x16_t b, __builtin_constant_p(lane2)) a -> Vd.8B;0 <= lane1 <= 7;b -> Vn.16B;0 <= lane2 <= 15 INS Vd.B[lane1],Vn.B[lane2] Vd.8B -> result A64 int8x16_t vcopyq_laneq_s8(int8x16_t a, __builtin_constant_p(lane1), int8x16_t b, __builtin_constant_p(lane2)) a -> Vd.16B;0 <= lane1 <= 15;b -> Vn.16B;0 <= lane2 <= 15 INS Vd.B[lane1],Vn.B[lane2] Vd.16B -> result A64 int16x4_t vcopy_laneq_s16(int16x4_t a, __builtin_constant_p(lane1), int16x8_t b, __builtin_constant_p(lane2)) a -> Vd.4H;0 <= lane1 <= 3;b -> Vn.8H;0 <= lane2 <= 7 INS Vd.H[lane1],Vn.H[lane2] Vd.4H -> result A64 @@ -1872,8 +1872,8 @@ poly8x8_t vcopy_laneq_p8(poly8x8_t a, __builtin_constant_p(lane1), poly8x16_t b, poly8x16_t vcopyq_laneq_p8(poly8x16_t a, __builtin_constant_p(lane1), poly8x16_t b, __builtin_constant_p(lane2)) a -> Vd.16B;0 <= lane1 <= 15;b -> Vn.16B;0 <= lane2 <= 15 INS Vd.B[lane1],Vn.B[lane2] Vd.16B -> result A64 poly16x4_t vcopy_laneq_p16(poly16x4_t a, __builtin_constant_p(lane1), poly16x8_t b, __builtin_constant_p(lane2)) a -> Vd.4H;0 <= lane1 <= 3;b -> Vn.8H;0 <= lane2 <= 7 INS Vd.H[lane1],Vn.H[lane2] Vd.4H -> result A64 poly16x8_t vcopyq_laneq_p16(poly16x8_t a, __builtin_constant_p(lane1), poly16x8_t b, __builtin_constant_p(lane2)) a -> Vd.8H;0 <= lane1 <= 7;b -> Vn.8H;0 <= lane2 <= 7 INS Vd.H[lane1],Vn.H[lane2] Vd.8H -> result A64 -floatm8x8_t vcopy_laneq_fm8(floatm8x8_t a, __builtin_constant_p(lane1), floatm8x16_t b, __builtin_constant_p(lane2)) a -> Vd.8B;0 <= lane1 <= 7;b -> Vn.16B;0 <= lane2 <= 15 INS Vd.B[lane1],Vn.B[lane2] Vd.8B -> result A64 -floatm8x16_t 
vcopyq_laneq_fm8(floatm8x16_t a, __builtin_constant_p(lane1), floatm8x16_t b, __builtin_constant_p(lane2)) a -> Vd.16B;0 <= lane1 <= 15;b -> Vn.16B;0 <= lane2 <= 15 INS Vd.B[lane1],Vn.B[lane2] Vd.16B -> result A64 +mfloat8x8_t vcopy_laneq_mf8(mfloat8x8_t a, __builtin_constant_p(lane1), mfloat8x16_t b, __builtin_constant_p(lane2)) a -> Vd.8B;0 <= lane1 <= 7;b -> Vn.16B;0 <= lane2 <= 15 INS Vd.B[lane1],Vn.B[lane2] Vd.8B -> result A64 +mfloat8x16_t vcopyq_laneq_mf8(mfloat8x16_t a, __builtin_constant_p(lane1), mfloat8x16_t b, __builtin_constant_p(lane2)) a -> Vd.16B;0 <= lane1 <= 15;b -> Vn.16B;0 <= lane2 <= 15 INS Vd.B[lane1],Vn.B[lane2] Vd.16B -> result A64 int8x8_t vrbit_s8(int8x8_t a) a -> Vn.8B RBIT Vd.8B,Vn.8B Vd.8B -> result A64 int8x16_t vrbitq_s8(int8x16_t a) a -> Vn.16B RBIT Vd.16B,Vn.16B Vd.16B -> result A64 uint8x8_t vrbit_u8(uint8x8_t a) a -> Vn.8B RBIT Vd.8B,Vn.8B Vd.8B -> result A64 @@ -1894,7 +1894,7 @@ float32x2_t vcreate_f32(uint64_t a) a -> Xn INS Vd.D[0],Xn Vd.2S -> result v7/A3 poly8x8_t vcreate_p8(uint64_t a) a -> Xn INS Vd.D[0],Xn Vd.8B -> result v7/A32/A64 poly16x4_t vcreate_p16(uint64_t a) a -> Xn INS Vd.D[0],Xn Vd.4H -> result v7/A32/A64 float64x1_t vcreate_f64(uint64_t a) a -> Xn INS Vd.D[0],Xn Vd.1D -> result A64 -floatm8x8_t vcreate_fm8(uint64_t a) a -> Xn INS Vd.D[0],Xn Vd.8B -> result v7/A32/A64 +mfloat8x8_t vcreate_mf8(uint64_t a) a -> Xn INS Vd.D[0],Xn Vd.8B -> result A64 int8x8_t vdup_n_s8(int8_t value) value -> rn DUP Vd.8B,rn Vd.8B -> result v7/A32/A64 int8x16_t vdupq_n_s8(int8_t value) value -> rn DUP Vd.16B,rn Vd.16B -> result v7/A32/A64 int16x4_t vdup_n_s16(int16_t value) value -> rn DUP Vd.4H,rn Vd.4H -> result v7/A32/A64 @@ -1921,8 +1921,8 @@ poly16x4_t vdup_n_p16(poly16_t value) value -> rn DUP Vd.4H,rn Vd.4H -> result v poly16x8_t vdupq_n_p16(poly16_t value) value -> rn DUP Vd.8H,rn Vd.8H -> result v7/A32/A64 float64x1_t vdup_n_f64(float64_t value) value -> rn INS Dd.D[0],xn Vd.1D -> result A64 float64x2_t vdupq_n_f64(float64_t value) value -> rn DUP Vd.2D,rn Vd.2D -> result A64 -floatm8x8_t vdup_n_fm8(floatm8_t value) value -> rn DUP Vd.8B,rn Vd.8B -> result A64 -floatm8x16_t vdupq_n_fm8(floatm8_t value) value -> rn DUP Vd.16B,rn Vd.16B -> result A64 +mfloat8x8_t vdup_n_mf8(mfloat8_t value) value -> rn DUP Vd.8B,rn Vd.8B -> result A64 +mfloat8x16_t vdupq_n_mf8(mfloat8_t value) value -> rn DUP Vd.16B,rn Vd.16B -> result A64 int8x8_t vmov_n_s8(int8_t value) value -> rn DUP Vd.8B,rn Vd.8B -> result v7/A32/A64 int8x16_t vmovq_n_s8(int8_t value) value -> rn DUP Vd.16B,rn Vd.16B -> result v7/A32/A64 int16x4_t vmov_n_s16(int16_t value) value -> rn DUP Vd.4H,rn Vd.4H -> result v7/A32/A64 @@ -1947,8 +1947,8 @@ poly16x4_t vmov_n_p16(poly16_t value) value -> rn DUP Vd.4H,rn Vd.4H -> result v poly16x8_t vmovq_n_p16(poly16_t value) value -> rn DUP Vd.8H,rn Vd.8H -> result v7/A32/A64 float64x1_t vmov_n_f64(float64_t value) value -> rn DUP Vd.1D,rn Vd.1D -> result A64 float64x2_t vmovq_n_f64(float64_t value) value -> rn DUP Vd.2D,rn Vd.2D -> result A64 -floatm8x8_t vmov_n_fm8(floatm8_t value) value -> rn DUP Vd.8B,rn Vd.8B -> result A64 -floatm8x16_t vmovq_n_fm8(floatm8_t value) value -> rn DUP Vd.16B,rn Vd.16B -> result A64 +mfloat8x8_t vmov_n_mf8(mfloat8_t value) value -> rn DUP Vd.8B,rn Vd.8B -> result A64 +mfloat8x16_t vmovq_n_mf8(mfloat8_t value) value -> rn DUP Vd.16B,rn Vd.16B -> result A64 int8x8_t vdup_lane_s8(int8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Vd.8B,Vn.B[lane] Vd.8B -> result v7/A32/A64 int8x16_t 
vdupq_lane_s8(int8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Vd.16B,Vn.B[lane] Vd.16B -> result v7/A32/A64 int16x4_t vdup_lane_s16(int16x4_t vec, __builtin_constant_p(lane)) vec -> Vn.4H;0 <= lane <= 3 DUP Vd.4H,Vn.H[lane] Vd.4H -> result v7/A32/A64 @@ -1975,8 +1975,8 @@ poly16x4_t vdup_lane_p16(poly16x4_t vec, __builtin_constant_p(lane)) vec -> Vn.4 poly16x8_t vdupq_lane_p16(poly16x4_t vec, __builtin_constant_p(lane)) vec -> Vn.4H;0 <= lane <= 3 DUP Vd.8H,Vn.H[lane] Vd.8H -> result v7/A32/A64 float64x1_t vdup_lane_f64(float64x1_t vec, __builtin_constant_p(lane)) vec -> Vn.1D;0 <= lane <= 0 DUP Dd,Vn.D[lane] Dd -> result A64 float64x2_t vdupq_lane_f64(float64x1_t vec, __builtin_constant_p(lane)) vec -> Vn.1D;0 <= lane <= 0 DUP Vd.2D,Vn.D[lane] Vd.2D -> result A64 -floatm8x8_t vdup_lane_fm8(floatm8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Vd.8B,Vn.B[lane] Vd.8B -> result /A64 -floatm8x16_t vdupq_lane_fm8(floatm8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Vd.16B,Vn.B[lane] Vd.16B -> result A64 +mfloat8x8_t vdup_lane_mf8(mfloat8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Vd.8B,Vn.B[lane] Vd.8B -> result A64 +mfloat8x16_t vdupq_lane_mf8(mfloat8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Vd.16B,Vn.B[lane] Vd.16B -> result A64 int8x8_t vdup_laneq_s8(int8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Vd.8B,Vn.B[lane] Vd.8B -> result A64 int8x16_t vdupq_laneq_s8(int8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Vd.16B,Vn.B[lane] Vd.16B -> result A64 int16x4_t vdup_laneq_s16(int16x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8H;0 <= lane <= 7 DUP Vd.4H,Vn.H[lane] Vd.4H -> result A64 @@ -2003,8 +2003,8 @@ poly16x4_t vdup_laneq_p16(poly16x8_t vec, __builtin_constant_p(lane)) vec -> Vn.
poly16x8_t vdupq_laneq_p16(poly16x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8H;0 <= lane <= 7 DUP Vd.8H,Vn.H[lane] Vd.8H -> result A64 float64x1_t vdup_laneq_f64(float64x2_t vec, __builtin_constant_p(lane)) vec -> Vn.2D;0 <= lane <= 1 DUP Dd,Vn.D[lane] Dd -> result A64 float64x2_t vdupq_laneq_f64(float64x2_t vec, __builtin_constant_p(lane)) vec -> Vn.2D;0 <= lane <= 1 DUP Vd.2D,Vn.D[lane] Vd.2D -> result A64 -floatm8x8_t vdup_laneq_fm8(floatm8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Vd.8B,Vn.B[lane] Vd.8B -> result A64 -floatm8x16_t vdupq_laneq_fm8(floatm8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Vd.16B,Vn.B[lane] Vd.16B -> result A64 +mfloat8x8_t vdup_laneq_mf8(mfloat8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Vd.8B,Vn.B[lane] Vd.8B -> result A64 +mfloat8x16_t vdupq_laneq_mf8(mfloat8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Vd.16B,Vn.B[lane] Vd.16B -> result A64 int8x16_t vcombine_s8(int8x8_t low, int8x8_t high) low -> Vn.8B;high -> Vm.8B DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.16B -> result v7/A32/A64 int16x8_t vcombine_s16(int16x4_t low, int16x4_t high) low -> Vn.4H;high -> Vm.4H DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.8H -> result v7/A32/A64 int32x4_t vcombine_s32(int32x2_t low, int32x2_t high) low -> Vn.2S;high -> Vm.2S DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.4S -> result v7/A32/A64 @@ -2019,7 +2019,7 @@ float32x4_t vcombine_f32(float32x2_t low, float32x2_t high) low -> Vn.2S;high -> poly8x16_t vcombine_p8(poly8x8_t low, poly8x8_t high) low -> Vn.8B;high -> Vm.8B DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.16B -> result v7/A32/A64 poly16x8_t vcombine_p16(poly16x4_t low, poly16x4_t high) low -> Vn.4H;high -> Vm.4H DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.8H -> result v7/A32/A64 float64x2_t vcombine_f64(float64x1_t low, float64x1_t high) low -> Vn.1D;high -> Vm.1D DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.2D -> result A64 -floatm8x16_t vcombine_fm8(floatm8x8_t low, floatm8x8_t high) low -> Vn.8B;high -> Vm.8B DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.16B -> result A64 +mfloat8x16_t vcombine_mf8(mfloat8x8_t low, mfloat8x8_t high) low -> Vn.8B;high -> Vm.8B DUP Vd.1D,Vn.D[0];INS Vd.D[1],Vm.D[0] Vd.16B -> result A64 int8x8_t vget_high_s8(int8x16_t a) a -> Vn.16B DUP Vd.1D,Vn.D[1] Vd.8B -> result v7/A32/A64 int16x4_t vget_high_s16(int16x8_t a) a -> Vn.8H DUP Vd.1D,Vn.D[1] Vd.4H -> result v7/A32/A64 int32x2_t vget_high_s32(int32x4_t a) a -> Vn.4S DUP Vd.1D,Vn.D[1] Vd.2S -> result v7/A32/A64 @@ -2034,7 +2034,7 @@ float32x2_t vget_high_f32(float32x4_t a) a -> Vn.4S DUP Vd.1D,Vn.D[1] Vd.2S -> r poly8x8_t vget_high_p8(poly8x16_t a) a -> Vn.16B DUP Vd.1D,Vn.D[1] Vd.8B -> result v7/A32/A64 poly16x4_t vget_high_p16(poly16x8_t a) a -> Vn.8H DUP Vd.1D,Vn.D[1] Vd.4H -> result v7/A32/A64 float64x1_t vget_high_f64(float64x2_t a) a -> Vn.2D DUP Vd.1D,Vn.D[1] Vd.1D -> result A64 -floatm8x8_t vget_high_fm8(floatm8x16_t a) a -> Vn.16B DUP Vd.1D,Vn.D[1] Vd.8B -> result A64 +mfloat8x8_t vget_high_mf8(mfloat8x16_t a) a -> Vn.16B DUP Vd.1D,Vn.D[1] Vd.8B -> result A64 int8x8_t vget_low_s8(int8x16_t a) a -> Vn.16B DUP Vd.1D,Vn.D[0] Vd.8B -> result v7/A32/A64 int16x4_t vget_low_s16(int16x8_t a) a -> Vn.8H DUP Vd.1D,Vn.D[0] Vd.4H -> result v7/A32/A64 int32x2_t vget_low_s32(int32x4_t a) a -> Vn.4S DUP Vd.1D,Vn.D[0] Vd.2S -> result v7/A32/A64 @@ -2049,7 +2049,7 @@ float32x2_t vget_low_f32(float32x4_t a) a -> Vn.4S DUP Vd.1D,Vn.D[0] Vd.2S -> re poly8x8_t vget_low_p8(poly8x16_t a) a -> Vn.16B 
DUP Vd.1D,Vn.D[0] Vd.8B -> result v7/A32/A64 poly16x4_t vget_low_p16(poly16x8_t a) a -> Vn.8H DUP Vd.1D,Vn.D[0] Vd.4H -> result v7/A32/A64 float64x1_t vget_low_f64(float64x2_t a) a -> Vn.2D DUP Vd.1D,Vn.D[0] Vd.1D -> result A64 -floatm8x8_t vget_low_fm8(floatm8x16_t a) a -> Vn.16B DUP Vd.1D,Vn.D[0] Vd.8B -> result A64 +mfloat8x8_t vget_low_mf8(mfloat8x16_t a) a -> Vn.16B DUP Vd.1D,Vn.D[0] Vd.8B -> result A64 int8_t vdupb_lane_s8(int8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Bd,Vn.B[lane] Bd -> result A64 int16_t vduph_lane_s16(int16x4_t vec, __builtin_constant_p(lane)) vec -> Vn.4H;0 <= lane <= 3 DUP Hd,Vn.H[lane] Hd -> result A64 int32_t vdups_lane_s32(int32x2_t vec, __builtin_constant_p(lane)) vec -> Vn.2S;0 <= lane <= 1 DUP Sd,Vn.S[lane] Sd -> result A64 @@ -2062,7 +2062,7 @@ float32_t vdups_lane_f32(float32x2_t vec, __builtin_constant_p(lane)) vec -> Vn. float64_t vdupd_lane_f64(float64x1_t vec, __builtin_constant_p(lane)) vec -> Vn.1D;0 <= lane <= 0 DUP Dd,Vn.D[lane] Dd -> result A64 poly8_t vdupb_lane_p8(poly8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Bd,Vn.B[lane] Bd -> result A64 poly16_t vduph_lane_p16(poly16x4_t vec, __builtin_constant_p(lane)) vec -> Vn.4H;0 <= lane <= 3 DUP Hd,Vn.H[lane] Hd -> result A64 -floatm8_t vdupb_lane_fm8(floatm8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Bd,Vn.B[lane] Bd -> result A64 +mfloat8_t vdupb_lane_mf8(mfloat8x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8B;0 <= lane <= 7 DUP Bd,Vn.B[lane] Bd -> result A64 int8_t vdupb_laneq_s8(int8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Bd,Vn.B[lane] Bd -> result A64 int16_t vduph_laneq_s16(int16x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8H;0 <= lane <= 7 DUP Hd,Vn.H[lane] Hd -> result A64 int32_t vdups_laneq_s32(int32x4_t vec, __builtin_constant_p(lane)) vec -> Vn.4S;0 <= lane <= 3 DUP Sd,Vn.S[lane] Sd -> result A64 @@ -2075,7 +2075,7 @@ float32_t vdups_laneq_f32(float32x4_t vec, __builtin_constant_p(lane)) vec -> Vn float64_t vdupd_laneq_f64(float64x2_t vec, __builtin_constant_p(lane)) vec -> Vn.2D;0 <= lane <= 1 DUP Dd,Vn.D[lane] Dd -> result A64 poly8_t vdupb_laneq_p8(poly8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Bd,Vn.B[lane] Bd -> result A64 poly16_t vduph_laneq_p16(poly16x8_t vec, __builtin_constant_p(lane)) vec -> Vn.8H;0 <= lane <= 7 DUP Hd,Vn.H[lane] Hd -> result A64 -floatm8_t vdupb_laneq_fm8(floatm8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Bd,Vn.B[lane] Bd -> result A64 +mfloat8_t vdupb_laneq_mf8(mfloat8x16_t vec, __builtin_constant_p(lane)) vec -> Vn.16B;0 <= lane <= 15 DUP Bd,Vn.B[lane] Bd -> result A64 int8x8_t vld1_s8(int8_t const *ptr) ptr -> Xn LD1 {Vt.8B},[Xn] Vt.8B -> result v7/A32/A64 int8x16_t vld1q_s8(int8_t const *ptr) ptr -> Xn LD1 {Vt.16B},[Xn] Vt.16B -> result v7/A32/A64 int16x4_t vld1_s16(int16_t const *ptr) ptr -> Xn LD1 {Vt.4H},[Xn] Vt.4H -> result v7/A32/A64 @@ -2104,8 +2104,8 @@ poly16x4_t vld1_p16(poly16_t const *ptr) ptr -> Xn LD1 {Vt.4H},[Xn] Vt.4H -> res poly16x8_t vld1q_p16(poly16_t const *ptr) ptr -> Xn LD1 {Vt.8H},[Xn] Vt.8H -> result v7/A32/A64 float64x1_t vld1_f64(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D},[Xn] Vt.1D -> result A64 float64x2_t vld1q_f64(float64_t const *ptr) ptr -> Xn LD1 {Vt.2D},[Xn] Vt.2D -> result A64 -floatm8x8_t vld1_fm8(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.8B},[Xn] Vt.8B -> result A64 -floatm8x16_t vld1q_fm8(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.16B},[Xn] Vt.16B 
-> result A64 +mfloat8x8_t vld1_mf8(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.8B},[Xn] Vt.8B -> result A64 +mfloat8x16_t vld1q_mf8(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.16B},[Xn] Vt.16B -> result A64 int8x8_t vld1_lane_s8(int8_t const *ptr, int8x8_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.8B;0 <= lane <= 7 LD1 {Vt.b}[lane],[Xn] Vt.8B -> result v7/A32/A64 int8x16_t vld1q_lane_s8(int8_t const *ptr, int8x16_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.16B;0 <= lane <= 15 LD1 {Vt.b}[lane],[Xn] Vt.16B -> result v7/A32/A64 int16x4_t vld1_lane_s16(int16_t const *ptr, int16x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.4H;0 <= lane <= 3 LD1 {Vt.H}[lane],[Xn] Vt.4H -> result v7/A32/A64 @@ -2134,8 +2134,8 @@ poly16x4_t vld1_lane_p16(poly16_t const *ptr, poly16x4_t src, __builtin_constant poly16x8_t vld1q_lane_p16(poly16_t const *ptr, poly16x8_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.8H;0 <= lane <= 7 LD1 {Vt.H}[lane],[Xn] Vt.8H -> result v7/A32/A64 float64x1_t vld1_lane_f64(float64_t const *ptr, float64x1_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.1D;0 <= lane <= 0 LD1 {Vt.D}[lane],[Xn] Vt.1D -> result A64 float64x2_t vld1q_lane_f64(float64_t const *ptr, float64x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.2D;0 <= lane <= 1 LD1 {Vt.D}[lane],[Xn] Vt.2D -> result A64 -floatm8x8_t vld1_lane_fm8(floatm8_t const *ptr, floatm8x8_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.8B;0 <= lane <= 7 LD1 {Vt.b}[lane],[Xn] Vt.8B -> result A64 -floatm8x16_t vld1q_lane_fm8(floatm8_t const *ptr, floatm8x16_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.16B;0 <= lane <= 15 LD1 {Vt.b}[lane],[Xn] Vt.16B -> result A64 +mfloat8x8_t vld1_lane_mf8(mfloat8_t const *ptr, mfloat8x8_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.8B;0 <= lane <= 7 LD1 {Vt.b}[lane],[Xn] Vt.8B -> result A64 +mfloat8x16_t vld1q_lane_mf8(mfloat8_t const *ptr, mfloat8x16_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.16B;0 <= lane <= 15 LD1 {Vt.b}[lane],[Xn] Vt.16B -> result A64 uint64x1_t vldap1_lane_u64(uint64_t const *ptr, uint64x1_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.1D;0 <= lane <= 0 LDAP1 {Vt.D}[lane],[Xn] Vt.1D -> result A64 uint64x2_t vldap1q_lane_u64(uint64_t const *ptr, uint64x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.2D;0 <= lane <= 1 LDAP1 {Vt.D}[lane],[Xn] Vt.2D -> result A64 int64x1_t vldap1_lane_s64(int64_t const *ptr, int64x1_t src, __builtin_constant_p(lane)) ptr -> Xn;src -> Vt.1D;0 <= lane <= 0 LDAP1 {Vt.D}[lane],[Xn] Vt.1D -> result A64 @@ -2172,8 +2172,8 @@ poly16x4_t vld1_dup_p16(poly16_t const *ptr) ptr -> Xn LD1R {Vt.4H},[Xn] Vt.4H - poly16x8_t vld1q_dup_p16(poly16_t const *ptr) ptr -> Xn LD1R {Vt.8H},[Xn] Vt.8H -> result v7/A32/A64 float64x1_t vld1_dup_f64(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D},[Xn] Vt.1D -> result A64 float64x2_t vld1q_dup_f64(float64_t const *ptr) ptr -> Xn LD1R {Vt.2D},[Xn] Vt.2D -> result A64 -floatm8x8_t vld1_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD1R {Vt.8B},[Xn] Vt.8B -> result A64 -floatm8x16_t vld1q_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD1R {Vt.16B},[Xn] Vt.16B -> result A64 +mfloat8x8_t vld1_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD1R {Vt.8B},[Xn] Vt.8B -> result A64 +mfloat8x16_t vld1q_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD1R {Vt.16B},[Xn] Vt.16B -> result A64 void vst1_s8(int8_t *ptr, int8x8_t val) val -> Vt.8B;ptr -> Xn ST1 {Vt.8B},[Xn] v7/A32/A64 void vst1q_s8(int8_t *ptr, int8x16_t val) val -> Vt.16B;ptr -> Xn ST1 {Vt.16B},[Xn] v7/A32/A64 void 
vst1_s16(int16_t *ptr, int16x4_t val) val -> Vt.4H;ptr -> Xn ST1 {Vt.4H},[Xn] v7/A32/A64 @@ -2202,8 +2202,8 @@ void vst1_p16(poly16_t *ptr, poly16x4_t val) val -> Vt.4H;ptr -> Xn ST1 {Vt.4H}, void vst1q_p16(poly16_t *ptr, poly16x8_t val) val -> Vt.8H;ptr -> Xn ST1 {Vt.8H},[Xn] v7/A32/A64 void vst1_f64(float64_t *ptr, float64x1_t val) val -> Vt.1D;ptr -> Xn ST1 {Vt.1D},[Xn] A64 void vst1q_f64(float64_t *ptr, float64x2_t val) val -> Vt.2D;ptr -> Xn ST1 {Vt.2D},[Xn] A64 -void vst1_fm8(floatm8_t *ptr, floatm8x8_t val) val -> Vt.8B;ptr -> Xn ST1 {Vt.8B},[Xn] A64 -void vst1q_fm8(floatm8_t *ptr, floatm8x16_t val) val -> Vt.16B;ptr -> Xn ST1 {Vt.16B},[Xn] A64 +void vst1_mf8(mfloat8_t *ptr, mfloat8x8_t val) val -> Vt.8B;ptr -> Xn ST1 {Vt.8B},[Xn] A64 +void vst1q_mf8(mfloat8_t *ptr, mfloat8x16_t val) val -> Vt.16B;ptr -> Xn ST1 {Vt.16B},[Xn] A64 void vst1_lane_s8(int8_t *ptr, int8x8_t val, __builtin_constant_p(lane)) val -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST1 {Vt.b}[lane],[Xn] v7/A32/A64 void vst1q_lane_s8(int8_t *ptr, int8x16_t val, __builtin_constant_p(lane)) val -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST1 {Vt.b}[lane],[Xn] v7/A32/A64 void vst1_lane_s16(int16_t *ptr, int16x4_t val, __builtin_constant_p(lane)) val -> Vt.4H;ptr -> Xn;0 <= lane <= 3 ST1 {Vt.h}[lane],[Xn] v7/A32/A64 @@ -2240,8 +2240,8 @@ void vstl1_lane_f64(float64_t *ptr, float64x1_t val, __builtin_constant_p(lane)) void vstl1q_lane_f64(float64_t *ptr, float64x2_t val, __builtin_constant_p(lane)) val -> Vt.2D;ptr -> Xn;0 <= lane <= 1 STL1 {Vt.d}[lane],[Xn] A64 void vstl1_lane_p64(poly64_t *ptr, poly64x1_t val, __builtin_constant_p(lane)) val -> Vt.1D;ptr -> Xn;0 <= lane <= 0 STL1 {Vt.d}[lane],[Xn] A64 void vstl1q_lane_p64(poly64_t *ptr, poly64x2_t val, __builtin_constant_p(lane)) val -> Vt.2D;ptr -> Xn;0 <= lane <= 1 STL1 {Vt.d}[lane],[Xn] A64 -void vst1_lane_fm8(floatm8_t *ptr, floatm8x8_t val, __builtin_constant_p(lane)) val -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST1 {Vt.b}[lane],[Xn] A64 -void vst1q_lane_fm8(floatm8_t *ptr, floatm8x16_t val, __builtin_constant_p(lane)) val -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST1 {Vt.b}[lane],[Xn] A64 +void vst1_lane_mf8(mfloat8_t *ptr, mfloat8x8_t val, __builtin_constant_p(lane)) val -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST1 {Vt.b}[lane],[Xn] A64 +void vst1q_lane_mf8(mfloat8_t *ptr, mfloat8x16_t val, __builtin_constant_p(lane)) val -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST1 {Vt.b}[lane],[Xn] A64 int8x8x2_t vld2_s8(int8_t const *ptr) ptr -> Xn LD2 {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x2_t vld2q_s8(int8_t const *ptr) ptr -> Xn LD2 {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x2_t vld2_s16(int16_t const *ptr) ptr -> Xn LD2 {Vt.4H - Vt2.4H},[Xn] Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2270,8 +2270,8 @@ uint64x2x2_t vld2q_u64(uint64_t const *ptr) ptr -> Xn LD2 {Vt.2D - Vt2.2D},[Xn] poly64x2x2_t vld2q_p64(poly64_t const *ptr) ptr -> Xn LD2 {Vt.2D - Vt2.2D},[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x2_t vld2_f64(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D - Vt2.1D},[Xn] Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x2_t vld2q_f64(float64_t const *ptr) ptr -> Xn LD2 {Vt.2D - Vt2.2D},[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x2_t vld2_fm8(floatm8_t const *ptr) ptr -> Xn LD2 {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x2_t vld2q_fm8(floatm8_t const *ptr) ptr -> Xn LD2 {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> 
result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x2_t vld2_mf8(mfloat8_t const *ptr) ptr -> Xn LD2 {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x2_t vld2q_mf8(mfloat8_t const *ptr) ptr -> Xn LD2 {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8x3_t vld3_s8(int8_t const *ptr) ptr -> Xn LD3 {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x3_t vld3q_s8(int8_t const *ptr) ptr -> Xn LD3 {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x3_t vld3_s16(int16_t const *ptr) ptr -> Xn LD3 {Vt.4H - Vt3.4H},[Xn] Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2300,8 +2300,8 @@ uint64x2x3_t vld3q_u64(uint64_t const *ptr) ptr -> Xn LD3 {Vt.2D - Vt3.2D},[Xn] poly64x2x3_t vld3q_p64(poly64_t const *ptr) ptr -> Xn LD3 {Vt.2D - Vt3.2D},[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x3_t vld3_f64(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D - Vt3.1D},[Xn] Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x3_t vld3q_f64(float64_t const *ptr) ptr -> Xn LD3 {Vt.2D - Vt3.2D},[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x3_t vld3_fm8(int8_t const *ptr) ptr -> Xn LD3 {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x3_t vld3q_fm8(int8_t const *ptr) ptr -> Xn LD3 {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x3_t vld3_mf8(mfloat8_t const *ptr) ptr -> Xn LD3 {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x3_t vld3q_mf8(mfloat8_t const *ptr) ptr -> Xn LD3 {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8x4_t vld4_s8(int8_t const *ptr) ptr -> Xn LD4 {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x4_t vld4q_s8(int8_t const *ptr) ptr -> Xn LD4 {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x4_t vld4_s16(int16_t const *ptr) ptr -> Xn LD4 {Vt.4H - Vt4.4H},[Xn] Vt4.4H -> result.val[3];Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2330,8 +2330,8 @@ uint64x2x4_t vld4q_u64(uint64_t const *ptr) ptr -> Xn LD4 {Vt.2D - Vt4.2D},[Xn] poly64x2x4_t vld4q_p64(poly64_t const *ptr) ptr -> Xn LD4 {Vt.2D - Vt4.2D},[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x4_t vld4_f64(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D - Vt4.1D},[Xn] Vt4.1D -> result.val[3];Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x4_t vld4q_f64(float64_t const *ptr) ptr -> Xn LD4 {Vt.2D - Vt4.2D},[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x4_t vld4_fm8(floatm8_t const *ptr) ptr -> Xn LD4 {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x4_t vld4q_fm8(floatm8_t const *ptr) ptr -> Xn LD4 {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64
+mfloat8x8x4_t vld4_mf8(mfloat8_t const *ptr) ptr -> Xn LD4 {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x4_t vld4q_mf8(mfloat8_t const *ptr) ptr -> Xn LD4 {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8x2_t vld2_dup_s8(int8_t const *ptr) ptr -> Xn LD2R {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x2_t vld2q_dup_s8(int8_t const *ptr) ptr -> Xn LD2R {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x2_t vld2_dup_s16(int16_t const *ptr) ptr -> Xn LD2R {Vt.4H - Vt2.4H},[Xn] Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2360,8 +2360,8 @@ uint64x2x2_t vld2q_dup_u64(uint64_t const *ptr) ptr -> Xn LD2R {Vt.2D - Vt2.2D}, poly64x2x2_t vld2q_dup_p64(poly64_t const *ptr) ptr -> Xn LD2R {Vt.2D - Vt2.2D},[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x2_t vld2_dup_f64(float64_t const *ptr) ptr -> Xn LD2R {Vt.1D - Vt2.1D},[Xn] Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x2_t vld2q_dup_f64(float64_t const *ptr) ptr -> Xn LD2R {Vt.2D - Vt2.2D},[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x2_t vld2_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD2R {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x2_t vld2q_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD2R {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x2_t vld2_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD2R {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x2_t vld2q_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD2R {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8x3_t vld3_dup_s8(int8_t const *ptr) ptr -> Xn LD3R {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x3_t vld3q_dup_s8(int8_t const *ptr) ptr -> Xn LD3R {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x3_t vld3_dup_s16(int16_t const *ptr) ptr -> Xn LD3R {Vt.4H - Vt3.4H},[Xn] Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2390,8 +2390,8 @@ uint64x2x3_t vld3q_dup_u64(uint64_t const *ptr) ptr -> Xn LD3R {Vt.2D - Vt3.2D}, poly64x2x3_t vld3q_dup_p64(poly64_t const *ptr) ptr -> Xn LD3R {Vt.2D - Vt3.2D},[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x3_t vld3_dup_f64(float64_t const *ptr) ptr -> Xn LD3R {Vt.1D - Vt3.1D},[Xn] Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x3_t vld3q_dup_f64(float64_t const *ptr) ptr -> Xn LD3R {Vt.2D - Vt3.2D},[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x3_t vld3_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD3R {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x3_t vld3q_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD3R {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x3_t vld3_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD3R {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x3_t vld3q_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD3R {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> 
result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8x4_t vld4_dup_s8(int8_t const *ptr) ptr -> Xn LD4R {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x4_t vld4q_dup_s8(int8_t const *ptr) ptr -> Xn LD4R {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x4_t vld4_dup_s16(int16_t const *ptr) ptr -> Xn LD4R {Vt.4H - Vt4.4H},[Xn] Vt4.4H -> result.val[3];Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2420,8 +2420,8 @@ uint64x2x4_t vld4q_dup_u64(uint64_t const *ptr) ptr -> Xn LD4R {Vt.2D - Vt4.2D}, poly64x2x4_t vld4q_dup_p64(poly64_t const *ptr) ptr -> Xn LD4R {Vt.2D - Vt4.2D},[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x4_t vld4_dup_f64(float64_t const *ptr) ptr -> Xn LD4R {Vt.1D - Vt4.1D},[Xn] Vt4.1D -> result.val[3];Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x4_t vld4q_dup_f64(float64_t const *ptr) ptr -> Xn LD4R {Vt.2D - Vt4.2D},[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x4_t vld4_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD4R {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x4_t vld4q_dup_fm8(floatm8_t const *ptr) ptr -> Xn LD4R {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x4_t vld4_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD4R {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x4_t vld4q_dup_mf8(mfloat8_t const *ptr) ptr -> Xn LD4R {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 void vst2_s8(int8_t *ptr, int8x8x2_t val) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST2 {Vt.8B - Vt2.8B},[Xn] v7/A32/A64 void vst2q_s8(int8_t *ptr, int8x16x2_t val) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST2 {Vt.16B - Vt2.16B},[Xn] v7/A32/A64 void vst2_s16(int16_t *ptr, int16x4x2_t val) val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn ST2 {Vt.4H - Vt2.4H},[Xn] v7/A32/A64 @@ -2450,8 +2450,8 @@ void vst2q_u64(uint64_t *ptr, uint64x2x2_t val) val.val[1] -> Vt2.2D;val.val[0] void vst2q_p64(poly64_t *ptr, poly64x2x2_t val) val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST2 {Vt.2D - Vt2.2D},[Xn] A64 void vst2_f64(float64_t *ptr, float64x1x2_t val) val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn ST1 {Vt.1D - Vt2.1D},[Xn] A64 void vst2q_f64(float64_t *ptr, float64x2x2_t val) val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST2 {Vt.2D - Vt2.2D},[Xn] A64 -void vst2_fm8(floatm8_t *ptr, floatm8x8x2_t val) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST2 {Vt.8B - Vt2.8B},[Xn] A64 -void vst2q_fm8(floatm8_t *ptr, floatm8x16x2_t val) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST2 {Vt.16B - Vt2.16B},[Xn] A64 +void vst2_mf8(mfloat8_t *ptr, mfloat8x8x2_t val) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST2 {Vt.8B - Vt2.8B},[Xn] A64 +void vst2q_mf8(mfloat8_t *ptr, mfloat8x16x2_t val) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST2 {Vt.16B - Vt2.16B},[Xn] A64 void vst3_s8(int8_t *ptr, int8x8x3_t val) val.val[2] -> Vt3.8B;val.val[1] -> 
Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST3 {Vt.8B - Vt3.8B},[Xn] v7/A32/A64 void vst3q_s8(int8_t *ptr, int8x16x3_t val) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST3 {Vt.16B - Vt3.16B},[Xn] v7/A32/A64 void vst3_s16(int16_t *ptr, int16x4x3_t val) val.val[2] -> Vt3.4H;val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn ST3 {Vt.4H - Vt3.4H},[Xn] v7/A32/A64 @@ -2480,8 +2480,8 @@ void vst3q_u64(uint64_t *ptr, uint64x2x3_t val) val.val[2] -> Vt3.2D;val.val[1] void vst3q_p64(poly64_t *ptr, poly64x2x3_t val) val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST3 {Vt.2D - Vt3.2D},[Xn] A64 void vst3_f64(float64_t *ptr, float64x1x3_t val) val.val[2] -> Vt3.1D;val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn ST1 {Vt.1D - Vt3.1D},[Xn] A64 void vst3q_f64(float64_t *ptr, float64x2x3_t val) val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST3 {Vt.2D - Vt3.2D},[Xn] A64 -void vst3_fm8(floatm8_t *ptr, floatm8x8x3_t val) val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST3 {Vt.8B - Vt3.8B},[Xn] A64 -void vst3q_fm8(floatm8_t *ptr, floatm8x16x3_t val) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST3 {Vt.16B - Vt3.16B},[Xn] A64 +void vst3_mf8(mfloat8_t *ptr, mfloat8x8x3_t val) val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST3 {Vt.8B - Vt3.8B},[Xn] A64 +void vst3q_mf8(mfloat8_t *ptr, mfloat8x16x3_t val) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST3 {Vt.16B - Vt3.16B},[Xn] A64 void vst4_s8(int8_t *ptr, int8x8x4_t val) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST4 {Vt.8B - Vt4.8B},[Xn] v7/A32/A64 void vst4q_s8(int8_t *ptr, int8x16x4_t val) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST4 {Vt.16B - Vt4.16B},[Xn] v7/A32/A64 void vst4_s16(int16_t *ptr, int16x4x4_t val) val.val[3] -> Vt4.4H;val.val[2] -> Vt3.4H;val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn ST4 {Vt.4H - Vt4.4H},[Xn] v7/A32/A64 @@ -2510,8 +2510,8 @@ void vst4q_u64(uint64_t *ptr, uint64x2x4_t val) val.val[3] -> Vt4.2D;val.val[2] void vst4q_p64(poly64_t *ptr, poly64x2x4_t val) val.val[3] -> Vt4.2D;val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST4 {Vt.2D - Vt4.2D},[Xn] A64 void vst4_f64(float64_t *ptr, float64x1x4_t val) val.val[3] -> Vt4.1D;val.val[2] -> Vt3.1D;val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn ST1 {Vt.1D - Vt4.1D},[Xn] A64 void vst4q_f64(float64_t *ptr, float64x2x4_t val) val.val[3] -> Vt4.2D;val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST4 {Vt.2D - Vt4.2D},[Xn] A64 -void vst4_fm8(floatm8_t *ptr, floatm8x8x4_t val) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST4 {Vt.8B - Vt4.8B},[Xn] A64 -void vst4q_fm8(floatm8_t *ptr, floatm8x16x4_t val) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST4 {Vt.16B - Vt4.16B},[Xn] A64 +void vst4_mf8(mfloat8_t *ptr, mfloat8x8x4_t val) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST4 {Vt.8B - Vt4.8B},[Xn] A64 +void vst4q_mf8(mfloat8_t *ptr, mfloat8x16x4_t val) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST4 {Vt.16B - Vt4.16B},[Xn] A64 int16x4x2_t vld2_lane_s16(int16_t const *ptr, int16x4x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.4H;src.val[0] -> 
Vt.4H;0 <= lane <= 3 LD2 {Vt.h - Vt2.h}[lane],[Xn] Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 int16x8x2_t vld2q_lane_s16(int16_t const *ptr, int16x8x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.8H;src.val[0] -> Vt.8H;0 <= lane <= 7 LD2 {Vt.h - Vt2.h}[lane],[Xn] Vt2.8H -> result.val[1];Vt.8H -> result.val[0] v7/A32/A64 int32x2x2_t vld2_lane_s32(int32_t const *ptr, int32x2x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.2S;src.val[0] -> Vt.2S;0 <= lane <= 1 LD2 {Vt.s - Vt2.s}[lane],[Xn] Vt2.2S -> result.val[1];Vt.2S -> result.val[0] v7/A32/A64 @@ -2540,8 +2540,8 @@ poly64x1x2_t vld2_lane_p64(poly64_t const *ptr, poly64x1x2_t src, __builtin_cons poly64x2x2_t vld2q_lane_p64(poly64_t const *ptr, poly64x2x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.2D;src.val[0] -> Vt.2D;0 <= lane <= 1 LD2 {Vt.d - Vt2.d}[lane],[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x2_t vld2_lane_f64(float64_t const *ptr, float64x1x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.1D;src.val[0] -> Vt.1D;0 <= lane <= 0 LD2 {Vt.d - Vt2.d}[lane],[Xn] Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x2_t vld2q_lane_f64(float64_t const *ptr, float64x2x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.2D;src.val[0] -> Vt.2D;0 <= lane <= 1 LD2 {Vt.d - Vt2.d}[lane],[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x2_t vld2_lane_fm8(floatm8_t const *ptr, floatm8x8x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.8B;src.val[0] -> Vt.8B;0 <= lane <= 7 LD2 {Vt.b - Vt2.b}[lane],[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x2_t vld2q_lane_fm8(floatm8_t const *ptr, floatm8x16x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.16B;src.val[0] -> Vt.16B;0 <= lane <= 15 LD2 {Vt.b - Vt2.b}[lane],[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x2_t vld2_lane_mf8(mfloat8_t const *ptr, mfloat8x8x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.8B;src.val[0] -> Vt.8B;0 <= lane <= 7 LD2 {Vt.b - Vt2.b}[lane],[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x2_t vld2q_lane_mf8(mfloat8_t const *ptr, mfloat8x16x2_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[1] -> Vt2.16B;src.val[0] -> Vt.16B;0 <= lane <= 15 LD2 {Vt.b - Vt2.b}[lane],[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int16x4x3_t vld3_lane_s16(int16_t const *ptr, int16x4x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.4H;src.val[1] -> Vt2.4H;src.val[0] -> Vt.4H;0 <= lane <= 3 LD3 {Vt.h - Vt3.h}[lane],[Xn] Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 int16x8x3_t vld3q_lane_s16(int16_t const *ptr, int16x8x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.8H;src.val[1] -> Vt2.8H;src.val[0] -> Vt.8H;0 <= lane <= 7 LD3 {Vt.h - Vt3.h}[lane],[Xn] Vt3.8H -> result.val[2];Vt2.8H -> result.val[1];Vt.8H -> result.val[0] v7/A32/A64 int32x2x3_t vld3_lane_s32(int32_t const *ptr, int32x2x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.2S;src.val[1] -> Vt2.2S;src.val[0] -> Vt.2S;0 <= lane <= 1 LD3 {Vt.s - Vt3.s}[lane],[Xn] Vt3.2S -> result.val[2];Vt2.2S -> result.val[1];Vt.2S -> result.val[0] v7/A32/A64 @@ -2570,8 +2570,8 @@ poly64x1x3_t vld3_lane_p64(poly64_t const *ptr, poly64x1x3_t src, __builtin_cons poly64x2x3_t vld3q_lane_p64(poly64_t const *ptr, poly64x2x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.2D;src.val[1] -> 
Vt2.2D;src.val[0] -> Vt.2D;0 <= lane <= 1 LD3 {Vt.d - Vt3.d}[lane],[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x3_t vld3_lane_f64(float64_t const *ptr, float64x1x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.1D;src.val[1] -> Vt2.1D;src.val[0] -> Vt.1D;0 <= lane <= 0 LD3 {Vt.d - Vt3.d}[lane],[Xn] Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x3_t vld3q_lane_f64(float64_t const *ptr, float64x2x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.2D;src.val[1] -> Vt2.2D;src.val[0] -> Vt.2D;0 <= lane <= 1 LD3 {Vt.d - Vt3.d}[lane],[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x3_t vld3_lane_fm8(floatm8_t const *ptr, floatm8x8x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.8B;src.val[1] -> Vt2.8B;src.val[0] -> Vt.8B;0 <= lane <= 7 LD3 {Vt.b - Vt3.b}[lane],[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x3_t vld3q_lane_fm8(floatm8_t const *ptr, floatm8x16x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.16B;src.val[1] -> Vt2.16B;src.val[0] -> Vt.16B;0 <= lane <= 15 LD3 {Vt.b - Vt3.b}[lane],[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x3_t vld3_lane_mf8(mfloat8_t const *ptr, mfloat8x8x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.8B;src.val[1] -> Vt2.8B;src.val[0] -> Vt.8B;0 <= lane <= 7 LD3 {Vt.b - Vt3.b}[lane],[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x3_t vld3q_lane_mf8(mfloat8_t const *ptr, mfloat8x16x3_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[2] -> Vt3.16B;src.val[1] -> Vt2.16B;src.val[0] -> Vt.16B;0 <= lane <= 15 LD3 {Vt.b - Vt3.b}[lane],[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int16x4x4_t vld4_lane_s16(int16_t const *ptr, int16x4x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.4H;src.val[2] -> Vt3.4H;src.val[1] -> Vt2.4H;src.val[0] -> Vt.4H;0 <= lane <= 3 LD4 {Vt.h - Vt4.h}[lane],[Xn] Vt4.4H -> result.val[3];Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 int16x8x4_t vld4q_lane_s16(int16_t const *ptr, int16x8x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.8H;src.val[2] -> Vt3.8H;src.val[1] -> Vt2.8H;src.val[0] -> Vt.8H;0 <= lane <= 7 LD4 {Vt.h - Vt4.h}[lane],[Xn] Vt4.8H -> result.val[3];Vt3.8H -> result.val[2];Vt2.8H -> result.val[1];Vt.8H -> result.val[0] v7/A32/A64 int32x2x4_t vld4_lane_s32(int32_t const *ptr, int32x2x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.2S;src.val[2] -> Vt3.2S;src.val[1] -> Vt2.2S;src.val[0] -> Vt.2S;0 <= lane <= 1 LD4 {Vt.s - Vt4.s}[lane],[Xn] Vt4.2S -> result.val[3];Vt3.2S -> result.val[2];Vt2.2S -> result.val[1];Vt.2S -> result.val[0] v7/A32/A64 @@ -2600,8 +2600,8 @@ poly64x1x4_t vld4_lane_p64(poly64_t const *ptr, poly64x1x4_t src, __builtin_cons poly64x2x4_t vld4q_lane_p64(poly64_t const *ptr, poly64x2x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.2D;src.val[2] -> Vt3.2D;src.val[1] -> Vt2.2D;src.val[0] -> Vt.2D;0 <= lane <= 1 LD4 {Vt.d - Vt4.d}[lane],[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 float64x1x4_t vld4_lane_f64(float64_t const *ptr, float64x1x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.1D;src.val[2] -> Vt3.1D;src.val[1] -> Vt2.1D;src.val[0] -> Vt.1D;0 <= lane <= 0 
LD4 {Vt.d - Vt4.d}[lane],[Xn] Vt4.1D -> result.val[3];Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x4_t vld4q_lane_f64(float64_t const *ptr, float64x2x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.2D;src.val[2] -> Vt3.2D;src.val[1] -> Vt2.2D;src.val[0] -> Vt.2D;0 <= lane <= 1 LD4 {Vt.d - Vt4.d}[lane],[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x4_t vld4_lane_fm8(floatm8_t const *ptr, floatm8x8x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.8B;src.val[2] -> Vt3.8B;src.val[1] -> Vt2.8B;src.val[0] -> Vt.8B;0 <= lane <= 7 LD4 {Vt.b - Vt4.b}[lane],[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x4_t vld4q_lane_fm8(floatm8_t const *ptr, floatm8x16x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.16B;src.val[2] -> Vt3.16B;src.val[1] -> Vt2.16B;src.val[0] -> Vt.16B;0 <= lane <= 15 LD4 {Vt.b - Vt4.b}[lane],[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x4_t vld4_lane_mf8(mfloat8_t const *ptr, mfloat8x8x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.8B;src.val[2] -> Vt3.8B;src.val[1] -> Vt2.8B;src.val[0] -> Vt.8B;0 <= lane <= 7 LD4 {Vt.b - Vt4.b}[lane],[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x4_t vld4q_lane_mf8(mfloat8_t const *ptr, mfloat8x16x4_t src, __builtin_constant_p(lane)) ptr -> Xn;src.val[3] -> Vt4.16B;src.val[2] -> Vt3.16B;src.val[1] -> Vt2.16B;src.val[0] -> Vt.16B;0 <= lane <= 15 LD4 {Vt.b - Vt4.b}[lane],[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 void vst2_lane_s8(int8_t *ptr, int8x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.b - Vt2.b}[lane],[Xn] v7/A32/A64 void vst2_lane_u8(uint8_t *ptr, uint8x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.b - Vt2.b}[lane],[Xn] v7/A32/A64 void vst2_lane_p8(poly8_t *ptr, poly8x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.b - Vt2.b}[lane],[Xn] v7/A32/A64 @@ -2611,9 +2611,9 @@ void vst3_lane_p8(poly8_t *ptr, poly8x8x3_t val, __builtin_constant_p(lane)) val void vst4_lane_s8(int8_t *ptr, int8x8x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST4 {Vt.b - Vt4.b}[lane],[Xn] v7/A32/A64 void vst4_lane_u8(uint8_t *ptr, uint8x8x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST4 {Vt.b - Vt4.b}[lane],[Xn] v7/A32/A64 void vst4_lane_p8(poly8_t *ptr, poly8x8x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST4 {Vt.b - Vt4.b}[lane],[Xn] v7/A32/A64 -void vst2_lane_fm8(floatm8_t *ptr, floatm8x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.b - Vt2.b}[lane],[Xn] A64 -void vst3_lane_fm8(floatm8_t *ptr, floatm8x8x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST3 {Vt.b - Vt3.b}[lane],[Xn] A64 -void 
vst4_lane_fm8(floatm8_t *ptr, floatm8x8x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST4 {Vt.b - Vt4.b}[lane],[Xn] A64 +void vst2_lane_mf8(mfloat8_t *ptr, mfloat8x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.b - Vt2.b}[lane],[Xn] A64 +void vst3_lane_mf8(mfloat8_t *ptr, mfloat8x8x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST3 {Vt.b - Vt3.b}[lane],[Xn] A64 +void vst4_lane_mf8(mfloat8_t *ptr, mfloat8x8x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST4 {Vt.b - Vt4.b}[lane],[Xn] A64 void vst2_lane_s16(int16_t *ptr, int16x4x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn;0 <= lane <= 3 ST2 {Vt.h - Vt2.h}[lane],[Xn] v7/A32/A64 void vst2q_lane_s16(int16_t *ptr, int16x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8H;val.val[0] -> Vt.8H;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.h - Vt2.h}[lane],[Xn] v7/A32/A64 void vst2_lane_s32(int32_t *ptr, int32x2x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.2S;val.val[0] -> Vt.2S;ptr -> Xn;0 <= lane <= 1 ST2 {Vt.s - Vt2.s}[lane],[Xn] v7/A32/A64 @@ -2639,8 +2639,8 @@ void vst2_lane_p64(poly64_t *ptr, poly64x1x2_t val, __builtin_constant_p(lane)) void vst2q_lane_p64(poly64_t *ptr, poly64x2x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn;0 <= lane <= 1 ST2 {Vt.d - Vt2.d}[lane],[Xn] A64 void vst2_lane_f64(float64_t *ptr, float64x1x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn;0 <= lane <= 0 ST2 {Vt.d - Vt2.d}[lane],[Xn] A64 void vst2q_lane_f64(float64_t *ptr, float64x2x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn;0 <= lane <= 1 ST2 {Vt.d - Vt2.d}[lane],[Xn] A64 -void vst2_lane_fm8(floatm8_t *ptr, floatm8x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.b - Vt2.b}[lane],[Xn] A64 -void vst2q_lane_fm8(floatm8_t *ptr, floatm8x16x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST2 {Vt.b - Vt2.b}[lane],[Xn] A64 +void vst2_lane_mf8(mfloat8_t *ptr, mfloat8x8x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn;0 <= lane <= 7 ST2 {Vt.b - Vt2.b}[lane],[Xn] A64 +void vst2q_lane_mf8(mfloat8_t *ptr, mfloat8x16x2_t val, __builtin_constant_p(lane)) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST2 {Vt.b - Vt2.b}[lane],[Xn] A64 void vst3_lane_s16(int16_t *ptr, int16x4x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.4H;val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn;0 <= lane <= 3 ST3 {Vt.h - Vt3.h}[lane],[Xn] v7/A32/A64 void vst3q_lane_s16(int16_t *ptr, int16x8x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.8H;val.val[1] -> Vt2.8H;val.val[0] -> Vt.8H;ptr -> Xn;0 <= lane <= 7 ST3 {Vt.h - Vt3.h}[lane],[Xn] v7/A32/A64 void vst3_lane_s32(int32_t *ptr, int32x2x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.2S;val.val[1] -> Vt2.2S;val.val[0] -> Vt.2S;ptr -> Xn;0 <= lane <= 1 ST3 {Vt.s - Vt3.s}[lane],[Xn] v7/A32/A64 @@ -2666,7 +2666,7 @@ void vst3_lane_p64(poly64_t *ptr, poly64x1x3_t val, __builtin_constant_p(lane)) void vst3q_lane_p64(poly64_t *ptr, poly64x2x3_t val,
__builtin_constant_p(lane)) val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn;0 <= lane <= 1 ST3 {Vt.d - Vt3.d}[lane],[Xn] A64 void vst3_lane_f64(float64_t *ptr, float64x1x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.1D;val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn;0 <= lane <= 0 ST3 {Vt.d - Vt3.d}[lane],[Xn] A64 void vst3q_lane_f64(float64_t *ptr, float64x2x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn;0 <= lane <= 1 ST3 {Vt.d - Vt3.d}[lane],[Xn] A64 -void vst3q_lane_fm8(floatm8_t *ptr, floatm8x16x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST3 {Vt.b - Vt3.b}[lane],[Xn] A64 +void vst3q_lane_mf8(mfloat8_t *ptr, mfloat8x16x3_t val, __builtin_constant_p(lane)) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST3 {Vt.b - Vt3.b}[lane],[Xn] A64 void vst4_lane_s16(int16_t *ptr, int16x4x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.4H;val.val[2] -> Vt3.4H;val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn;0 <= lane <= 3 ST4 {Vt.h - Vt4.h}[lane],[Xn] v7/A32/A64 void vst4q_lane_s16(int16_t *ptr, int16x8x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.8H;val.val[2] -> Vt3.8H;val.val[1] -> Vt2.8H;val.val[0] -> Vt.8H;ptr -> Xn;0 <= lane <= 7 ST4 {Vt.h - Vt4.h}[lane],[Xn] v7/A32/A64 void vst4_lane_s32(int32_t *ptr, int32x2x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.2S;val.val[2] -> Vt3.2S;val.val[1] -> Vt2.2S;val.val[0] -> Vt.2S;ptr -> Xn;0 <= lane <= 1 ST4 {Vt.s - Vt4.s}[lane],[Xn] v7/A32/A64 @@ -2692,7 +2692,7 @@ void vst4_lane_p64(poly64_t *ptr, poly64x1x4_t val, __builtin_constant_p(lane)) void vst4q_lane_p64(poly64_t *ptr, poly64x2x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.2D;val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn;0 <= lane <= 1 ST4 {Vt.d - Vt4.d}[lane],[Xn] A64 void vst4_lane_f64(float64_t *ptr, float64x1x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.1D;val.val[2] -> Vt3.1D;val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn;0 <= lane <= 0 ST4 {Vt.d - Vt4.d}[lane],[Xn] A64 void vst4q_lane_f64(float64_t *ptr, float64x2x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.2D;val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn;0 <= lane <= 1 ST4 {Vt.d - Vt4.d}[lane],[Xn] A64 -void vst4q_lane_fm8(floatm8_t *ptr, floatm8x16x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST4 {Vt.b - Vt4.b}[lane],[Xn] A64 +void vst4q_lane_mf8(mfloat8_t *ptr, mfloat8x16x4_t val, __builtin_constant_p(lane)) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn;0 <= lane <= 15 ST4 {Vt.b - Vt4.b}[lane],[Xn] A64 void vst1_s8_x2(int8_t *ptr, int8x8x2_t val) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt2.8B},[Xn] v7/A32/A64 void vst1q_s8_x2(int8_t *ptr, int8x16x2_t val) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt2.16B},[Xn] v7/A32/A64 void vst1_s16_x2(int16_t *ptr, int16x4x2_t val) val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn ST1 {Vt.4H - Vt2.4H},[Xn] v7/A32/A64 @@ -2721,8 +2721,8 @@ void vst1q_u64_x2(uint64_t *ptr, uint64x2x2_t val) val.val[1] -> Vt2.2D;val.val[ void vst1q_p64_x2(poly64_t *ptr, poly64x2x2_t val) val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST1 {Vt.2D - Vt2.2D},[Xn] A32/A64 void 
vst1_f64_x2(float64_t *ptr, float64x1x2_t val) val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn ST1 {Vt.1D - Vt2.1D},[Xn] A64 void vst1q_f64_x2(float64_t *ptr, float64x2x2_t val) val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST1 {Vt.2D - Vt2.2D},[Xn] A64 -void vst1_fm8_x2(floatm8_t *ptr, floatm8x8x2_t val) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt2.8B},[Xn] A64 -void vst1q_fm8_x2(floatm8_t *ptr, floatm8x16x2_t val) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt2.16B},[Xn] A64 +void vst1_mf8_x2(mfloat8_t *ptr, mfloat8x8x2_t val) val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt2.8B},[Xn] A64 +void vst1q_mf8_x2(mfloat8_t *ptr, mfloat8x16x2_t val) val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt2.16B},[Xn] A64 void vst1_s8_x3(int8_t *ptr, int8x8x3_t val) val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt3.8B},[Xn] v7/A32/A64 void vst1q_s8_x3(int8_t *ptr, int8x16x3_t val) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt3.16B},[Xn] v7/A32/A64 void vst1_s16_x3(int16_t *ptr, int16x4x3_t val) val.val[2] -> Vt3.4H;val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn ST1 {Vt.4H - Vt3.4H},[Xn] v7/A32/A64 @@ -2751,8 +2751,8 @@ void vst1q_u64_x3(uint64_t *ptr, uint64x2x3_t val) val.val[2] -> Vt3.2D;val.val[ void vst1q_p64_x3(poly64_t *ptr, poly64x2x3_t val) val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST1 {Vt.2D - Vt3.2D},[Xn] v7/A32/A64 void vst1_f64_x3(float64_t *ptr, float64x1x3_t val) val.val[2] -> Vt3.1D;val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn ST1 {Vt.1D - Vt3.1D},[Xn] A64 void vst1q_f64_x3(float64_t *ptr, float64x2x3_t val) val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST1 {Vt.2D - Vt3.2D},[Xn] A64 -void vst1_fm8_x3(floatm8_t *ptr, floatm8x8x3_t val) val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt3.8B},[Xn] A64 -void vst1q_fm8_x3(floatm8_t *ptr, floatm8x16x3_t val) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt3.16B},[Xn] A64 +void vst1_mf8_x3(mfloat8_t *ptr, mfloat8x8x3_t val) val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt3.8B},[Xn] A64 +void vst1q_mf8_x3(mfloat8_t *ptr, mfloat8x16x3_t val) val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt3.16B},[Xn] A64 void vst1_s8_x4(int8_t *ptr, int8x8x4_t val) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt4.8B},[Xn] v7/A32/A64 void vst1q_s8_x4(int8_t *ptr, int8x16x4_t val) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt4.16B},[Xn] v7/A32/A64 void vst1_s16_x4(int16_t *ptr, int16x4x4_t val) val.val[3] -> Vt4.4H;val.val[2] -> Vt3.4H;val.val[1] -> Vt2.4H;val.val[0] -> Vt.4H;ptr -> Xn ST1 {Vt.4H - Vt4.4H},[Xn] v7/A32/A64 @@ -2781,8 +2781,8 @@ void vst1q_u64_x4(uint64_t *ptr, uint64x2x4_t val) val.val[3] -> Vt4.2D;val.val[ void vst1q_p64_x4(poly64_t *ptr, poly64x2x4_t val) val.val[3] -> Vt4.2D;val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST1 {Vt.2D - Vt4.2D},[Xn] A32/A64 void vst1_f64_x4(float64_t *ptr, float64x1x4_t val) val.val[3] -> Vt4.1D;val.val[2] -> Vt3.1D;val.val[1] -> Vt2.1D;val.val[0] -> Vt.1D;ptr -> Xn ST1 {Vt.1D - Vt4.1D},[Xn] A64 void vst1q_f64_x4(float64_t *ptr, float64x2x4_t val) val.val[3] -> 
Vt4.2D;val.val[2] -> Vt3.2D;val.val[1] -> Vt2.2D;val.val[0] -> Vt.2D;ptr -> Xn ST1 {Vt.2D - Vt4.2D},[Xn] A64 -void vst1_fm8_x4(int8_t *ptr, int8x8x4_t val) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt4.8B},[Xn] v7/A32/A64 -void vst1q_fm8_x4(int8_t *ptr, int8x16x4_t val) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt4.16B},[Xn] v7/A32/A64 +void vst1_mf8_x4(mfloat8_t *ptr, mfloat8x8x4_t val) val.val[3] -> Vt4.8B;val.val[2] -> Vt3.8B;val.val[1] -> Vt2.8B;val.val[0] -> Vt.8B;ptr -> Xn ST1 {Vt.8B - Vt4.8B},[Xn] A64 +void vst1q_mf8_x4(mfloat8_t *ptr, mfloat8x16x4_t val) val.val[3] -> Vt4.16B;val.val[2] -> Vt3.16B;val.val[1] -> Vt2.16B;val.val[0] -> Vt.16B;ptr -> Xn ST1 {Vt.16B - Vt4.16B},[Xn] A64 int8x8x2_t vld1_s8_x2(int8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x2_t vld1q_s8_x2(int8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x2_t vld1_s16_x2(int16_t const *ptr) ptr -> Xn LD1 {Vt.4H - Vt2.4H},[Xn] Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2811,8 +2811,8 @@ uint64x2x2_t vld1q_u64_x2(uint64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt2.2D},[X poly64x2x2_t vld1q_p64_x2(poly64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt2.2D},[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A32/A64 float64x1x2_t vld1_f64_x2(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D - Vt2.1D},[Xn] Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x2_t vld1q_f64_x2(float64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt2.2D},[Xn] Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x2_t vld1_fm8_x2(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x2_t vld1q_fm8_x2(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x2_t vld1_mf8_x2(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt2.8B},[Xn] Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x2_t vld1q_mf8_x2(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt2.16B},[Xn] Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8x3_t vld1_s8_x3(int8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x3_t vld1q_s8_x3(int8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x3_t vld1_s16_x3(int16_t const *ptr) ptr -> Xn LD1 {Vt.4H - Vt3.4H},[Xn] Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2841,8 +2841,8 @@ uint64x2x3_t vld1q_u64_x3(uint64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt3.2D},[X poly64x2x3_t vld1q_p64_x3(poly64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt3.2D},[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A32/A64 float64x1x3_t vld1_f64_x3(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D - Vt3.1D},[Xn] Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x3_t vld1q_f64_x3(float64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt3.2D},[Xn] Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x3_t vld1_fm8_x3(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64
-floatm8x16x3_t vld1q_fm8_x3(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x3_t vld1_mf8_x3(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt3.8B},[Xn] Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x3_t vld1q_mf8_x3(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt3.16B},[Xn] Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8x4_t vld1_s8_x4(int8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] v7/A32/A64 int8x16x4_t vld1q_s8_x4(int8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] v7/A32/A64 int16x4x4_t vld1_s16_x4(int16_t const *ptr) ptr -> Xn LD1 {Vt.4H - Vt4.4H},[Xn] Vt4.4H -> result.val[3];Vt3.4H -> result.val[2];Vt2.4H -> result.val[1];Vt.4H -> result.val[0] v7/A32/A64 @@ -2871,8 +2871,8 @@ uint64x2x4_t vld1q_u64_x4(uint64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt4.2D},[X poly64x2x4_t vld1q_p64_x4(poly64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt4.2D},[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A32/A64 float64x1x4_t vld1_f64_x4(float64_t const *ptr) ptr -> Xn LD1 {Vt.1D - Vt4.1D},[Xn] Vt4.1D -> result.val[3];Vt3.1D -> result.val[2];Vt2.1D -> result.val[1];Vt.1D -> result.val[0] A64 float64x2x4_t vld1q_f64_x4(float64_t const *ptr) ptr -> Xn LD1 {Vt.2D - Vt4.2D},[Xn] Vt4.2D -> result.val[3];Vt3.2D -> result.val[2];Vt2.2D -> result.val[1];Vt.2D -> result.val[0] A64 -floatm8x8x4_t vld1_fm8_x4(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 -floatm8x16x4_t vld1q_fm8_x4(floatm8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 +mfloat8x8x4_t vld1_mf8_x4(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.8B - Vt4.8B},[Xn] Vt4.8B -> result.val[3];Vt3.8B -> result.val[2];Vt2.8B -> result.val[1];Vt.8B -> result.val[0] A64 +mfloat8x16x4_t vld1q_mf8_x4(mfloat8_t const *ptr) ptr -> Xn LD1 {Vt.16B - Vt4.16B},[Xn] Vt4.16B -> result.val[3];Vt3.16B -> result.val[2];Vt2.16B -> result.val[1];Vt.16B -> result.val[0] A64 int8x8_t vpadd_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B ADDP Vd.8B,Vn.8B,Vm.8B Vd.8B -> result v7/A32/A64 int16x4_t vpadd_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H ADDP Vd.4H,Vn.4H,Vm.4H Vd.4H -> result v7/A32/A64 int32x2_t vpadd_s32(int32x2_t a, int32x2_t b) a -> Vn.2S;b -> Vm.2S ADDP Vd.2S,Vn.2S,Vm.2S Vd.2S -> result v7/A32/A64 @@ -3053,8 +3053,8 @@ poly8x8_t vext_p8(poly8x8_t a, poly8x8_t b, __builtin_constant_p(n)) a -> Vn.8B; poly8x16_t vextq_p8(poly8x16_t a, poly8x16_t b, __builtin_constant_p(n)) a -> Vn.16B;b -> Vm.16B;0 <= n <= 15 EXT Vd.16B,Vn.16B,Vm.16B,#n Vd.16B -> result v7/A32/A64 poly16x4_t vext_p16(poly16x4_t a, poly16x4_t b, __builtin_constant_p(n)) a -> Vn.8B;b -> Vm.8B;0 <= n <= 3 EXT Vd.8B,Vn.8B,Vm.8B,#(n<<1) Vd.8B -> result v7/A32/A64 poly16x8_t vextq_p16(poly16x8_t a, poly16x8_t b, __builtin_constant_p(n)) a -> Vn.16B;b -> Vm.16B;0 <= n <= 7 EXT Vd.16B,Vn.16B,Vm.16B,#(n<<1) Vd.16B -> result v7/A32/A64 -floatm8x8_t vext_fm8(floatm8x8_t a, floatm8x8_t b, __builtin_constant_p(n)) a -> Vn.8B;b -> Vm.8B;0 <= n <= 7 EXT Vd.8B,Vn.8B,Vm.8B,#n 
Vd.8B -> result A64 -floatm8x16_t vextq_fm8(floatm8x16_t a, floatm8x16_t b, __builtin_constant_p(n)) a -> Vn.16B;b -> Vm.16B;0 <= n <= 15 EXT Vd.16B,Vn.16B,Vm.16B,#n Vd.16B -> result A64 +mfloat8x8_t vext_mf8(mfloat8x8_t a, mfloat8x8_t b, __builtin_constant_p(n)) a -> Vn.8B;b -> Vm.8B;0 <= n <= 7 EXT Vd.8B,Vn.8B,Vm.8B,#n Vd.8B -> result A64 +mfloat8x16_t vextq_mf8(mfloat8x16_t a, mfloat8x16_t b, __builtin_constant_p(n)) a -> Vn.16B;b -> Vm.16B;0 <= n <= 15 EXT Vd.16B,Vn.16B,Vm.16B,#n Vd.16B -> result A64 int8x8_t vrev64_s8(int8x8_t vec) vec -> Vn.8B REV64 Vd.8B,Vn.8B Vd.8B -> result v7/A32/A64 int8x16_t vrev64q_s8(int8x16_t vec) vec -> Vn.16B REV64 Vd.16B,Vn.16B Vd.16B -> result v7/A32/A64 int16x4_t vrev64_s16(int16x4_t vec) vec -> Vn.4H REV64 Vd.4H,Vn.4H Vd.4H -> result v7/A32/A64 @@ -3073,8 +3073,8 @@ poly8x8_t vrev64_p8(poly8x8_t vec) vec -> Vn.8B REV64 Vd.8B,Vn.8B Vd.8B -> resul poly8x16_t vrev64q_p8(poly8x16_t vec) vec -> Vn.16B REV64 Vd.16B,Vn.16B Vd.16B -> result v7/A32/A64 poly16x4_t vrev64_p16(poly16x4_t vec) vec -> Vn.4H REV64 Vd.4H,Vn.4H Vd.4H -> result v7/A32/A64 poly16x8_t vrev64q_p16(poly16x8_t vec) vec -> Vn.8H REV64 Vd.8H,Vn.8H Vd.8H -> result v7/A32/A64 -floatm8x8_t vrev64_fm8(floatm8x8_t vec) vec -> Vn.8B REV64 Vd.8B,Vn.8B Vd.8B -> result A64 -floatm8x16_t vrev64q_fm8(floatm8x16_t vec) vec -> Vn.16B REV64 Vd.16B,Vn.16B Vd.16B -> result A64 +mfloat8x8_t vrev64_mf8(mfloat8x8_t vec) vec -> Vn.8B REV64 Vd.8B,Vn.8B Vd.8B -> result A64 +mfloat8x16_t vrev64q_mf8(mfloat8x16_t vec) vec -> Vn.16B REV64 Vd.16B,Vn.16B Vd.16B -> result A64 int8x8_t vrev32_s8(int8x8_t vec) vec -> Vn.8B REV32 Vd.8B,Vn.8B Vd.8B -> result v7/A32/A64 int8x16_t vrev32q_s8(int8x16_t vec) vec -> Vn.16B REV32 Vd.16B,Vn.16B Vd.16B -> result v7/A32/A64 int16x4_t vrev32_s16(int16x4_t vec) vec -> Vn.4H REV32 Vd.4H,Vn.4H Vd.4H -> result v7/A32/A64 @@ -3087,16 +3087,16 @@ poly8x8_t vrev32_p8(poly8x8_t vec) vec -> Vn.8B REV32 Vd.8B,Vn.8B Vd.8B -> resul poly8x16_t vrev32q_p8(poly8x16_t vec) vec -> Vn.16B REV32 Vd.16B,Vn.16B Vd.16B -> result v7/A32/A64 poly16x4_t vrev32_p16(poly16x4_t vec) vec -> Vn.4H REV32 Vd.4H,Vn.4H Vd.4H -> result v7/A32/A64 poly16x8_t vrev32q_p16(poly16x8_t vec) vec -> Vn.8H REV32 Vd.8H,Vn.8H Vd.8H -> result v7/A32/A64 -floatm8x8_t vrev32_fm8(floatm8x8_t vec) vec -> Vn.8B REV32 Vd.8B,Vn.8B Vd.8B -> result A64 -floatm8x16_t vrev32q_fm8(floatm8x16_t vec) vec -> Vn.16B REV32 Vd.16B,Vn.16B Vd.16B -> result A64 +mfloat8x8_t vrev32_mf8(mfloat8x8_t vec) vec -> Vn.8B REV32 Vd.8B,Vn.8B Vd.8B -> result A64 +mfloat8x16_t vrev32q_mf8(mfloat8x16_t vec) vec -> Vn.16B REV32 Vd.16B,Vn.16B Vd.16B -> result A64 int8x8_t vrev16_s8(int8x8_t vec) vec -> Vn.8B REV16 Vd.8B,Vn.8B Vd.8B -> result v7/A32/A64 int8x16_t vrev16q_s8(int8x16_t vec) vec -> Vn.16B REV16 Vd.16B,Vn.16B Vd.16B -> result v7/A32/A64 uint8x8_t vrev16_u8(uint8x8_t vec) vec -> Vn.8B REV16 Vd.8B,Vn.8B Vd.8B -> result v7/A32/A64 uint8x16_t vrev16q_u8(uint8x16_t vec) vec -> Vn.16B REV16 Vd.16B,Vn.16B Vd.16B -> result v7/A32/A64 poly8x8_t vrev16_p8(poly8x8_t vec) vec -> Vn.8B REV16 Vd.8B,Vn.8B Vd.8B -> result v7/A32/A64 poly8x16_t vrev16q_p8(poly8x16_t vec) vec -> Vn.16B REV16 Vd.16B,Vn.16B Vd.16B -> result v7/A32/A64 -floatm8x8_t vrev16_fm8(floatm8x8_t vec) vec -> Vn.8B REV16 Vd.8B,Vn.8B Vd.8B -> result A64 -floatm8x16_t vrev16q_fm8(floatm8x16_t vec) vec -> Vn.16B REV16 Vd.16B,Vn.16B Vd.16B -> result A64 +mfloat8x8_t vrev16_mf8(mfloat8x8_t vec) vec -> Vn.8B REV16 Vd.8B,Vn.8B Vd.8B -> result A64 +mfloat8x16_t vrev16q_mf8(mfloat8x16_t vec) vec -> Vn.16B 
REV16 Vd.16B,Vn.16B Vd.16B -> result A64 int8x8_t vzip1_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 int8x16_t vzip1q_s8(int8x16_t a, int8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int16x4_t vzip1_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H ZIP1 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 @@ -3119,8 +3119,8 @@ poly8x8_t vzip1_p8(poly8x8_t a, poly8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP1 Vd.8B,Vn poly8x16_t vzip1q_p8(poly8x16_t a, poly8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 poly16x4_t vzip1_p16(poly16x4_t a, poly16x4_t b) a -> Vn.4H;b -> Vm.4H ZIP1 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 poly16x8_t vzip1q_p16(poly16x8_t a, poly16x8_t b) a -> Vn.8H;b -> Vm.8H ZIP1 Vd.8H,Vn.8H,Vm.8H Vd.8H -> result A64 -floatm8x8_t vzip1_fm8(floatm8x8_t a, floatm8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 -floatm8x16_t vzip1q_fm8(floatm8x16_t a, floatm8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 +mfloat8x8_t vzip1_mf8(mfloat8x8_t a, mfloat8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 +mfloat8x16_t vzip1q_mf8(mfloat8x16_t a, mfloat8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int8x8_t vzip2_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 int8x16_t vzip2q_s8(int8x16_t a, int8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int16x4_t vzip2_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H ZIP2 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 @@ -3143,8 +3143,8 @@ poly8x8_t vzip2_p8(poly8x8_t a, poly8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP2 Vd.8B,Vn poly8x16_t vzip2q_p8(poly8x16_t a, poly8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 poly16x4_t vzip2_p16(poly16x4_t a, poly16x4_t b) a -> Vn.4H;b -> Vm.4H ZIP2 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 poly16x8_t vzip2q_p16(poly16x8_t a, poly16x8_t b) a -> Vn.8H;b -> Vm.8H ZIP2 Vd.8H,Vn.8H,Vm.8H Vd.8H -> result A64 -floatm8x8_t vzip2_fm8(floatm8x8_t a, floatm8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 -floatm8x16_t vzip2q_fm8(floatm8x16_t a, floatm8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 +mfloat8x8_t vzip2_mf8(mfloat8x8_t a, mfloat8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 +mfloat8x16_t vzip2q_mf8(mfloat8x16_t a, mfloat8x16_t b) a -> Vn.16B;b -> Vm.16B ZIP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int8x8_t vuzp1_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B UZP1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 int8x16_t vuzp1q_s8(int8x16_t a, int8x16_t b) a -> Vn.16B;b -> Vm.16B UZP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int16x4_t vuzp1_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H UZP1 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 @@ -3167,8 +3167,8 @@ poly8x8_t vuzp1_p8(poly8x8_t a, poly8x8_t b) a -> Vn.8B;b -> Vm.8B UZP1 Vd.8B,Vn poly8x16_t vuzp1q_p8(poly8x16_t a, poly8x16_t b) a -> Vn.16B;b -> Vm.16B UZP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 poly16x4_t vuzp1_p16(poly16x4_t a, poly16x4_t b) a -> Vn.4H;b -> Vm.4H UZP1 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 poly16x8_t vuzp1q_p16(poly16x8_t a, poly16x8_t b) a -> Vn.8H;b -> Vm.8H UZP1 Vd.8H,Vn.8H,Vm.8H Vd.8H -> result A64 -floatm8x8_t vuzp1_fm8(floatm8x8_t a, floatm8x8_t b) a -> Vn.8B;b -> Vm.8B UZP1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 -floatm8x16_t vuzp1q_fm8(floatm8x16_t a, 
floatm8x16_t b) a -> Vn.16B;b -> Vm.16B UZP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 +mfloat8x8_t vuzp1_mf8(mfloat8x8_t a, mfloat8x8_t b) a -> Vn.8B;b -> Vm.8B UZP1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 +mfloat8x16_t vuzp1q_mf8(mfloat8x16_t a, mfloat8x16_t b) a -> Vn.16B;b -> Vm.16B UZP1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int8x8_t vuzp2_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B UZP2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 int8x16_t vuzp2q_s8(int8x16_t a, int8x16_t b) a -> Vn.16B;b -> Vm.16B UZP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int16x4_t vuzp2_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H UZP2 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 @@ -3191,8 +3191,8 @@ poly8x8_t vuzp2_p8(poly8x8_t a, poly8x8_t b) a -> Vn.8B;b -> Vm.8B UZP2 Vd.8B,Vn poly8x16_t vuzp2q_p8(poly8x16_t a, poly8x16_t b) a -> Vn.16B;b -> Vm.16B UZP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 poly16x4_t vuzp2_p16(poly16x4_t a, poly16x4_t b) a -> Vn.4H;b -> Vm.4H UZP2 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 poly16x8_t vuzp2q_p16(poly16x8_t a, poly16x8_t b) a -> Vn.8H;b -> Vm.8H UZP2 Vd.8H,Vn.8H,Vm.8H Vd.8H -> result A64 -floatm8x8_t vuzp2_fm8(floatm8x8_t a, floatm8x8_t b) a -> Vn.8B;b -> Vm.8B UZP2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 -floatm8x16_t vuzp2q_fm8(floatm8x16_t a, floatm8x16_t b) a -> Vn.16B;b -> Vm.16B UZP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 +mfloat8x8_t vuzp2_mf8(mfloat8x8_t a, mfloat8x8_t b) a -> Vn.8B;b -> Vm.8B UZP2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 +mfloat8x16_t vuzp2q_mf8(mfloat8x16_t a, mfloat8x16_t b) a -> Vn.16B;b -> Vm.16B UZP2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int8x8_t vtrn1_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B TRN1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 int8x16_t vtrn1q_s8(int8x16_t a, int8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int16x4_t vtrn1_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H TRN1 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 @@ -3215,8 +3215,8 @@ poly8x8_t vtrn1_p8(poly8x8_t a, poly8x8_t b) a -> Vn.8B;b -> Vm.8B TRN1 Vd.8B,Vn poly8x16_t vtrn1q_p8(poly8x16_t a, poly8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 poly16x4_t vtrn1_p16(poly16x4_t a, poly16x4_t b) a -> Vn.4H;b -> Vm.4H TRN1 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 poly16x8_t vtrn1q_p16(poly16x8_t a, poly16x8_t b) a -> Vn.8H;b -> Vm.8H TRN1 Vd.8H,Vn.8H,Vm.8H Vd.8H -> result A64 -floatm8x8_t vtrn1_fm8(floatm8x8_t a, floatm8x8_t b) a -> Vn.8B;b -> Vm.8B TRN1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 -floatm8x16_t vtrn1q_fm8(floatm8x16_t a, floatm8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 +mfloat8x8_t vtrn1_mf8(mfloat8x8_t a, mfloat8x8_t b) a -> Vn.8B;b -> Vm.8B TRN1 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 +mfloat8x16_t vtrn1q_mf8(mfloat8x16_t a, mfloat8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int8x8_t vtrn2_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B TRN2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 int8x16_t vtrn2q_s8(int8x16_t a, int8x16_t b) a -> Vn.16B;b -> Vm.16B TRN2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int16x4_t vtrn2_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H TRN2 Vd.4H,Vn.4H,Vm.4H Vd.4H -> result A64 @@ -3239,8 +3239,8 @@ poly8x8_t vtrn2_p8(poly8x8_t a, poly8x8_t b) a -> Vn.8B;b -> Vm.8B TRN2 Vd.8B,Vn poly8x16_t vtrn2q_p8(poly8x16_t a, poly8x16_t b) a -> Vn.16B;b -> Vm.16B TRN2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 poly16x4_t vtrn2_p16(poly16x4_t a, poly16x4_t b) a -> Vn.4H;b -> Vm.4H TRN2 Vd.4H,Vn.4H,Vm.4H Vd.4H -> 
result A64 poly16x8_t vtrn2q_p16(poly16x8_t a, poly16x8_t b) a -> Vn.8H;b -> Vm.8H TRN2 Vd.8H,Vn.8H,Vm.8H Vd.8H -> result A64 -floatm8x8_t vtrn2_fm8(floatm8x8_t a, floatm8x8_t b) a -> Vn.8B;b -> Vm.8B TRN2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 -floatm8x16_t vtrn2q_fm8(floatm8x16_t a, floatm8x16_t b) a -> Vn.16B;b -> Vm.16B TRN2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 +mfloat8x8_t vtrn2_mf8(mfloat8x8_t a, mfloat8x8_t b) a -> Vn.8B;b -> Vm.8B TRN2 Vd.8B,Vn.8B,Vm.8B Vd.8B -> result A64 +mfloat8x16_t vtrn2q_mf8(mfloat8x16_t a, mfloat8x16_t b) a -> Vn.16B;b -> Vm.16B TRN2 Vd.16B,Vn.16B,Vm.16B Vd.16B -> result A64 int8x8_t vtbl1_s8(int8x8_t a, int8x8_t idx) Zeros(64):a -> Vn.16B;idx -> Vm.8B TBL Vd.8B,{Vn.16B},Vm.8B Vd.8B -> result v7/A32/A64 uint8x8_t vtbl1_u8(uint8x8_t a, uint8x8_t idx) Zeros(64):a -> Vn.16B;idx -> Vm.8B TBL Vd.8B,{Vn.16B},Vm.8B Vd.8B -> result v7/A32/A64 poly8x8_t vtbl1_p8(poly8x8_t a, uint8x8_t idx) Zeros(64):a -> Vn.16B;idx -> Vm.8B TBL Vd.8B,{Vn.16B},Vm.8B Vd.8B -> result v7/A32/A64 @@ -3356,7 +3356,7 @@ float16x4_t vset_lane_f16(float16_t a, float16x4_t v, __builtin_constant_p(lane) float16x8_t vsetq_lane_f16(float16_t a, float16x8_t v, __builtin_constant_p(lane)) 0<=lane<=7;a -> VnH;v -> Vd.8H MOV Vd.H[lane],Vn.H[0] Vd.8H -> result v7/A32/A64 float32x2_t vset_lane_f32(float32_t a, float32x2_t v, __builtin_constant_p(lane)) 0<=lane<=1;a -> Rn;v -> Vd.2S MOV Vd.S[lane],Rn Vd.2S -> result v7/A32/A64 float64x1_t vset_lane_f64(float64_t a, float64x1_t v, __builtin_constant_p(lane)) lane==0;a -> Rn;v -> Vd.1D MOV Vd.D[lane],Rn Vd.1D -> result A64 -floatm8x8_t vset_lane_fm8(floatm8_t a, floatm8x8_t v, __builtin_constant_p(lane)) 0<=lane<=7;a -> Rn;v -> Vd.8B MOV Vd.B[lane],Rn Vd.8B -> result A64 +mfloat8x8_t vset_lane_mf8(mfloat8_t a, mfloat8x8_t v, __builtin_constant_p(lane)) 0<=lane<=7;a -> Rn;v -> Vd.8B MOV Vd.B[lane],Rn Vd.8B -> result A64 uint8x16_t vsetq_lane_u8(uint8_t a, uint8x16_t v, __builtin_constant_p(lane)) 0<=lane<=15;a -> Rn;v -> Vd.16B MOV Vd.B[lane],Rn Vd.16B -> result v7/A32/A64 uint16x8_t vsetq_lane_u16(uint16_t a, uint16x8_t v, __builtin_constant_p(lane)) 0<=lane<=7;a -> Rn;v -> Vd.8H MOV Vd.H[lane],Rn Vd.8H -> result v7/A32/A64 uint32x4_t vsetq_lane_u32(uint32_t a, uint32x4_t v, __builtin_constant_p(lane)) 0<=lane<=3;a -> Rn;v -> Vd.4S MOV Vd.S[lane],Rn Vd.4S -> result v7/A32/A64 @@ -3370,7 +3370,7 @@ poly8x16_t vsetq_lane_p8(poly8_t a, poly8x16_t v, __builtin_constant_p(lane)) 0< poly16x8_t vsetq_lane_p16(poly16_t a, poly16x8_t v, __builtin_constant_p(lane)) 0<=lane<=7;a -> Rn;v -> Vd.8H MOV Vd.H[lane],Rn Vd.8H -> result v7/A32/A64 float32x4_t vsetq_lane_f32(float32_t a, float32x4_t v, __builtin_constant_p(lane)) 0<=lane<=3;a -> Rn;v -> Vd.4S MOV Vd.S[lane],Rn Vd.4S -> result v7/A32/A64 float64x2_t vsetq_lane_f64(float64_t a, float64x2_t v, __builtin_constant_p(lane)) 0<=lane<=1;a -> Rn;v -> Vd.2D MOV Vd.D[lane],Rn Vd.2D -> result A64 -floatm8x16_t vsetq_lane_fm8(floatm8_t a, floatm8x16_t v, __builtin_constant_p(lane)) 0<=lane<=15;a -> Rn;v -> Vd.16B MOV Vd.B[lane],Rn Vd.16B -> result A64 +mfloat8x16_t vsetq_lane_mf8(mfloat8_t a, mfloat8x16_t v, __builtin_constant_p(lane)) 0<=lane<=15;a -> Rn;v -> Vd.16B MOV Vd.B[lane],Rn Vd.16B -> result A64 float32_t vrecpxs_f32(float32_t a) a -> Sn FRECPX Sd,Sn Sd -> result A64 float64_t vrecpxd_f64(float64_t a) a -> Dn FRECPX Dd,Dn Dd -> result A64 float32x2_t vfma_n_f32(float32x2_t a, float32x2_t b, float32_t n) n -> Vm.S[0];b -> Vn.2S;a -> Vd.2S FMLA Vd.2S,Vn.2S,Vm.S[0] Vd.2S -> result v7/A32/A64 @@ -3390,7 +3390,7 
@@ poly16x4x2_t vtrn_p16(poly16x4_t a, poly16x4_t b) a -> Vn.4H;b -> Vm.4H TRN1 Vd1 int32x2x2_t vtrn_s32(int32x2_t a, int32x2_t b) a -> Vn.2S;b -> Vm.2S TRN1 Vd1.2S,Vn.2S,Vm.2S;TRN2 Vd2.2S,Vn.2S,Vm.2S Vd1.2S -> result.val[0];Vd2.2S -> result.val[1] v7/A32/A64 float32x2x2_t vtrn_f32(float32x2_t a, float32x2_t b) a -> Vn.2S;b -> Vm.2S TRN1 Vd1.2S,Vn.2S,Vm.2S;TRN2 Vd2.2S,Vn.2S,Vm.2S Vd1.2S -> result.val[0];Vd2.2S -> result.val[1] v7/A32/A64 uint32x2x2_t vtrn_u32(uint32x2_t a, uint32x2_t b) a -> Vn.2S;b -> Vm.2S TRN1 Vd1.2S,Vn.2S,Vm.2S;TRN2 Vd2.2S,Vn.2S,Vm.2S Vd1.2S -> result.val[0];Vd2.2S -> result.val[1] v7/A32/A64 -floatm8x8x2_t vtrn_fm8(floatm8x8_t a, floatm8x8_t b) a -> Vn.8B;b -> Vm.8B TRN1 Vd1.8B,Vn.8B,Vm.8B;TRN2 Vd2.8B,Vn.8B,Vm.8B Vd1.8B -> result.val[0];Vd2.8B -> result.val[1] A64 +mfloat8x8x2_t vtrn_mf8(mfloat8x8_t a, mfloat8x8_t b) a -> Vn.8B;b -> Vm.8B TRN1 Vd1.8B,Vn.8B,Vm.8B;TRN2 Vd2.8B,Vn.8B,Vm.8B Vd1.8B -> result.val[0];Vd2.8B -> result.val[1] A64 int8x16x2_t vtrnq_s8(int8x16_t a, int8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd1.16B,Vn.16B,Vm.16B;TRN2 Vd2.16B,Vn.16B,Vm.16B Vd1.16B -> result.val[0];Vd2.16B -> result.val[1] v7/A32/A64 int16x8x2_t vtrnq_s16(int16x8_t a, int16x8_t b) a -> Vn.8H;b -> Vm.8H TRN1 Vd1.8H,Vn.8H,Vm.8H;TRN2 Vd2.8H,Vn.8H,Vm.8H Vd1.8H -> result.val[0];Vd2.8H -> result.val[1] v7/A32/A64 int32x4x2_t vtrnq_s32(int32x4_t a, int32x4_t b) a -> Vn.4S;b -> Vm.4S TRN1 Vd1.4S,Vn.4S,Vm.4S;TRN2 Vd2.4S,Vn.4S,Vm.4S Vd1.4S -> result.val[0];Vd2.4S -> result.val[1] v7/A32/A64 @@ -3400,7 +3400,7 @@ uint16x8x2_t vtrnq_u16(uint16x8_t a, uint16x8_t b) a -> Vn.8H;b -> Vm.8H TRN1 Vd uint32x4x2_t vtrnq_u32(uint32x4_t a, uint32x4_t b) a -> Vn.4S;b -> Vm.4S TRN1 Vd1.4S,Vn.4S,Vm.4S;TRN2 Vd2.4S,Vn.4S,Vm.4S Vd1.4S -> result.val[0];Vd2.4S -> result.val[1] v7/A32/A64 poly8x16x2_t vtrnq_p8(poly8x16_t a, poly8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd1.16B,Vn.16B,Vm.16B;TRN2 Vd2.16B,Vn.16B,Vm.16B Vd1.16B -> result.val[0];Vd2.16B -> result.val[1] v7/A32/A64 poly16x8x2_t vtrnq_p16(poly16x8_t a, poly16x8_t b) a -> Vn.8H;b -> Vm.8H TRN1 Vd1.8H,Vn.8H,Vm.8H;TRN2 Vd2.8H,Vn.8H,Vm.8H Vd1.8H -> result.val[0];Vd2.8H -> result.val[1] v7/A32/A64 -floatm8x16x2_t vtrnq_fm8(floatm8x16_t a, floatm8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd1.16B,Vn.16B,Vm.16B;TRN2 Vd2.16B,Vn.16B,Vm.16B Vd1.16B -> result.val[0];Vd2.16B -> result.val[1] A64 +mfloat8x16x2_t vtrnq_mf8(mfloat8x16_t a, mfloat8x16_t b) a -> Vn.16B;b -> Vm.16B TRN1 Vd1.16B,Vn.16B,Vm.16B;TRN2 Vd2.16B,Vn.16B,Vm.16B Vd1.16B -> result.val[0];Vd2.16B -> result.val[1] A64 int8x8x2_t vzip_s8(int8x8_t a, int8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP1 Vd1.8B,Vn.8B,Vm.8B;ZIP2 Vd2.8B,Vn.8B,Vm.8B Vd1.8B -> result.val[0];Vd2.8B -> result.val[1] v7/A32/A64 int16x4x2_t vzip_s16(int16x4_t a, int16x4_t b) a -> Vn.4H;b -> Vm.4H ZIP1 Vd1.4H,Vn.4H,Vm.4H;ZIP2 Vd2.4H,Vn.4H,Vm.4H Vd1.4H -> result.val[0];Vd2.4H -> result.val[1] v7/A32/A64 uint8x8x2_t vzip_u8(uint8x8_t a, uint8x8_t b) a -> Vn.8B;b -> Vm.8B ZIP1 Vd1.8B,Vn.8B,Vm.8B;ZIP2 Vd2.8B,Vn.8B,Vm.8B Vd1.8B -> result.val[0];Vd2.8B -> result.val[1] v7/A32/A64 @@ -3823,10 +3823,10 @@ uint64x2_t vreinterpretq_u64_p128(poly128_t a) a -> Vd.1Q NOP Vd.2D -> result A3 int64x2_t vreinterpretq_s64_p128(poly128_t a) a -> Vd.1Q NOP Vd.2D -> result A32/A64 float64x2_t vreinterpretq_f64_p128(poly128_t a) a -> Vd.1Q NOP Vd.2D -> result A64 float16x8_t vreinterpretq_f16_p128(poly128_t a) a -> Vd.1Q NOP Vd.8H -> result A32/A64 -floatm8x8_t vreinterpret_fm8_u8(uint8x8_t a) a -> Vd.8B NOP Vd.8B -> result A64 -floatm8x16_t vreinterpretq_fm8_u8(uint8x16_t 
a) a -> Vd.16B NOP Vd.16B -> result A64 -uint8x8_t vreinterpret_u8_fm8(floatm8x8_t a) a -> Vd.8B NOP Vd.8B -> result A64 -uint8x16_t vreinterpretq_u8_fm8(floatm8x16_t a) a -> Vd.16B NOP Vd.16B -> result A64 +mfloat8x8_t vreinterpret_mf8_u8(uint8x8_t a) a -> Vd.8B NOP Vd.8B -> result A64 +mfloat8x16_t vreinterpretq_mf8_u8(uint8x16_t a) a -> Vd.16B NOP Vd.16B -> result A64 +uint8x8_t vreinterpret_u8_mf8(mfloat8x8_t a) a -> Vd.8B NOP Vd.8B -> result A64 +uint8x16_t vreinterpretq_u8_mf8(mfloat8x16_t a) a -> Vd.16B NOP Vd.16B -> result A64 poly128_t vldrq_p128(poly128_t const *ptr) ptr -> Xn LDR Qd,[Xn] Qd -> result A32/A64 void vstrq_p128(poly128_t *ptr, poly128_t val) val -> Qt;ptr -> Xn STR Qt,[Xn] A32/A64
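The `vreinterpret(q)_mf8_u8` / `vreinterpret(q)_u8_mf8` casts added above are pure bit casts (they lower to a NOP), so they are the natural way to move raw FP8 bytes in and out of the opaque `mfloat8` vectors. A minimal sketch, assuming a toolchain that already implements the renamed types and intrinsics; the `load_fp8_bytes` helper is hypothetical:

``` c
#include <arm_neon.h>

// Hypothetical helper: view 16 raw FP8 bytes as an opaque mfloat8 vector.
// vld1q_u8 performs a plain byte load; vreinterpretq_mf8_u8 changes only
// the type, not the bits (NOP at the instruction level).
static inline mfloat8x16_t load_fp8_bytes(const uint8_t *raw)
{
    uint8x16_t bytes = vld1q_u8(raw);
    return vreinterpretq_mf8_u8(bytes);
}
```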
Crypto @@ -4571,27 +4571,27 @@ float32x4_t vbfmlalbq_laneq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b, _ float32x4_t vbfmlaltq_lane_f32(float32x4_t r, bfloat16x8_t a, bfloat16x4_t b, __builtin_constant_p(lane)) r -> Vd.4S;a -> Vn.8H;b -> Vm.4H;0 <= lane <= 3 BFMLALT Vd.4S,Vn.8H,Vm.H[lane] Vd.4S -> result A32/A64 float32x4_t vbfmlaltq_laneq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b, __builtin_constant_p(lane)) r -> Vd.4S;a -> Vn.8H;b -> Vm.8H;0 <= lane <= 7 BFMLALT Vd.4S,Vn.8H,Vm.H[lane] Vd.4S -> result A32/A64
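The modal FP8 intrinsics renamed in the hunk below all thread a trailing `fpm_t` mode argument through to the hardware. A hedged sketch of how one of them might be invoked, assuming FEAT_FP8DOT4 hardware and the `fpm` helper intrinsics described in the main ACLE document (`__arm_fpm_init` and the `__arm_set_fpm_*` setters are quoted from that spec, not from this patch, so treat the exact names as assumptions):

``` c
#include <arm_neon.h>

// Hedged sketch: FP8 dot product accumulating into single precision.
static inline float32x4_t fp8_e4m3_dot(float32x4_t acc,
                                       mfloat8x16_t vn, mfloat8x16_t vm)
{
    fpm_t fpm = __arm_fpm_init();                         // all-default mode word
    fpm = __arm_set_fpm_src1_format(fpm, __ARM_FPM_E4M3); // vn holds E4M3 values
    fpm = __arm_set_fpm_src2_format(fpm, __ARM_FPM_E4M3); // vm holds E4M3 values
    return vdotq_f32_mf8_fpm(acc, vn, vm, fpm);           // FDOT Vd.4S,Vn.16B,Vm.16B
}
```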
Modal 8-bit floating-point intrinsics
-bfloat16x8_t vcvt1_bf16_fm8_fpm(floatm8x8_t vn, fpm_t fpm) vn -> Vn.8B BF1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
-bfloat16x8_t vcvt1_low_bf16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.8B BF1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
-bfloat16x8_t vcvt2_bf16_fm8_fpm(floatm8x8_t vn, fpm_t fpm) vn -> Vn.8B BF2CVTL Vd.8H,Vn.8B Vd.8H -> result A64
-bfloat16x8_t vcvt2_low_bf16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.8B BF2CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+bfloat16x8_t vcvt1_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) vn -> Vn.8B BF1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+bfloat16x8_t vcvt1_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.8B BF1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+bfloat16x8_t vcvt2_bf16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) vn -> Vn.8B BF2CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+bfloat16x8_t vcvt2_low_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.8B BF2CVTL Vd.8H,Vn.8B Vd.8H -> result A64

-bfloat16x8_t vcvt1_high_bf16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.16B BF1CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64
-bfloat16x8_t vcvt2_high_bf16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.16B BF2CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64
+bfloat16x8_t vcvt1_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.16B BF1CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64
+bfloat16x8_t vcvt2_high_bf16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.16B BF2CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64

-float16x8_t vcvt1_bf16_fm8_fpm(floatm8x8_t vn, fpm_t fpm) vn -> Vn.8B F1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
-float16x8_t vcvt1_low_f16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.8B F1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
-float16x8_t vcvt2_f16_fm8_fpm(floatm8x8_t vn, fpm_t fpm) vn -> Vn.8B F2CVTL Vd.8H,Vn.8B Vd.8H -> result A64
-float16x8_t vcvt2_low_f16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.8B F2CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+float16x8_t vcvt1_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) vn -> Vn.8B F1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+float16x8_t vcvt1_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.8B F1CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+float16x8_t vcvt2_f16_mf8_fpm(mfloat8x8_t vn, fpm_t fpm) vn -> Vn.8B F2CVTL Vd.8H,Vn.8B Vd.8H -> result A64
+float16x8_t vcvt2_low_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.8B F2CVTL Vd.8H,Vn.8B Vd.8H -> result A64

-float16x8_t vcvt1_high_f16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.16B F1CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64
-float16x8_t vcvt2_high_f16_fm8_fpm(floatm8x16_t vn, fpm_t fpm) vn -> Vn.16B F2CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64
+float16x8_t vcvt1_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.16B F1CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64
+float16x8_t vcvt2_high_f16_mf8_fpm(mfloat8x16_t vn, fpm_t fpm) vn -> Vn.16B F2CVTL2 Vd.8H,Vn.16B Vd.8H -> result A64

-floatm8x8_t vcvt_fm8_f32_fpm(float32x4_t vn, float32x4_t vm, fpm_t fpm) vn -> Vn.4S;vm -> Vm.4S FCVTN Vd.8B, Vn.4S, Vm.4S Vd.8B -> result A64
-floatm8x16_t vcvt_high_f32_fpm(floatm8x8_t vd, float32x4_t vn, float32x4_t vm, fpm_t fpm) vn -> Vn.4S;vm -> Vm.4S FCVTN2 Vd.16B, Vn.4S, Vm.4S Vd.16B -> result A64
+mfloat8x8_t vcvt_mf8_f32_fpm(float32x4_t vn, float32x4_t vm, fpm_t fpm) vn -> Vn.4S;vm -> Vm.4S FCVTN Vd.8B, Vn.4S, Vm.4S Vd.8B -> result A64
+mfloat8x16_t vcvt_high_f32_fpm(mfloat8x8_t vd, float32x4_t vn, float32x4_t vm, fpm_t fpm) vn -> Vn.4S;vm -> Vm.4S FCVTN2 Vd.16B, Vn.4S, Vm.4S Vd.16B -> result A64

-floatm8x8_t vcvt_fm8_f16_fpm(float16x4_t vn, float16x4_t vm, fpm_t fpm) vn -> Vn.4H;vm -> Vm.4H FCVTN Vd.8B, Vn.4H, Vm.4H Vd.8B ->
result A64
-floatm8x16_t vcvtq_fm8_f16_fpm(float16x8_t vn, float16x8_t vm, fpm_t fpm) vn -> Vn.8H;vm -> Vm.8H FCVTN Vd.16B, Vn.8H, Vm.8H Vd.16B -> result A64
+mfloat8x8_t vcvt_mf8_f16_fpm(float16x4_t vn, float16x4_t vm, fpm_t fpm) vn -> Vn.4H;vm -> Vm.4H FCVTN Vd.8B, Vn.4H, Vm.4H Vd.8B -> result A64
+mfloat8x16_t vcvtq_mf8_f16_fpm(float16x8_t vn, float16x8_t vm, fpm_t fpm) vn -> Vn.8H;vm -> Vm.8H FCVTN Vd.16B, Vn.8H, Vm.8H Vd.16B -> result A64

float16x4_t vscale_f16(float16x4_t vn, int16x4_t vm) vn -> Vn.4H;vm -> Vm.4H FSCALE Vd.4H, Vn.4H, Vm.4H Vd.4H -> result A64
float16x8_t vscaleq_f16(float16x8_t vn, int16x8_t vm) vn -> Vn.8H;vm -> Vm.8H FSCALE Vd.8H, Vn.8H, Vm.8H Vd.8H -> result A64
@@ -4599,40 +4599,40 @@ float32x2_t vscale_f32(float32x2_t vn, int32x2_t vm) vn -> Vn.2S;vm -> Vm.2S FSC
float32x4_t vscaleq_f32(float32x4_t vn, int32x4_t vm) vn -> Vn.4S;vm -> Vm.4S FSCALE Vd.4S, Vn.4S, Vm.4S Vd.4S -> result A64
float64x2_t vscaleq_f64(float64x2_t vn, int64x2_t vm) vn -> Vn.2D;vm -> Vm.2D FSCALE Vd.2D, Vn.2D, Vm.2D Vd.2D -> result A64
-float32x2_t vdot_f32_fm8_fpm(float32x2_t vd, floatm8x8_t vn, floatm8x8_t vm, fpm_t fpm) vd -> Vd.2S;vn -> Vn.8B;vm -> Vm.8B FDOT Vd.2S, Vn.8B, Vm.8B Vd.2S -> result A64
-float32x4_t vdotq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FDOT Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64
+float32x2_t vdot_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t vm, fpm_t fpm) vd -> Vd.2S;vn -> Vn.8B;vm -> Vm.8B FDOT Vd.2S, Vn.8B, Vm.8B Vd.2S -> result A64
+float32x4_t vdotq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FDOT Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64

-float32x2_t vdot_lane_f32_fm8_fpm(float32x2_t vd, floatm8x8_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.2S;vn -> Vn.8B;vm -> Vm.4B;0 <= lane <= 1 FDOT Vd.2S, Vn.8B, Vm.4B[lane] Vd.2S -> result A64
-float32x2_t vdot_laneq_f32_fm8_fpm(float32x2_t vd, floatm8x8_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.2S;vn -> Vn.16B;vm -> Vm.4B;0 <= lane <= 3 FDOT Vd.2S, Vn.8B, Vm.4B[lane] Vd.2S -> result A64
-float32x4_t vdotq_lane_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.8B;vm -> Vm.4B;0 <= lane <= 1 FDOT Vd.4S, Vn.8B, Vm.4B[lane] Vd.4S -> result A64
-float32x4_t vdotq_laneq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16;vm -> Vm.4B;0 <= lane <= 3 FDOT Vd.4S, Vn.8B, Vm.4B[lane] Vd.4SB -> result A64
+float32x2_t vdot_lane_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.2S;vn -> Vn.8B;vm -> Vm.4B;0 <= lane <= 1 FDOT Vd.2S, Vn.8B, Vm.4B[lane] Vd.2S -> result A64
+float32x2_t vdot_laneq_f32_mf8_fpm(float32x2_t vd, mfloat8x8_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.2S;vn -> Vn.8B;vm -> Vm.4B;0 <= lane <= 3 FDOT Vd.2S, Vn.8B, Vm.4B[lane] Vd.2S -> result A64
+float32x4_t vdotq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.4B;0 <= lane <= 1 FDOT Vd.4S, Vn.16B, Vm.4B[lane] Vd.4S -> result A64
+float32x4_t vdotq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.4B;0 <= lane <= 3 FDOT Vd.4S, Vn.16B, Vm.4B[lane] Vd.4S -> result
A64 -float16x4_t vdot_f16_fm8_fpm(float16x4_t vd, floatm8x8_t vn, floatm8x8_t vm, fpm_t fpm) vd -> Vd.4H;vn -> Vn.8B;vm -> Vm.8B FDOT Vd.4H, Vn.8B, Vm.8B Vd.4H -> result A64 -float16x8_t vdotq_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.16B FDOT Vd.8H, Vn.16B, Vm.16B Vd.8H -> result A64 +float16x4_t vdot_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t vm, fpm_t fpm) vd -> Vd.4H;vn -> Vn.8B;vm -> Vm.8B FDOT Vd.4H, Vn.8B, Vm.8B Vd.4H -> result A64 +float16x8_t vdotq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.16B FDOT Vd.8H, Vn.16B, Vm.16B Vd.8H -> result A64 -float16x4_t vdot_lane_f16_fm8_fpm(float16x4_t vd, floatm8x8_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4H;vn -> Vn.8B;vm -> Vm.2B;0 <= lane <= 3 FDOT Vd.4H, Vn.8B, Vm.2B[lane] Vd.4H -> result A64 -float16x4_t vdot_laneq_f16_fm8_fpm(float16x4_t vd, floatm8x8_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4H;vn -> Vn.8B;vm -> Vm.2B;0 <= lane <= 7 FDOT Vd.4H, Vn.8B, Vm.2B[lane] Vd.4H -> result A64 -float16x8_t vdotq_lane_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.2B;0 <= lane <= 3 FDOT Vd.8H, Vn.16B, Vm.2B[lane] Vd.8H -> result A64 -float16x8_t vdotq_laneq_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.2B;0 <= lane <= 7 FDOT Vd.8H, Vn.16B, Vm.2B[lane] Vd.8H -> result A64 +float16x4_t vdot_lane_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4H;vn -> Vn.8B;vm -> Vm.2B;0 <= lane <= 3 FDOT Vd.4H, Vn.8B, Vm.2B[lane] Vd.4H -> result A64 +float16x4_t vdot_laneq_f16_mf8_fpm(float16x4_t vd, mfloat8x8_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4H;vn -> Vn.8B;vm -> Vm.2B;0 <= lane <= 7 FDOT Vd.4H, Vn.8B, Vm.2B[lane] Vd.4H -> result A64 +float16x8_t vdotq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.2B;0 <= lane <= 3 FDOT Vd.8H, Vn.16B, Vm.2B[lane] Vd.8H -> result A64 +float16x8_t vdotq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.2B;0 <= lane <= 7 FDOT Vd.8H, Vn.16B, Vm.2B[lane] Vd.8H -> result A64 -float16x8_t vmlalbq_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.16B FMLALB Vd.8H, Vn.16B, Vm.16B Vd.8H -> result A64 -float16x8_t vmlaltq_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.16B FMLALT Vd.8H, Vn.16B, Vm.16B Vd.8H -> result A64 +float16x8_t vmlalbq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.16B FMLALB Vd.8H, Vn.16B, Vm.16B Vd.8H -> result A64 +float16x8_t vmlaltq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.16B FMLALT Vd.8H, Vn.16B, Vm.16B Vd.8H -> result A64 -float16x8_t vmlalbq_lane_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 7 FMLALB Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 -float16x8_t vmlalbq_laneq_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x16_t 
vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 15 FMLALB Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 -float16x8_t vmlaltq_lane_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 7 FMLALT Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 -float16x8_t vmlaltq_laneq_f16_fm8_fpm(float16x8_t vd, floatm8x16_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 15 FMLALT Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 +float16x8_t vmlalbq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 7 FMLALB Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 +float16x8_t vmlalbq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 15 FMLALB Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 +float16x8_t vmlaltq_lane_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 7 FMLALT Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 +float16x8_t vmlaltq_laneq_f16_mf8_fpm(float16x8_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.8H;vn -> Vn.16B;vm -> Vm.B;0 <= lane <= 15 FMLALT Vd.8H, Vn.16B, Vm.B[lane] Vd.8H -> result A64 -float32x4_t vmlallbbq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLBB Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 -float32x4_t vmlallbtq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLBT Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 -float32x4_t vmlalltbq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLTB Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 -float32x4_t vmlallttq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLTT Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 +float32x4_t vmlallbbq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLBB Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 +float32x4_t vmlallbtq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLBT Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 +float32x4_t vmlalltbq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLTB Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 +float32x4_t vmlallttq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B;vm -> Vm.16B FMLALLTT Vd.4S, Vn.16B, Vm.16B Vd.4S -> result A64 -float32x4_t vmlallbbq_lane_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64 -float32x4_t vmlallbbq_laneq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64 -float32x4_t vmlallbtq_lane_f32_fm8_fpm(float32x4_t vd, 
floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
-float32x4_t vmlallbtq_laneq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
-float32x4_t vmlalltbq_lane_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
-float32x4_t vmlalltbq_laneq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
-float32x4_t vmlallttq_lane_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
-float32x4_t vmlallttq_laneq_f32_fm8_fpm(float32x4_t vd, floatm8x16_t vn, floatm8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vm -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlallbbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlallbbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlallbtq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLBT Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlallbtq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLBT Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlalltbq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLTB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlalltbq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLTB Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlallttq_lane_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x8_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 7 FMLALLTT Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
+float32x4_t vmlallttq_laneq_f32_mf8_fpm(float32x4_t vd, mfloat8x16_t vn, mfloat8x16_t vm, __builtin_constant_p(lane), fpm_t fpm) vd -> Vd.4S;vn -> Vn.16B; vm -> Vm.B; 0 <= lane <= 15 FMLALLTT Vd.4S, Vn.16B, Vm.B[lane] Vd.4S -> result A64
diff --git a/tools/intrinsic_db/advsimd_classification.csv b/tools/intrinsic_db/advsimd_classification.csv
index 739323cd..055bd092 100644
--- a/tools/intrinsic_db/advsimd_classification.csv
+++ b/tools/intrinsic_db/advsimd_classification.csv
@@ -1843,8 +1843,8 @@ vcopy_lane_p8 Vector
manipulation|Copy vector lane vcopyq_lane_p8 Vector manipulation|Copy vector lane vcopy_lane_p16 Vector manipulation|Copy vector lane vcopyq_lane_p16 Vector manipulation|Copy vector lane -vcopy_lane_fm8 Vector manipulation|Copy vector lane -vcopyq_lane_fm8 Vector manipulation|Copy vector lane +vcopy_lane_mf8 Vector manipulation|Copy vector lane +vcopyq_lane_mf8 Vector manipulation|Copy vector lane vcopy_laneq_s8 Vector manipulation|Copy vector lane vcopyq_laneq_s8 Vector manipulation|Copy vector lane vcopy_laneq_s16 Vector manipulation|Copy vector lane @@ -1871,8 +1871,8 @@ vcopy_laneq_p8 Vector manipulation|Copy vector lane vcopyq_laneq_p8 Vector manipulation|Copy vector lane vcopy_laneq_p16 Vector manipulation|Copy vector lane vcopyq_laneq_p16 Vector manipulation|Copy vector lane -vcopy_laneq_fm8 Vector manipulation|Copy vector lane -vcopyq_laneq_fm8 Vector manipulation|Copy vector lane +vcopy_laneq_mf8 Vector manipulation|Copy vector lane +vcopyq_laneq_mf8 Vector manipulation|Copy vector lane vrbit_s8 Vector manipulation|Reverse bits within elements vrbitq_s8 Vector manipulation|Reverse bits within elements vrbit_u8 Vector manipulation|Reverse bits within elements @@ -1893,7 +1893,7 @@ vcreate_f32 Vector manipulation|Create vector vcreate_p8 Vector manipulation|Create vector vcreate_p16 Vector manipulation|Create vector vcreate_f64 Vector manipulation|Create vector -vcreate_fm8 Vector manipulation|Create vector +vcreate_mf8 Vector manipulation|Create vector vdup_n_s8 Vector manipulation|Set all lanes to the same value vdupq_n_s8 Vector manipulation|Set all lanes to the same value vdup_n_s16 Vector manipulation|Set all lanes to the same value @@ -1920,8 +1920,8 @@ vdup_n_p16 Vector manipulation|Set all lanes to the same value vdupq_n_p16 Vector manipulation|Set all lanes to the same value vdup_n_f64 Vector manipulation|Set all lanes to the same value vdupq_n_f64 Vector manipulation|Set all lanes to the same value -vdup_n_fm8 Vector manipulation|Set all lanes to the same value -vdupq_n_fm8 Vector manipulation|Set all lanes to the same value +vdup_n_mf8 Vector manipulation|Set all lanes to the same value +vdupq_n_mf8 Vector manipulation|Set all lanes to the same value vmov_n_s8 Vector manipulation|Set all lanes to the same value vmovq_n_s8 Vector manipulation|Set all lanes to the same value vmov_n_s16 Vector manipulation|Set all lanes to the same value @@ -1946,8 +1946,8 @@ vmov_n_p16 Vector manipulation|Set all lanes to the same value vmovq_n_p16 Vector manipulation|Set all lanes to the same value vmov_n_f64 Vector manipulation|Set all lanes to the same value vmovq_n_f64 Vector manipulation|Set all lanes to the same value -vmov_n_fm8 Vector manipulation|Set all lanes to the same value -vmovq_n_fm8 Vector manipulation|Set all lanes to the same value +vmov_n_mf8 Vector manipulation|Set all lanes to the same value +vmovq_n_mf8 Vector manipulation|Set all lanes to the same value vdup_lane_s8 Vector manipulation|Set all lanes to the same value vdupq_lane_s8 Vector manipulation|Set all lanes to the same value vdup_lane_s16 Vector manipulation|Set all lanes to the same value @@ -1974,8 +1974,8 @@ vdup_lane_p16 Vector manipulation|Set all lanes to the same value vdupq_lane_p16 Vector manipulation|Set all lanes to the same value vdup_lane_f64 Vector manipulation|Set all lanes to the same value vdupq_lane_f64 Vector manipulation|Set all lanes to the same value -vdup_lane_fm8 Vector manipulation|Set all lanes to the same value -vdupq_lane_fm8 Vector manipulation|Set all lanes to the same value 
+vdup_lane_mf8 Vector manipulation|Set all lanes to the same value +vdupq_lane_mf8 Vector manipulation|Set all lanes to the same value vdup_laneq_s8 Vector manipulation|Set all lanes to the same value vdupq_laneq_s8 Vector manipulation|Set all lanes to the same value vdup_laneq_s16 Vector manipulation|Set all lanes to the same value @@ -2002,8 +2002,8 @@ vdup_laneq_p16 Vector manipulation|Set all lanes to the same value vdupq_laneq_p16 Vector manipulation|Set all lanes to the same value vdup_laneq_f64 Vector manipulation|Set all lanes to the same value vdupq_laneq_f64 Vector manipulation|Set all lanes to the same value -vdup_laneq_fm8 Vector manipulation|Set all lanes to the same value -vdupq_laneq_fm8 Vector manipulation|Set all lanes to the same value +vdup_laneq_mf8 Vector manipulation|Set all lanes to the same value +vdupq_laneq_mf8 Vector manipulation|Set all lanes to the same value vcombine_s8 Vector manipulation|Combine vectors vcombine_s16 Vector manipulation|Combine vectors vcombine_s32 Vector manipulation|Combine vectors @@ -2018,7 +2018,7 @@ vcombine_f32 Vector manipulation|Combine vectors vcombine_p8 Vector manipulation|Combine vectors vcombine_p16 Vector manipulation|Combine vectors vcombine_f64 Vector manipulation|Combine vectors -vcombine_fm8 Vector manipulation|Combine vectors +vcombine_mf8 Vector manipulation|Combine vectors vget_high_s8 Vector manipulation|Split vectors vget_high_s16 Vector manipulation|Split vectors vget_high_s32 Vector manipulation|Split vectors @@ -2033,7 +2033,7 @@ vget_high_f32 Vector manipulation|Split vectors vget_high_p8 Vector manipulation|Split vectors vget_high_p16 Vector manipulation|Split vectors vget_high_f64 Vector manipulation|Split vectors -vget_high_fm8 Vector manipulation|Split vectors +vget_high_mf8 Vector manipulation|Split vectors vget_low_s8 Vector manipulation|Split vectors vget_low_s16 Vector manipulation|Split vectors vget_low_s32 Vector manipulation|Split vectors @@ -2048,7 +2048,7 @@ vget_low_f32 Vector manipulation|Split vectors vget_low_p8 Vector manipulation|Split vectors vget_low_p16 Vector manipulation|Split vectors vget_low_f64 Vector manipulation|Split vectors -vget_low_fm8 Vector manipulation|Split vectors +vget_low_mf8 Vector manipulation|Split vectors vdupb_lane_s8 Vector manipulation|Extract one element from vector vduph_lane_s16 Vector manipulation|Extract one element from vector vdups_lane_s32 Vector manipulation|Extract one element from vector @@ -2061,7 +2061,7 @@ vdups_lane_f32 Vector manipulation|Extract one element from vector vdupd_lane_f64 Vector manipulation|Extract one element from vector vdupb_lane_p8 Vector manipulation|Extract one element from vector vduph_lane_p16 Vector manipulation|Extract one element from vector -vdupb_lane_fm8 Vector manipulation|Extract one element from vector +vdupb_lane_mf8 Vector manipulation|Extract one element from vector vdupb_laneq_s8 Vector manipulation|Extract one element from vector vduph_laneq_s16 Vector manipulation|Extract one element from vector vdups_laneq_s32 Vector manipulation|Extract one element from vector @@ -2074,7 +2074,7 @@ vdups_laneq_f32 Vector manipulation|Extract one element from vector vdupd_laneq_f64 Vector manipulation|Extract one element from vector vdupb_laneq_p8 Vector manipulation|Extract one element from vector vduph_laneq_p16 Vector manipulation|Extract one element from vector -vdupb_laneq_fm8 Vector manipulation|Extract one element from vector +vdupb_laneq_mf8 Vector manipulation|Extract one element from vector vld1_s8 Load|Stride vld1q_s8 
Load|Stride vld1_s16 Load|Stride @@ -2103,8 +2103,8 @@ vld1_p16 Load|Stride vld1q_p16 Load|Stride vld1_f64 Load|Stride vld1q_f64 Load|Stride -vld1_fm8 Load|Stride -vld1q_fm8 Load|Stride +vld1_mf8 Load|Stride +vld1q_mf8 Load|Stride vld1_lane_s8 Load|Stride vld1q_lane_s8 Load|Stride vld1_lane_s16 Load|Stride @@ -2133,8 +2133,8 @@ vld1_lane_p16 Load|Stride vld1q_lane_p16 Load|Stride vld1_lane_f64 Load|Stride vld1q_lane_f64 Load|Stride -vld1_lane_fm8 Load|Stride -vld1q_lane_fm8 Load|Stride +vld1_lane_mf8 Load|Stride +vld1q_lane_mf8 Load|Stride vldap1q_lane_u64 Load|Stride vldap1q_lane_s64 Load|Stride vldap1q_lane_f64 Load|Stride @@ -2179,8 +2179,8 @@ vld1_dup_p16 Load|Stride vld1q_dup_p16 Load|Stride vld1_dup_f64 Load|Stride vld1q_dup_f64 Load|Stride -vld1_dup_fm8 Load|Stride -vld1q_dup_fm8 Load|Stride +vld1_dup_mf8 Load|Stride +vld1q_dup_mf8 Load|Stride vst1_s8 Store|Stride vst1q_s8 Store|Stride vst1_s16 Store|Stride @@ -2209,8 +2209,8 @@ vst1_p16 Store|Stride vst1q_p16 Store|Stride vst1_f64 Store|Stride vst1q_f64 Store|Stride -vst1_fm8 Store|Stride -vst1q_fm8 Store|Stride +vst1_mf8 Store|Stride +vst1q_mf8 Store|Stride vst1_lane_s8 Store|Stride vst1q_lane_s8 Store|Stride vst1_lane_s16 Store|Stride @@ -2239,8 +2239,8 @@ vst1_lane_p16 Store|Stride vst1q_lane_p16 Store|Stride vst1_lane_f64 Store|Stride vst1q_lane_f64 Store|Stride -vst1_lane_fm8 Store|Stride -vst1q_lane_fm8 Store|Stride +vst1_lane_mf8 Store|Stride +vst1q_lane_mf8 Store|Stride vld2_s8 Load|Stride vld2q_s8 Load|Stride vld2_s16 Load|Stride @@ -2269,8 +2269,8 @@ vld2q_u64 Load|Stride vld2q_p64 Load|Stride vld2_f64 Load|Stride vld2q_f64 Load|Stride -vld2_fm8 Load|Stride -vld2q_fm8 Load|Stride +vld2_mf8 Load|Stride +vld2q_mf8 Load|Stride vld3_s8 Load|Stride vld3q_s8 Load|Stride vld3_s16 Load|Stride @@ -2299,8 +2299,8 @@ vld3q_u64 Load|Stride vld3q_p64 Load|Stride vld3_f64 Load|Stride vld3q_f64 Load|Stride -vld3_fm8 Load|Stride -vld3q_fm8 Load|Stride +vld3_mf8 Load|Stride +vld3q_mf8 Load|Stride vld4_s8 Load|Stride vld4q_s8 Load|Stride vld4_s16 Load|Stride @@ -2329,8 +2329,8 @@ vld4q_u64 Load|Stride vld4q_p64 Load|Stride vld4_f64 Load|Stride vld4q_f64 Load|Stride -vld4_fm8 Load|Stride -vld4q_fm8 Load|Stride +vld4_mf8 Load|Stride +vld4q_mf8 Load|Stride vld2_dup_s8 Load|Stride vld2q_dup_s8 Load|Stride vld2_dup_s16 Load|Stride @@ -2359,8 +2359,8 @@ vld2q_dup_u64 Load|Stride vld2q_dup_p64 Load|Stride vld2_dup_f64 Load|Stride vld2q_dup_f64 Load|Stride -vld2_dup_fm8 Load|Stride -vld2q_dup_fm8 Load|Stride +vld2_dup_mf8 Load|Stride +vld2q_dup_mf8 Load|Stride vld3_dup_s8 Load|Stride vld3q_dup_s8 Load|Stride vld3_dup_s16 Load|Stride @@ -2389,8 +2389,8 @@ vld3q_dup_u64 Load|Stride vld3q_dup_p64 Load|Stride vld3_dup_f64 Load|Stride vld3q_dup_f64 Load|Stride -vld3_dup_fm8 Load|Stride -vld3q_dup_fm8 Load|Stride +vld3_dup_mf8 Load|Stride +vld3q_dup_mf8 Load|Stride vld4_dup_s8 Load|Stride vld4q_dup_s8 Load|Stride vld4_dup_s16 Load|Stride @@ -2419,8 +2419,8 @@ vld4q_dup_u64 Load|Stride vld4q_dup_p64 Load|Stride vld4_dup_f64 Load|Stride vld4q_dup_f64 Load|Stride -vld4_dup_fm8 Load|Stride -vld4q_dup_fm8 Load|Stride +vld4_dup_mf8 Load|Stride +vld4q_dup_mf8 Load|Stride vst2_s8 Store|Stride vst2q_s8 Store|Stride vst2_s16 Store|Stride @@ -2449,8 +2449,8 @@ vst2q_u64 Store|Stride vst2q_p64 Store|Stride vst2_f64 Store|Stride vst2q_f64 Store|Stride -vst2_fm8 Store|Stride -vst2q_fm8 Store|Stride +vst2_mf8 Store|Stride +vst2q_mf8 Store|Stride vst3_s8 Store|Stride vst3q_s8 Store|Stride vst3_s16 Store|Stride @@ -2479,8 +2479,8 @@ vst3q_u64 Store|Stride vst3q_p64 
Store|Stride vst3_f64 Store|Stride vst3q_f64 Store|Stride -vst3_fm8 Store|Stride -vst3q_fm8 Store|Stride +vst3_mf8 Store|Stride +vst3q_mf8 Store|Stride vst4_s8 Store|Stride vst4q_s8 Store|Stride vst4_s16 Store|Stride @@ -2509,8 +2509,8 @@ vst4q_u64 Store|Stride vst4q_p64 Store|Stride vst4_f64 Store|Stride vst4q_f64 Store|Stride -vst4_fm8 Store|Stride -vst4q_fm8 Store|Stride +vst4_mf8 Store|Stride +vst4q_mf8 Store|Stride vld2_lane_s16 Load|Stride vld2q_lane_s16 Load|Stride vld2_lane_s32 Load|Stride @@ -2539,8 +2539,8 @@ vld2_lane_p64 Load|Stride vld2q_lane_p64 Load|Stride vld2_lane_f64 Load|Stride vld2q_lane_f64 Load|Stride -vld2_lane_fm8 Load|Stride -vld2q_lane_fm8 Load|Stride +vld2_lane_mf8 Load|Stride +vld2q_lane_mf8 Load|Stride vld3_lane_s16 Load|Stride vld3q_lane_s16 Load|Stride vld3_lane_s32 Load|Stride @@ -2569,8 +2569,8 @@ vld3_lane_p64 Load|Stride vld3q_lane_p64 Load|Stride vld3_lane_f64 Load|Stride vld3q_lane_f64 Load|Stride -vld3_lane_fm8 Load|Stride -vld3q_lane_fm8 Load|Stride +vld3_lane_mf8 Load|Stride +vld3q_lane_mf8 Load|Stride vld4_lane_s16 Load|Stride vld4q_lane_s16 Load|Stride vld4_lane_s32 Load|Stride @@ -2599,20 +2599,20 @@ vld4_lane_p64 Load|Stride vld4q_lane_p64 Load|Stride vld4_lane_f64 Load|Stride vld4q_lane_f64 Load|Stride -vld4_lane_fm8 Load|Stride -vld4q_lane_fm8 Load|Stride +vld4_lane_mf8 Load|Stride +vld4q_lane_mf8 Load|Stride vst2_lane_s8 Store|Stride vst2_lane_u8 Store|Stride vst2_lane_p8 Store|Stride -vst2_lane_fm8 Store|Stride +vst2_lane_mf8 Store|Stride vst3_lane_s8 Store|Stride vst3_lane_u8 Store|Stride vst3_lane_p8 Store|Stride -vst3_lane_fm8 Store|Stride +vst3_lane_mf8 Store|Stride vst4_lane_s8 Store|Stride vst4_lane_u8 Store|Stride vst4_lane_p8 Store|Stride -vst4_lane_fm8 Store|Stride +vst4_lane_mf8 Store|Stride vst2_lane_s16 Store|Stride vst2q_lane_s16 Store|Stride vst2_lane_s32 Store|Stride @@ -2638,7 +2638,7 @@ vst2_lane_p64 Store|Stride vst2q_lane_p64 Store|Stride vst2_lane_f64 Store|Stride vst2q_lane_f64 Store|Stride -vst2q_lane_fm8 Store|Stride +vst2q_lane_mf8 Store|Stride vst3_lane_s16 Store|Stride vst3q_lane_s16 Store|Stride vst3_lane_s32 Store|Stride @@ -2664,7 +2664,7 @@ vst3_lane_p64 Store|Stride vst3q_lane_p64 Store|Stride vst3_lane_f64 Store|Stride vst3q_lane_f64 Store|Stride -vst3q_lane_fm8 Store|Stride +vst3q_lane_mf8 Store|Stride vst4_lane_s16 Store|Stride vst4q_lane_s16 Store|Stride vst4_lane_s32 Store|Stride @@ -2690,7 +2690,7 @@ vst4_lane_p64 Store|Stride vst4q_lane_p64 Store|Stride vst4_lane_f64 Store|Stride vst4q_lane_f64 Store|Stride -vst4q_lane_fm8 Store|Stride +vst4q_lane_mf8 Store|Stride vst1_s8_x2 Store|Stride vst1q_s8_x2 Store|Stride vst1_s16_x2 Store|Stride @@ -2719,8 +2719,8 @@ vst1q_u64_x2 Store|Stride vst1q_p64_x2 Store|Stride vst1_f64_x2 Store|Stride vst1q_f64_x2 Store|Stride -vst1_fm8_x2 Store|Stride -vst1q_fm8_x2 Store|Stride +vst1_mf8_x2 Store|Stride +vst1q_mf8_x2 Store|Stride vst1_s8_x3 Store|Stride vst1q_s8_x3 Store|Stride vst1_s16_x3 Store|Stride @@ -2749,8 +2749,8 @@ vst1q_u64_x3 Store|Stride vst1q_p64_x3 Store|Stride vst1_f64_x3 Store|Stride vst1q_f64_x3 Store|Stride -vst1_fm8_x3 Store|Stride -vst1q_fm8_x3 Store|Stride +vst1_mf8_x3 Store|Stride +vst1q_mf8_x3 Store|Stride vst1_s8_x4 Store|Stride vst1q_s8_x4 Store|Stride vst1_s16_x4 Store|Stride @@ -2779,8 +2779,8 @@ vst1q_u64_x4 Store|Stride vst1q_p64_x4 Store|Stride vst1_f64_x4 Store|Stride vst1q_f64_x4 Store|Stride -vst1_fm8_x4 Store|Stride -vst1q_fm8_x4 Store|Stride +vst1_mf8_x4 Store|Stride +vst1q_mf8_x4 Store|Stride vld1_s8_x2 Load|Stride vld1q_s8_x2 
Load|Stride vld1_s16_x2 Load|Stride @@ -2809,8 +2809,8 @@ vld1q_u64_x2 Load|Stride vld1q_p64_x2 Load|Stride vld1_f64_x2 Load|Stride vld1q_f64_x2 Load|Stride -vld1_fm8_x2 Load|Stride -vld1q_fm8_x2 Load|Stride +vld1_mf8_x2 Load|Stride +vld1q_mf8_x2 Load|Stride vld1_s8_x3 Load|Stride vld1q_s8_x3 Load|Stride vld1_s16_x3 Load|Stride @@ -2839,8 +2839,8 @@ vld1q_u64_x3 Load|Stride vld1q_p64_x3 Load|Stride vld1_f64_x3 Load|Stride vld1q_f64_x3 Load|Stride -vld1_fm8_x3 Load|Stride -vld1q_fm8_x3 Load|Stride +vld1_mf8_x3 Load|Stride +vld1q_mf8_x3 Load|Stride vld1_s8_x4 Load|Stride vld1q_s8_x4 Load|Stride vld1_s16_x4 Load|Stride @@ -2869,8 +2869,8 @@ vld1q_u64_x4 Load|Stride vld1q_p64_x4 Load|Stride vld1_f64_x4 Load|Stride vld1q_f64_x4 Load|Stride -vld1_fm8_x4 Load|Stride -vld1q_fm8_x4 Load|Stride +vld1_mf8_x4 Load|Stride +vld1q_mf8_x4 Load|Stride vpadd_s8 Vector arithmetic|Pairwise arithmetic|Pairwise addition vpadd_s16 Vector arithmetic|Pairwise arithmetic|Pairwise addition vpadd_s32 Vector arithmetic|Pairwise arithmetic|Pairwise addition @@ -3051,8 +3051,8 @@ vext_p8 Vector manipulation|Extract vector from a pair of vectors vextq_p8 Vector manipulation|Extract vector from a pair of vectors vext_p16 Vector manipulation|Extract vector from a pair of vectors vextq_p16 Vector manipulation|Extract vector from a pair of vectors -vext_fm8 Vector manipulation|Extract vector from a pair of vectors -vextq_fm8 Vector manipulation|Extract vector from a pair of vectors +vext_mf8 Vector manipulation|Extract vector from a pair of vectors +vextq_mf8 Vector manipulation|Extract vector from a pair of vectors vrev64_s8 Vector manipulation|Reverse elements vrev64q_s8 Vector manipulation|Reverse elements vrev64_s16 Vector manipulation|Reverse elements @@ -3071,8 +3071,8 @@ vrev64_p8 Vector manipulation|Reverse elements vrev64q_p8 Vector manipulation|Reverse elements vrev64_p16 Vector manipulation|Reverse elements vrev64q_p16 Vector manipulation|Reverse elements -vrev64_fm8 Vector manipulation|Reverse elements -vrev64q_fm8 Vector manipulation|Reverse elements +vrev64_mf8 Vector manipulation|Reverse elements +vrev64q_mf8 Vector manipulation|Reverse elements vrev32_s8 Vector manipulation|Reverse elements vrev32q_s8 Vector manipulation|Reverse elements vrev32_s16 Vector manipulation|Reverse elements @@ -3085,16 +3085,16 @@ vrev32_p8 Vector manipulation|Reverse elements vrev32q_p8 Vector manipulation|Reverse elements vrev32_p16 Vector manipulation|Reverse elements vrev32q_p16 Vector manipulation|Reverse elements -vrev32_fm8 Vector manipulation|Reverse elements -vrev32q_fm8 Vector manipulation|Reverse elements +vrev32_mf8 Vector manipulation|Reverse elements +vrev32q_mf8 Vector manipulation|Reverse elements vrev16_s8 Vector manipulation|Reverse elements vrev16q_s8 Vector manipulation|Reverse elements vrev16_u8 Vector manipulation|Reverse elements vrev16q_u8 Vector manipulation|Reverse elements vrev16_p8 Vector manipulation|Reverse elements vrev16q_p8 Vector manipulation|Reverse elements -vrev16_fm8 Vector manipulation|Reverse elements -vrev16q_fm8 Vector manipulation|Reverse elements +vrev16_mf8 Vector manipulation|Reverse elements +vrev16q_mf8 Vector manipulation|Reverse elements vzip1_s8 Vector manipulation|Zip elements vzip1q_s8 Vector manipulation|Zip elements vzip1_s16 Vector manipulation|Zip elements @@ -3117,8 +3117,8 @@ vzip1_p8 Vector manipulation|Zip elements vzip1q_p8 Vector manipulation|Zip elements vzip1_p16 Vector manipulation|Zip elements vzip1q_p16 Vector manipulation|Zip elements -vzip1_fm8 Vector 
manipulation|Zip elements -vzip1q_fm8 Vector manipulation|Zip elements +vzip1_mf8 Vector manipulation|Zip elements +vzip1q_mf8 Vector manipulation|Zip elements vzip2_s8 Vector manipulation|Zip elements vzip2q_s8 Vector manipulation|Zip elements vzip2_s16 Vector manipulation|Zip elements @@ -3141,8 +3141,8 @@ vzip2_p8 Vector manipulation|Zip elements vzip2q_p8 Vector manipulation|Zip elements vzip2_p16 Vector manipulation|Zip elements vzip2q_p16 Vector manipulation|Zip elements -vzip2_fm8 Vector manipulation|Zip elements -vzip2q_fm8 Vector manipulation|Zip elements +vzip2_mf8 Vector manipulation|Zip elements +vzip2q_mf8 Vector manipulation|Zip elements vuzp1_s8 Vector manipulation|Unzip elements vuzp1q_s8 Vector manipulation|Unzip elements vuzp1_s16 Vector manipulation|Unzip elements @@ -3165,8 +3165,8 @@ vuzp1_p8 Vector manipulation|Unzip elements vuzp1q_p8 Vector manipulation|Unzip elements vuzp1_p16 Vector manipulation|Unzip elements vuzp1q_p16 Vector manipulation|Unzip elements -vuzp1_fm8 Vector manipulation|Unzip elements -vuzp1q_fm8 Vector manipulation|Unzip elements +vuzp1_mf8 Vector manipulation|Unzip elements +vuzp1q_mf8 Vector manipulation|Unzip elements vuzp2_s8 Vector manipulation|Unzip elements vuzp2q_s8 Vector manipulation|Unzip elements vuzp2_s16 Vector manipulation|Unzip elements @@ -3189,8 +3189,8 @@ vuzp2_p8 Vector manipulation|Unzip elements vuzp2q_p8 Vector manipulation|Unzip elements vuzp2_p16 Vector manipulation|Unzip elements vuzp2q_p16 Vector manipulation|Unzip elements -vuzp2_fm8 Vector manipulation|Unzip elements -vuzp2q_fm8 Vector manipulation|Unzip elements +vuzp2_mf8 Vector manipulation|Unzip elements +vuzp2q_mf8 Vector manipulation|Unzip elements vtrn1_s8 Vector manipulation|Transpose elements vtrn1q_s8 Vector manipulation|Transpose elements vtrn1_s16 Vector manipulation|Transpose elements @@ -3213,8 +3213,8 @@ vtrn1_p8 Vector manipulation|Transpose elements vtrn1q_p8 Vector manipulation|Transpose elements vtrn1_p16 Vector manipulation|Transpose elements vtrn1q_p16 Vector manipulation|Transpose elements -vtrn1_fm8 Vector manipulation|Transpose elements -vtrn1q_fm8 Vector manipulation|Transpose elements +vtrn1_mf8 Vector manipulation|Transpose elements +vtrn1q_mf8 Vector manipulation|Transpose elements vtrn2_s8 Vector manipulation|Transpose elements vtrn2q_s8 Vector manipulation|Transpose elements vtrn2_s16 Vector manipulation|Transpose elements @@ -3237,8 +3237,8 @@ vtrn2_p8 Vector manipulation|Transpose elements vtrn2q_p8 Vector manipulation|Transpose elements vtrn2_p16 Vector manipulation|Transpose elements vtrn2q_p16 Vector manipulation|Transpose elements -vtrn2_fm8 Vector manipulation|Transpose elements -vtrn2q_fm8 Vector manipulation|Transpose elements +vtrn2_mf8 Vector manipulation|Transpose elements +vtrn2q_mf8 Vector manipulation|Transpose elements vtbl1_s8 Table lookup|Table lookup vtbl1_u8 Table lookup|Table lookup vtbl1_p8 Table lookup|Table lookup @@ -3367,8 +3367,8 @@ vsetq_lane_p8 Vector manipulation|Set vector lane vsetq_lane_p16 Vector manipulation|Set vector lane vsetq_lane_f32 Vector manipulation|Set vector lane vsetq_lane_f64 Vector manipulation|Set vector lane -vset_lane_fm8 Vector manipulation|Set vector lane -vsetq_lane_fm8 Vector manipulation|Set vector lane +vset_lane_mf8 Vector manipulation|Set vector lane +vsetq_lane_mf8 Vector manipulation|Set vector lane vrecpxs_f32 Vector arithmetic|Reciprocal|Reciprocal exponent vrecpxd_f64 Vector arithmetic|Reciprocal|Reciprocal exponent vfma_n_f32 Scalar arithmetic|Fused multiply-accumulate by 
scalar @@ -3388,7 +3388,7 @@ vtrn_p16 Vector manipulation|Transpose elements vtrn_s32 Vector manipulation|Transpose elements vtrn_f32 Vector manipulation|Transpose elements vtrn_u32 Vector manipulation|Transpose elements -vtrn_fm8 Vector manipulation|Transpose elements +vtrn_mf8 Vector manipulation|Transpose elements vtrnq_s8 Vector manipulation|Transpose elements vtrnq_s16 Vector manipulation|Transpose elements vtrnq_s32 Vector manipulation|Transpose elements @@ -3398,7 +3398,7 @@ vtrnq_u16 Vector manipulation|Transpose elements vtrnq_u32 Vector manipulation|Transpose elements vtrnq_p8 Vector manipulation|Transpose elements vtrnq_p16 Vector manipulation|Transpose elements -vtrnq_fm8 Vector manipulation|Transpose elements +vtrnq_mf8 Vector manipulation|Transpose elements vzip_s8 Vector manipulation|Zip elements vzip_s16 Vector manipulation|Zip elements vzip_u8 Vector manipulation|Zip elements @@ -3821,10 +3821,10 @@ vreinterpretq_u64_p128 Data type conversion|Reinterpret casts vreinterpretq_s64_p128 Data type conversion|Reinterpret casts vreinterpretq_f64_p128 Data type conversion|Reinterpret casts vreinterpretq_f16_p128 Data type conversion|Reinterpret casts -vreinterpret_fm8_u8 Data type conversion|Reinterpret casts -vreinterpretq_fm8_u8 Data type conversion|Reinterpret casts -vreinterpret_u8_fm8 Data type conversion|Reinterpret casts -vreinterpretq_u8_fm8 Data type conversion|Reinterpret casts +vreinterpret_mf8_u8 Data type conversion|Reinterpret casts +vreinterpretq_mf8_u8 Data type conversion|Reinterpret casts +vreinterpret_u8_mf8 Data type conversion|Reinterpret casts +vreinterpretq_u8_mf8 Data type conversion|Reinterpret casts vldrq_p128 Load|Load vstrq_p128 Store|Store vaeseq_u8 Cryptography|AES @@ -4488,54 +4488,54 @@ vbfmlalbq_lane_f32 Scalar arithmetic|Vector multiply-accumulate by scalar vbfmlalbq_laneq_f32 Scalar arithmetic|Vector multiply-accumulate by scalar vbfmlaltq_lane_f32 Scalar arithmetic|Vector multiply-accumulate by scalar vbfmlaltq_laneq_f32 Scalar arithmetic|Vector multiply-accumulate by scalar -vcvt1_bf16_fm8_fpm Conversion|Convert to BFloat16 (vector, lower) -vcvt1_low_bf16_fm8_fpm Conversion|Convert to BFloat16 (vector, lower) -vcvt2_bf16_fm8_fpm Conversion|Convert to BFloat16 (vector, lower) -vcvt2_low_bf16_fm8_fpm Conversion|Convert to BFloat16 (vector, lower) -vcvt1_high_bf16_fm8_fpm Conversion|Convert to BFloat16 (vector, upper) -vcvt2_high_bf16_fm8_fpm Conversion|Convert to BFloat16 (vector, upper) -vcvt1_f16_fm8_fpm Conversion|Convert to half-precision (vector, lower) -vcvt1_low_f16_fm8_fpm Conversion|Convert to half-precision (vector, lower) -vcvt2_f16_fm8_fpm Conversion|Convert to half-precision (vector, lower) -vcvt2_low_f16_fm8_fpm Conversion|Convert to half-precision (vector, lower) -vcvt1_high_f16_fm8_fpm Conversion|Convert to half-precision (vector, upper) -vcvt2_high_f16_fm8_fpm Conversion|Convert to half-precision (vector, upper) -vcvt_fm8_f32_fpm Conversion|Convert single-precision to floating point (vector, lower) +vcvt1_bf16_mf8_fpm Conversion|Convert to BFloat16 (vector, lower) +vcvt1_low_bf16_mf8_fpm Conversion|Convert to BFloat16 (vector, lower) +vcvt2_bf16_mf8_fpm Conversion|Convert to BFloat16 (vector, lower) +vcvt2_low_bf16_mf8_fpm Conversion|Convert to BFloat16 (vector, lower) +vcvt1_high_bf16_mf8_fpm Conversion|Convert to BFloat16 (vector, upper) +vcvt2_high_bf16_mf8_fpm Conversion|Convert to BFloat16 (vector, upper) +vcvt1_f16_mf8_fpm Conversion|Convert to half-precision (vector, lower) +vcvt1_low_f16_mf8_fpm Conversion|Convert to 
half-precision (vector, lower) +vcvt2_f16_mf8_fpm Conversion|Convert to half-precision (vector, lower) +vcvt2_low_f16_mf8_fpm Conversion|Convert to half-precision (vector, lower) +vcvt1_high_f16_mf8_fpm Conversion|Convert to half-precision (vector, upper) +vcvt2_high_f16_mf8_fpm Conversion|Convert to half-precision (vector, upper) +vcvt_mf8_f32_fpm Conversion|Convert single-precision to floating point (vector, lower) vcvt_high_f32_fpm Conversion|Convert single-precision to floating point (vector, upper) -vcvt_fm8_f16_fpm Conversion|Convert half-precision to floating point -vcvtq_fm8_f16_fpm Conversion|Convert half-precision to floating point +vcvt_mf8_f16_fpm Conversion|Convert half-precision to floating point +vcvtq_mf8_f16_fpm Conversion|Convert half-precision to floating point vscale_f16 Floating-point adjust exponent by vector vscaleq_f16 Floating-point adjust exponent by vector vscale_f32 Floating-point adjust exponent by vector vscaleq_f32 Floating-point adjust exponent by vector vscaleq_f64 Floating-point adjust exponent by vector -vdot_f32_fm8_fpm Dot product|Floating-point dot product to single-precision (vector) -vdotq_f32_fm8_fpm Dot product|Floating-point dot product to single-precision (vector) -vdot_lane_f32_fm8_fpm Dot product|Floating-point dot product to single-precision (vector, by element) -vdot_laneq_f32_fm8_fpm Dot product|Floating-point dot product to single-precision (vector, by element) -vdotq_lane_f32_fm8_fpm Dot product|Floating-point dot product to single-precision (vector, by element) -vdotq_laneq_f32_fm8_fpm Dot product|Floating-point dot product to single-precision (vector, by element) -vdot_f16_fm8_fpm Dot product|Floating-point dot product to half-precision (vector) -vdotq_f16_fm8_fpm Dot product|Floating-point dot product to half-precision (vector) -vdot_lane_f16_fm8_fpm Dot product|Floating-point dot product to half-prevision (vector, by element) -vdot_laneq_f16_fm8_fpm Dot product|Floating-point dot product to half-prevision (vector, by element) -vdotq_lane_f16_fm8_fpm Dot product|Floating-point dot product to half-prevision (vector, by element) -vdotq_laneq_f16_fm8_fpm Dot product|Floating-point dot product to half-prevision (vector, by element) -vmlalbq_f16_fm8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector) -vmlaltq_f16_fm8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector) -vmlalbq_lane_f16_fm8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element) -vmlalbq_laneq_f16_fm8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element) -vmlaltq_lane_f16_fm8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element) -vmlaltq_laneq_f16_fm8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element) -vmlallbbq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector) -vmlallbtq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector) -vmlalltbq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector) -vmlallttq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector) -vmlallbbq_lane_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element) -vmlallbbq_laneq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element) -vmlallbtq_lane_f32_fm8_fpm 
Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
-vmlallbtq_laneq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
-vmlalltbq_lane_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
-vmlalltbq_laneq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
-vmlallttq_lane_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
-vmlallttq_laneq_f32_fm8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
+vdot_f32_mf8_fpm Dot product|Floating-point dot product to single-precision (vector)
+vdotq_f32_mf8_fpm Dot product|Floating-point dot product to single-precision (vector)
+vdot_lane_f32_mf8_fpm Dot product|Floating-point dot product to single-precision (vector, by element)
+vdot_laneq_f32_mf8_fpm Dot product|Floating-point dot product to single-precision (vector, by element)
+vdotq_lane_f32_mf8_fpm Dot product|Floating-point dot product to single-precision (vector, by element)
+vdotq_laneq_f32_mf8_fpm Dot product|Floating-point dot product to single-precision (vector, by element)
+vdot_f16_mf8_fpm Dot product|Floating-point dot product to half-precision (vector)
+vdotq_f16_mf8_fpm Dot product|Floating-point dot product to half-precision (vector)
+vdot_lane_f16_mf8_fpm Dot product|Floating-point dot product to half-precision (vector, by element)
+vdot_laneq_f16_mf8_fpm Dot product|Floating-point dot product to half-precision (vector, by element)
+vdotq_lane_f16_mf8_fpm Dot product|Floating-point dot product to half-precision (vector, by element)
+vdotq_laneq_f16_mf8_fpm Dot product|Floating-point dot product to half-precision (vector, by element)
+vmlalbq_f16_mf8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector)
+vmlaltq_f16_mf8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector)
+vmlalbq_lane_f16_mf8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element)
+vmlalbq_laneq_f16_mf8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element)
+vmlaltq_lane_f16_mf8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element)
+vmlaltq_laneq_f16_mf8_fpm Multiply-add|Floating-point multiply-add long to half-precision (vector, by element)
+vmlallbbq_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector)
+vmlallbtq_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector)
+vmlalltbq_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector)
+vmlallttq_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector)
+vmlallbbq_lane_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
+vmlallbbq_laneq_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
+vmlallbtq_lane_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
+vmlallbtq_laneq_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
+vmlalltbq_lane_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)
+vmlalltbq_laneq_f32_mf8_fpm
Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element) +vmlallttq_lane_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element) +vmlallttq_laneq_f32_mf8_fpm Multiply-add|Floating-point multiply-add long-long to single-precision (vector, by element)