From 16f9477a53c9f012f00ccb026d3f3fe7e8e298c1 Mon Sep 17 00:00:00 2001
From: Caroline Concatto
Date: Tue, 2 May 2023 15:05:25 +0000
Subject: [PATCH 1/4] Add alpha support for SVE2.1

This patch adds new intrinsics and types for supporting SVE2.1.
---
 main/acle.md | 918 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 918 insertions(+)

diff --git a/main/acle.md b/main/acle.md
index b55633be..c3b26a4d 100644
--- a/main/acle.md
+++ b/main/acle.md
@@ -1851,6 +1851,10 @@ SVE language extensions: the Armv9-A SVE2 extension (FEAT_SVE2) and if the
 associated ACLE intrinsics are available. This implies that
 `__ARM_FEATURE_SVE` is nonzero.
 
+`__ARM_FEATURE_SVE2p1` is defined to 1 if the FEAT_SVE2p1 instructions
+are available and if the associated
+[ACLE features](#sme-language-extensions-and-intrinsics) are supported.
+
 #### NEON-SVE Bridge macros
 
 `__ARM_NEON_SVE_BRIDGE` is defined to 1 if [NEON-SVE Bridge](#neon-sve-bridge)
@@ -2369,6 +2373,7 @@ be found in [[BA]](#BA).
 | [`__ARM_FEATURE_SVE2_SHA3`](#sha3-extension) | SVE2 support for the SHA3 cryptographic extension (FEAT_SVE_SHA3) | 1 |
 | [`__ARM_FEATURE_SVE2_SM3`](#sm3-extension) | SVE2 support for the SM3 cryptographic extension (FEAT_SVE_SM3) | 1 |
 | [`__ARM_FEATURE_SVE2_SM4`](#sm4-extension) | SVE2 support for the SM4 cryptographic extension (FEAT_SVE_SM4) | 1 |
+| [`__ARM_FEATURE_SVE2p1`](#sve2) | SVE version 2.1 (FEAT_SVE2p1) | 1 |
 | [`__ARM_FEATURE_SYSREG128`](#128-bit-system-registers) | Support for 128-bit system registers (FEAT_SYSREG128) | 1 |
 | [`__ARM_FEATURE_UNALIGNED`](#unaligned-access-supported-in-hardware) | Hardware support for unaligned access | 1 |
 | [`__ARM_FP`](#hardware-floating-point) | Hardware floating-point | 1 |
@@ -8636,6 +8641,397 @@ This is because the ACLE intrinsic calls do not imply a particular register
 allocation and so the code generator must decide for itself when move
 instructions are required.
 
+
+### SVE2 BFloat16 data-processing instructions
+
+The instructions in this section are available when `__ARM_FEATURE_B16B16`
+is non-zero.
+
+#### BFADD, BFSUB
+
+BFloat16 floating-point add and subtract (vectors).
+
+``` c
+  svbfloat16_t svadd[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+  svbfloat16_t svadd[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+  svbfloat16_t svadd[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+  svbfloat16_t svadd[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+  svbfloat16_t svadd[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+  svbfloat16_t svadd[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+
+  svbfloat16_t svsub[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+  svbfloat16_t svsub[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+  svbfloat16_t svsub[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+  svbfloat16_t svsub[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+  svbfloat16_t svsub[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+  svbfloat16_t svsub[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ ```
+
+#### BFCLAMP
+
+BFloat16 clamp to minimum/maximum vector.
+
+``` c
+  svbfloat16_t svclamp[_bf16](svbfloat16_t op, svbfloat16_t min, svbfloat16_t max);
+ ```
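+
+The sketch below is illustrative rather than normative: it shows one way the
+predicated `_m` form might be used from C, assuming a toolchain that
+implements the intrinsics above together with the base SVE and BFloat16
+load/store intrinsics. The helper name and loop shape are invented for the
+example:
+
+``` c
+#include <arm_sve.h>
+
+// In-place predicated add of two BFloat16 buffers.
+void add_bf16(bfloat16_t *dst, const bfloat16_t *src, int64_t n) {
+  for (int64_t i = 0; i < n; i += svcnth()) {
+    svbool_t pg = svwhilelt_b16_s64(i, n);   // lanes before the buffer end
+    svbfloat16_t zd = svld1_bf16(pg, &dst[i]);
+    svbfloat16_t zs = svld1_bf16(pg, &src[i]);
+    // _m form: inactive lanes keep the value of the first vector operand.
+    svst1_bf16(pg, &dst[i], svadd_bf16_m(pg, zd, zs));
+  }
+}
+ ```
+
+#### BFMAX, BFMIN
+
+BFloat16 floating-point maximum/minimum (predicated).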
+
+ ``` c
+ svbfloat16_t svmax[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmax[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmax[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmax[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmax[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmax[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+
+ svbfloat16_t svmin[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmin[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmin[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmin[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmin[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmin[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ ```
+
+#### BFMAXNM, BFMINNM
+
+BFloat16 floating-point maximum/minimum number (predicated).
+
+ ``` c
+ svbfloat16_t svmaxnm[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmaxnm[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmaxnm[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmaxnm[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmaxnm[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmaxnm[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+
+ svbfloat16_t svminnm[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svminnm[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svminnm[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svminnm[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svminnm[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svminnm[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ ```
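+
+A non-normative note and sketch: "maximum/minimum number" follows the usual
+FMAXNM/FMINNM convention for other element types, returning the numeric
+operand when exactly one operand is a quiet NaN. The helper name is invented:
+
+``` c
+#include <arm_sve.h>
+
+// NaN-suppressing maximum; the _x form leaves inactive lanes undefined.
+svbfloat16_t max_ignoring_nans(svbool_t pg, svbfloat16_t a, svbfloat16_t b) {
+  return svmaxnm_bf16_x(pg, a, b);
+}
+ ```
+
+#### BFMLA, BFMLS
+
+BFloat16 floating-point fused multiply-add and multiply-subtract (vectors).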
+
+ ``` c
+ svbfloat16_t svmla[_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                             svbfloat16_t zm);
+ svbfloat16_t svmla[_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                             svbfloat16_t zm);
+ svbfloat16_t svmla[_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                             svbfloat16_t zm);
+ svbfloat16_t svmla[_n_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                               bfloat16_t zm);
+ svbfloat16_t svmla[_n_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                               bfloat16_t zm);
+ svbfloat16_t svmla[_n_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                               bfloat16_t zm);
+
+ svbfloat16_t svmla_lane[_bf16](svbfloat16_t zda, svbfloat16_t zn,
+                                svbfloat16_t zm, uint64_t imm_idx);
+
+ svbfloat16_t svmls[_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                             svbfloat16_t zm);
+ svbfloat16_t svmls[_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                             svbfloat16_t zm);
+ svbfloat16_t svmls[_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                             svbfloat16_t zm);
+ svbfloat16_t svmls[_n_bf16]_m(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                               bfloat16_t zm);
+ svbfloat16_t svmls[_n_bf16]_z(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                               bfloat16_t zm);
+ svbfloat16_t svmls[_n_bf16]_x(svbool_t pg, svbfloat16_t zda, svbfloat16_t zn,
+                               bfloat16_t zm);
+
+ svbfloat16_t svmls_lane[_bf16](svbfloat16_t zda, svbfloat16_t zn,
+                                svbfloat16_t zm, uint64_t imm_idx);
+ ```
+
+#### BFMUL
+
+BFloat16 floating-point multiply (vectors).
+
+ ``` c
+ svbfloat16_t svmul[_bf16]_m(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmul[_bf16]_x(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmul[_bf16]_z(svbool_t pg, svbfloat16_t zdn, svbfloat16_t zm);
+ svbfloat16_t svmul[_n_bf16]_m(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmul[_n_bf16]_x(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+ svbfloat16_t svmul[_n_bf16]_z(svbool_t pg, svbfloat16_t zdn, bfloat16_t zm);
+
+ svbfloat16_t svmul_lane[_bf16](svbfloat16_t zn, svbfloat16_t zm,
+                                uint64_t imm_idx);
+ ```
+
+### SVE2.1 instruction intrinsics
+
+The functions in this section are defined by the header file
+[`<arm_sve.h>`](#arm_sve.h) when `__ARM_FEATURE_SVE2p1` is defined.
+
+Some instructions overlap with the SME and SME2 architecture extensions and
+are additionally available in Streaming SVE mode when `__ARM_FEATURE_SME`
+is non-zero or `__ARM_FEATURE_SME2` is non-zero.
+For convenience, these the intrinsics for these instructions are listed in
+the following section.
+
+#### Multi-vector predicates
+
+When `__ARM_FEATURE_SVE2p1` is defined, [`<arm_sve.h>`](#arm_sve.h) defines
+the tuple types `svboolx2_t` and `svboolx4_t`.
+
+These are opaque tuple types that can be accessed using the SVE intrinsics
+`svsetN`, `svgetN` and `svcreateN`. `svundef2` and `svundef4` are also extended
+to work with `svboolx2_t` and `svboolx4_t`. For example:
+
+``` c
+  svbool_t svget2[_b](svboolx2_t tuple, uint64_t imm_index);
+  svboolx2_t svset2[_b](svboolx2_t tuple, uint64_t imm_index, svbool_t x);
+  svboolx2_t svcreate2[_b](svbool_t x, svbool_t y);
+  svboolx2_t svundef2_b();
+```
+
+#### ADDQV, FADDQV
+
+Integer/floating-point add reduction of quadword vector segments.
+
+``` c
+  // Variants are also available for:
+  // _s8, _s16, _u16, _s32, _u32, _s64, _u64,
+  // _f16, _f32, _f64
+  uint8x16_t svaddqv[_u8](svbool_t pg, svuint8_t zn);
+ ```
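+
+Because these reductions return 128-bit NEON vector types, their results can
+feed directly into existing Advanced SIMD code. A non-normative sketch (the
+helper name is invented):
+
+``` c
+#include <arm_neon.h>
+#include <arm_sve.h>
+
+// Reduce the quadword segments of an SVE vector to one 128-bit value,
+// then finish with a plain NEON horizontal add.
+uint32_t checksum(svuint8_t zn) {
+  uint8x16_t q = svaddqv_u8(svptrue_b8(), zn);
+  return vaddlvq_u8(q);
+}
+ ```
+
+#### ANDQV, EORQV, ORQV
+
+Bitwise AND, EOR and OR reduction of quadword vector segments.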
+
+``` c
+  // Variants are also available for:
+  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
+  uint8x16_t svandqv[_u8](svbool_t pg, svuint8_t zn);
+  uint8x16_t sveorqv[_u8](svbool_t pg, svuint8_t zn);
+  uint8x16_t svorqv[_u8](svbool_t pg, svuint8_t zn);
+ ```
+
+#### DUPQ
+
+Broadcast indexed element within each quadword vector segment.
+
+``` c
+  // Variants are also available for:
+  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svuint8_t svdup_laneq[_u8](svuint8_t zn, uint64_t imm_idx);
+ ```
+
+#### EXTQ
+
+Extract vector segment from each pair of quadword segments.
+
+``` c
+  // Variants are also available for:
+  // _s8, _s16, _u16, _s32, _u32, _s64, _u64
+  // _bf16, _f16, _f32, _f64
+  svuint8_t svextq[_u8](svuint8_t zdn, svuint8_t zm, uint64_t imm);
+ ```
+
+#### LD1D, LD1W
+
+Contiguous zero-extending load to quadword (single vector).
+
+``` c
+  // Variants are also available for:
+  // _u32, _s32
+  svfloat32_t svld1uwq[_f32](svbool_t, const float32_t *ptr);
+  svfloat32_t svld1uwq_vnum[_f32](svbool_t, const float32_t *ptr, int64_t vnum);
+
+
+  // Variants are also available for:
+  // _u64, _s64
+  svfloat64_t svld1udq[_f64](svbool_t, const float64_t *ptr);
+  svfloat64_t svld1udq_vnum[_f64](svbool_t, const float64_t *ptr, int64_t vnum);
+ ```
+
+#### LD1Q
+
+Gather load quadwords.
+
+``` c
+  // Variants are also available for:
+  // _u8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svint8_t svld1q_gather[_u64base]_s8(svbool_t pg, svuint64_t zn);
+  svint8_t svld1q_gather[_u64base]_offset_s8(svbool_t pg, svuint64_t zn, int64_t offset);
+  svint8_t svld1q_gather_[u64]offset[_s8](svbool_t pg, const int8_t *base, svuint64_t offset);
+
+
+  // Variants are also available for:
+  // _u16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svint16_t svld1q_gather_[u64]index[_s16](svbool_t pg, const int16_t *base, svuint64_t index);
+  svint16_t svld1q_gather[_u64base]_index_s16(svbool_t pg, svuint64_t zn, int64_t index);
+ ```
+
+#### LD2Q, LD3Q, LD4Q
+
+Contiguous load of two, three or four quadword structures.
+
+``` c
+  // Variants are also available for:
+  // _u8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svint8x2_t svld2q[_s8](svbool_t pg, const int8_t *rn);
+  svint8x2_t svld2q_vnum[_s8](svbool_t pg, const int8_t *rn, int64_t vnum);
+  svint8x3_t svld3q[_s8](svbool_t pg, const int8_t *rn);
+  svint8x3_t svld3q_vnum[_s8](svbool_t pg, const int8_t *rn, int64_t vnum);
+  svint8x4_t svld4q[_s8](svbool_t pg, const int8_t *rn);
+  svint8x4_t svld4q_vnum[_s8](svbool_t pg, const int8_t *rn, int64_t vnum);
+ ```
+
+#### UMAXQV, SMAXQV, FMAXQV, UMINQV, SMINQV, FMINQV
+
+Max/Min reduction of quadword vector segments.
+
+``` c
+  // Variants are also available for:
+  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _f16, _f32, _f64
+  uint8x16_t svmaxqv[_u8](svbool_t pg, svuint8_t zn);
+  uint8x16_t svminqv[_u8](svbool_t pg, svuint8_t zn);
+ ```
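+
+A non-normative sketch of a predicated quadword reduction (the helper name is
+invented); only the elements selected by the governing predicate take part:
+
+``` c
+#include <arm_neon.h>
+#include <arm_sve.h>
+
+// Unsigned maximum reduction restricted to the first n byte lanes.
+uint8x16_t lane_max_first_n(svuint8_t zn, int64_t n) {
+  svbool_t pg = svwhilelt_b8_s64(0, n);   // first n byte lanes active
+  return svmaxqv_u8(pg, zn);
+}
+ ```
+
+#### FMAXNMQV, FMINNMQV
+
+Max/Min number recursive reduction of quadword vector segments.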
+
+``` c
+  // Variants are also available for _f32, _f64
+  float16x8_t svmaxnmqv[_f16](svbool_t pg, svfloat16_t zn);
+  float16x8_t svminnmqv[_f16](svbool_t pg, svfloat16_t zn);
+ ```
+
+#### PMOV
+
+Move predicate to/from vector.
+
+``` c
+  // Variants are available for:
+  // _s8, _u16, _s16, _s32, _u32, _s64, _u64
+  svbool_t svpmov_lane[_u8](svuint8_t zn, uint64_t imm);
+
+  // Variants are available for:
+  // _s8, _s16, _u16, _s32, _u32, _s64, _u64
+  svbool_t svpmov[_u8](svuint8_t zn);
+
+  // Variants are available for:
+  // _s16, _s32, _u32, _s64, _u64
+  svuint16_t svpmov_lane[_u16]_m(svuint16_t zd, svbool_t pn, uint64_t imm);
+
+  // Variants are also available for:
+  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
+  svuint8_t svpmov_u8_z(svbool_t pn);
+ ```
+
+#### ST1D, ST1W
+
+Contiguous store of single vector operand, truncating from quadword.
+
+``` c
+  // Variants are also available for:
+  // _u32, _s32
+  void svst1wq[_f32](svbool_t, float32_t *ptr, svfloat32_t data);
+  void svst1wq_vnum[_f32](svbool_t, float32_t *ptr, int64_t vnum, svfloat32_t data);
+
+
+  // Variants are also available for:
+  // _u64, _s64
+  void svst1dq[_f64](svbool_t, float64_t *ptr, svfloat64_t data);
+  void svst1dq_vnum[_f64](svbool_t, float64_t *ptr, int64_t vnum, svfloat64_t data);
+ ```
+
+#### ST1Q
+
+Scatter store quadwords.
+
+``` c
+  // Variants are also available for:
+  // _u8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  void svst1q_scatter[_u64base][_s8](svbool_t pg, svuint64_t zn, svint8_t data);
+  void svst1q_scatter[_u64base]_offset[_s8](svbool_t pg, svuint64_t zn, int64_t offset, svint8_t data);
+  void svst1q_scatter_[u64]offset[_s8](svbool_t pg, int8_t *base, svuint64_t offset, svint8_t data);
+
+  // Variants are also available for:
+  // _u16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  void svst1q_scatter[_u64base]_index[_s16](svbool_t pg, svuint64_t zn, int64_t index, svint16_t data);
+  void svst1q_scatter_[u64]index[_s16](svbool_t pg, int16_t *base, svuint64_t index, svint16_t data);
+ ```
+
+#### ST2Q, ST3Q, ST4Q
+
+Contiguous store of two, three or four quadword structures.
+
+``` c
+  // Variants are also available for:
+  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  void svst2q[_u8](svbool_t pg, uint8_t *rn, svuint8x2_t zt);
+  void svst2q_vnum[_u8](svbool_t pg, uint8_t *rn, int64_t vnum, svuint8x2_t zt);
+  void svst3q[_u8](svbool_t pg, uint8_t *rn, svuint8x3_t zt);
+  void svst3q_vnum[_u8](svbool_t pg, uint8_t *rn, int64_t vnum, svuint8x3_t zt);
+  void svst4q[_u8](svbool_t pg, uint8_t *rn, svuint8x4_t zt);
+  void svst4q_vnum[_u8](svbool_t pg, uint8_t *rn, int64_t vnum, svuint8x4_t zt);
+ ```
+
+#### TBLQ
+
+Programmable table lookup within each quadword vector segment (zeroing).
+
+``` c
+  // Variants are also available for:
+  // _u8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svint8_t svtblq[_s8](svint8_t zn, svuint8_t zm);
+ ```
+
+#### TBXQ
+
+Programmable table lookup within each quadword vector segment (merging).
+
+``` c
+  // Variants are also available for:
+  // _u8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svint8_t svtbxq[_s8](svint8_t fallback, svint8_t zn, svuint8_t zm);
+ ```
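+
+A non-normative sketch of a per-segment permutation with TBLQ (the helper
+name is invented); index values select elements within the same 128-bit
+segment:
+
+``` c
+#include <arm_sve.h>
+
+// Reverse the 16 bytes inside every 128-bit segment.
+svuint8_t reverse_within_segments(svuint8_t zn) {
+  svbool_t pg = svptrue_b8();
+  svuint8_t pos = svand_n_u8_x(pg, svindex_u8(0, 1), 15);  // 0..15 per segment
+  svuint8_t idx = svsub_u8_x(pg, svdup_n_u8(15), pos);     // 15..0 per segment
+  return svtblq_u8(zn, idx);
+}
+ ```
+
+#### UZPQ1, UZPQ2
+
+Concatenate even (UZPQ1) or odd (UZPQ2) elements within each pair of
+quadword vector segments.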
+
+``` c
+  // Variants are also available for:
+  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svuint8_t svuzpq1[_u8](svuint8_t zn, svuint8_t zm);
+  svuint8_t svuzpq2[_u8](svuint8_t zn, svuint8_t zm);
+ ```
+
+#### ZIPQ1, ZIPQ2
+
+Interleave elements from the low (ZIPQ1) or high (ZIPQ2) halves of each
+pair of quadword vector segments.
+
+``` c
+  // Variants are also available for:
+  // _s8, _u16, _s16, _u32, _s32, _u64, _s64
+  // _bf16, _f16, _f32, _f64
+  svuint8_t svzipq1[_u8](svuint8_t zn, svuint8_t zm);
+  svuint8_t svzipq2[_u8](svuint8_t zn, svuint8_t zm);
+ ```
+
 # SME language extensions and intrinsics
 
 The specification for SME is in
@@ -11978,6 +12374,528 @@ are named after. All of the functions have external linkage.
         __arm_streaming_compatible;
   ```
 
+### SVE2.1 and SME2 instruction intrinsics
+
+These intrinsics can only be called from non-streaming code if
+`__ARM_FEATURE_SVE2p1` is defined. They can only be called from streaming code
+if the appropriate SME feature macro is defined (see next paragraph).
+They can only be called from streaming-compatible code if they could be called
+from both non-streaming code and streaming code.
+
+The functions in this section are defined by either the header file
+[`<arm_sve.h>`](#arm_sve.h) or [`<arm_sme.h>`](#arm_sme.h)
+when `__ARM_FEATURE_SVE2p1` or `__ARM_FEATURE_SME2` is defined, respectively.
+
+Most functions in this section are SME2 or SVE2.1. However, some are also
+available in SME. For convenience, the ones available in SME are tagged
+with `[SME]`.
+
+#### UCLAMP, SCLAMP, FCLAMP
+
+Clamp to minimum/maximum vector.
+
+``` c
+  // Variants are also available for:
+  // _s8, _u8, _s16, _u16, _s32, _u32 [SME]
+  // _f32, _s64, _u64 and _f64
+  svfloat16_t svclamp[_f16](svfloat16_t op, svfloat16_t min, svfloat16_t max);
+ ```
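+
+A non-normative sketch (the helper name is invented); this assumes
+`__ARM_FEATURE_SVE2p1` for non-streaming code, or the appropriate SME macro
+in streaming code:
+
+``` c
+#include <arm_sve.h>
+
+// Clamp each half-precision element of op to the range given by the
+// corresponding elements of lo and hi.
+svfloat16_t saturate(svfloat16_t op, svfloat16_t lo, svfloat16_t hi) {
+  return svclamp_f16(op, lo, hi);
+}
+ ```
+
+#### CNTP
+
+Set scalar to count from predicate-as-counter. ``vl`` is expected to be 2 or 4.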
+ +``` c + // Variants are also available for _c16, _c32 and _c64 + uint64_t svcntp_c8(svcount_t pnn, uint64_t vl); + ``` + +#### UDOT, SDOT, FDOT (vectors) + +Multi-vector dot-product (2-way) + +``` c + // Variants are also available for _s32_s16 and _u32_u16 + svfloat32_t svdot[_f32_f16](svfloat32_t zda, svfloat16_t zn, + svfloat16_t zm); + ``` + +#### UDOT, SDOT, FDOT (indexed) + +Multi-vector dot-product (2-way) + +``` c + // Variants are also available for _s32_s16 and _u32_u16 + svfloat32_t svdot_lane[_f32_f16](svfloat32_t zda, svfloat16_t zn, + svfloat16_t zm, uint64_t imm_idx); + ``` + +#### LD1B, LD1D, LD1H, LD1W + +Contiguous load to multi-vector + +``` c + // Variants are also available for _s8 + svuint8x2_t svld1[_u8]_x2(svcount_t png, const uint8_t *rn); + + + // Variants are also available for _s8 + svuint8x4_t svld1[_u8]_x4(svcount_t png, const uint8_t *rn); + + + // Variants are also available for _s8 + svuint8x2_t svld1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn, + int64_t vnum); + + + // Variants are also available for _s8 + svuint8x4_t svld1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn, + int64_t vnum); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x2_t svld1[_u16]_x2(svcount_t png, const uint16_t *rn); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x4_t svld1[_u16]_x4(svcount_t png, const uint16_t *rn); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x2_t svld1_vnum[_u16]_x2(svcount_t png, const uint16_t *rn, + int64_t vnum); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x4_t svld1_vnum[_u16]_x4(svcount_t png, const uint16_t *rn, + int64_t vnum); + + + // Variants are also available for _s32 and _f32 + svuint32x2_t svld1[_u32]_x2(svcount_t png, const uint32_t *rn); + + + // Variants are also available for _s32 and _f32 + svuint32x4_t svld1[_u32]_x4(svcount_t png, const uint32_t *rn); + + + // Variants are also available for _s32 and _f32 + svuint32x2_t svld1_vnum[_u32]_x2(svcount_t png, const uint32_t *rn, + int64_t vnum); + + + // Variants are also available for _s32 and _f32 + svuint32x4_t svld1_vnum[_u32]_x4(svcount_t png, const uint32_t *rn, + int64_t vnum); + + + // Variants are also available for _s64 and _f64 + svuint64x2_t svld1[_u64]_x2(svcount_t png, const uint64_t *rn); + + + // Variants are also available for _s64 and _f64 + svuint64x4_t svld1[_u64]_x4(svcount_t png, const uint64_t *rn); + + + // Variants are also available for _s64 and _f64 + svuint64x2_t svld1_vnum[_u64]_x2(svcount_t png, const uint64_t *rn, + int64_t vnum); + + + // Variants are also available for _s64 and _f64 + svuint64x4_t svld1_vnum[_u64]_x4(svcount_t png, const uint64_t *rn, + int64_t vnum); + ``` + +#### LDNT1B, LDNT1D, LDNT1H, LDNT1W + +Contiguous non-temporal load to multi-vector + +``` c + // Variants are also available for _s8 + svuint8x2_t svldnt1[_u8]_x2(svcount_t png, const uint8_t *rn); + + + // Variants are also available for _s8 + svuint8x4_t svldnt1[_u8]_x4(svcount_t png, const uint8_t *rn); + + + // Variants are also available for _s8 + svuint8x2_t svldnt1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn, + int64_t vnum); + + + // Variants are also available for _s8 + svuint8x4_t svldnt1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn, + int64_t vnum); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x2_t svldnt1[_u16]_x2(svcount_t png, const uint16_t *rn); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x4_t 
svldnt1[_u16]_x4(svcount_t png, const uint16_t *rn); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x2_t svldnt1_vnum[_u16]_x2(svcount_t png, const uint16_t *rn, + int64_t vnum); + + + // Variants are also available for _s16, _f16 and _bf16 + svuint16x4_t svldnt1_vnum[_u16]_x4(svcount_t png, const uint16_t *rn, + int64_t vnum); + + + // Variants are also available for _s32 and _f32 + svuint32x2_t svldnt1[_u32]_x2(svcount_t png, const uint32_t *rn); + + + // Variants are also available for _s32 and _f32 + svuint32x4_t svldnt1[_u32]_x4(svcount_t png, const uint32_t *rn); + + + // Variants are also available for _s32 and _f32 + svuint32x2_t svldnt1_vnum[_u32]_x2(svcount_t png, const uint32_t *rn, + int64_t vnum); + + + // Variants are also available for _s32 and _f32 + svuint32x4_t svldnt1_vnum[_u32]_x4(svcount_t png, const uint32_t *rn, + int64_t vnum); + + + // Variants are also available for _s64 and _f64 + svuint64x2_t svldnt1[_u64]_x2(svcount_t png, const uint64_t *rn); + + + // Variants are also available for _s64 and _f64 + svuint64x4_t svldnt1[_u64]_x4(svcount_t png, const uint64_t *rn); + + + // Variants are also available for _s64 and _f64 + svuint64x2_t svldnt1_vnum[_u64]_x2(svcount_t png, const uint64_t *rn, + int64_t vnum); + + + // Variants are also available for _s64 and _f64 + svuint64x4_t svldnt1_vnum[_u64]_x4(svcount_t png, const uint64_t *rn, + int64_t vnum); + ``` + +#### BFMLSLB, BFMLSLT + +BFloat16 floating-point multiply-subtract long from single-precision (top/bottom) + +``` c + svfloat32_t svbfmlslb[_f32](svfloat32_t zda, svbfloat16_t zn, svbfloat16_t zm); + + + svfloat32_t svbfmlslb_lane[_f32](svfloat32_t zda, svbfloat16_t zn, + svbfloat16_t zm, uint64_t imm_idx); + + + svfloat32_t svbfmlslt[_f32](svfloat32_t zda, svbfloat16_t zn, svbfloat16_t zm); + + + svfloat32_t svbfmlslt_lane[_f32](svfloat32_t zda, svbfloat16_t zn, + svbfloat16_t zm, uint64_t imm_idx); + ``` + +#### PEXT + +Transform a predicate-as-counter to a predicate (pair). + +``` c + // Variants are also available for _c16, _c32 and _c64 + svbool_t svpext_lane_c8(svcount_t pnn, uint64_t imm); + + + // Variants are also available for _c16, _c32 and _c64 + svboolx2_t svpext_lane_c8_x2(svcount_t pnn, uint64_t imm); + ``` + +#### PSEL + +Predicate select between predicate value or all-false + +``` c + // Variants are also available for _b16, _b32 and _b64 [SME] + svbool_t svpsel_lane_b8(svbool_t pn, svbool_t pm, uint32_t idx); + + + // Variants are also available for _c16, _c32 and _c64 + svcount_t svpsel_lane_c8(svcount_t pn, svbool_t pm, uint32_t idx); + ``` + +#### PTRUE, PFALSE + +Initialise predicate-as-counter to all active or all inactive. + +``` c + // Variants are also available for _c16, _c32 and _c64 + svcount_t svptrue_c8(); + + + svcount_t svpfalse_c(void); +``` + +#### REVD + +Reverse doublewords in elements. 
+ +``` c + // All the intrinsics below are [SME] + // Variants are available for: + // _s8, _s16, _u16, _s32, _u32, _s64, _u64 + // _bf16, _f16, _f32, _f64 + svuint8_t svrevd[_u8]_m(svuint8_t zd, svbool_t pg, svuint8_t zn); + + + // Variants are available for: + // _s8, _s16, _u16, _s32, _u32, _s64, _u64 + // _bf16, _f16, _f32, _f64 + svuint8_t svrevd[_u8]_z(svbool_t pg, svuint8_t zn); + + + // Variants are available for: + // _s8, _s16, _u16, _s32, _u32, _s64, _u64 + // _bf16, _f16, _f32, _f64 + svuint8_t svrevd[_u8]_x(svbool_t pg, svuint8_t zn); + ``` + +#### SQCVTN, SQCVTUN, UQCVTN + +Multi-vector saturating extract narrow and interleave + +``` c + // Variants are also available for _u16[_s32_x2] and _u16[_u32_x2] + svint16_t svqcvtn_s16[_s32_x2](svint32x2_t zn); + ``` + +#### SQRSHRN, UQRSHRN + +Multi-vector saturating rounding shift right narrow and interleave + +``` c + // Variants are also available for _u16[_u32_x2] + svint16_t svqrshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm); + ``` + +#### SQRSHRUN + +Multi-vector saturating rounding shift right unsigned narrow and interleave + +``` c + svuint16_t svqrshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm); + ``` + +#### ST1B, ST1D, ST1H, ST1W + +Contiguous store of multi-vector operand + +``` c + // Variants are also available for _s8_x2 + void svst1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt); + + + // Variants are also available for _s8_x4 + void svst1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt); + + + // Variants are also available for _s8_x2 + void svst1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum, + svuint8x2_t zt); + + + // Variants are also available for _s8_x4 + void svst1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum, + svuint8x4_t zt); + + + // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 + void svst1[_u16_x2](svcount_t png, uint16_t *rn, svuint16x2_t zt); + + + // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 + void svst1[_u16_x4](svcount_t png, uint16_t *rn, svuint16x4_t zt); + + + // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 + void svst1_vnum[_u16_x2](svcount_t png, uint16_t *rn, int64_t vnum, + svuint16x2_t zt); + + + // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 + void svst1_vnum[_u16_x4](svcount_t png, uint16_t *rn, int64_t vnum, + svuint16x4_t zt); + + + // Variants are also available for _s32_x2 and _f32_x2 + void svst1[_u32_x2](svcount_t png, uint32_t *rn, svuint32x2_t zt); + + + // Variants are also available for _s32_x4 and _f32_x4 + void svst1[_u32_x4](svcount_t png, uint32_t *rn, svuint32x4_t zt); + + + // Variants are also available for _s32_x2 and _f32_x2 + void svst1_vnum[_u32_x2](svcount_t png, uint32_t *rn, int64_t vnum, + svuint32x2_t zt); + + + // Variants are also available for _s32_x4 and _f32_x4 + void svst1_vnum[_u32_x4](svcount_t png, uint32_t *rn, int64_t vnum, + svuint32x4_t zt); + + + // Variants are also available for _s64_x2 and _f64_x2 + void svst1[_u64_x2](svcount_t png, uint64_t *rn, svuint64x2_t zt); + + + // Variants are also available for _s64_x4 and _f64_x4 + void svst1[_u64_x4](svcount_t png, uint64_t *rn, svuint64x4_t zt); + + + // Variants are also available for _s64_x2 and _f64_x2 + void svst1_vnum[_u64_x2](svcount_t png, uint64_t *rn, int64_t vnum, + svuint64x2_t zt); + + + // Variants are also available for _s64_x4 and _f64_x4 + void svst1_vnum[_u64_x4](svcount_t png, uint64_t *rn, int64_t vnum, + svuint64x4_t zt); + ``` + +#### STNT1B, STNT1D, STNT1H, STNT1W + 
+Contiguous non-temporal store of multi-vector operand + +``` c + // Variants are also available for _s8_x2 + void svstnt1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt); + + + // Variants are also available for _s8_x4 + void svstnt1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt); + + + // Variants are also available for _s8_x2 + void svstnt1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum, + svuint8x2_t zt); + + + // Variants are also available for _s8_x4 + void svstnt1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum, + svuint8x4_t zt); + + + // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 + void svstnt1[_u16_x2](svcount_t png, uint16_t *rn, svuint16x2_t zt); + + + // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 + void svstnt1[_u16_x4](svcount_t png, uint16_t *rn, svuint16x4_t zt); + + + // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 + void svstnt1_vnum[_u16_x2](svcount_t png, uint16_t *rn, int64_t vnum, + svuint16x2_t zt); + + + // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 + void svstnt1_vnum[_u16_x4](svcount_t png, uint16_t *rn, int64_t vnum, + svuint16x4_t zt); + + + // Variants are also available for _s32_x2 and _f32_x2 + void svstnt1[_u32_x2](svcount_t png, uint32_t *rn, svuint32x2_t zt); + + + // Variants are also available for _s32_x4 and _f32_x4 + void svstnt1[_u32_x4](svcount_t png, uint32_t *rn, svuint32x4_t zt); + + + // Variants are also available for _s32_x2 and _f32_x2 + void svstnt1_vnum[_u32_x2](svcount_t png, uint32_t *rn, int64_t vnum, + svuint32x2_t zt); + + + // Variants are also available for _s32_x4 and _f32_x4 + void svstnt1_vnum[_u32_x4](svcount_t png, uint32_t *rn, int64_t vnum, + svuint32x4_t zt); + + + // Variants are also available for _s64_x2 and _f64_x2 + void svstnt1[_u64_x2](svcount_t png, uint64_t *rn, svuint64x2_t zt); + + + // Variants are also available for _s64_x4 and _f64_x4 + void svstnt1[_u64_x4](svcount_t png, uint64_t *rn, svuint64x4_t zt); + + + // Variants are also available for _s64_x2 and _f64_x2 + void svstnt1_vnum[_u64_x2](svcount_t png, uint64_t *rn, int64_t vnum, + svuint64x2_t zt); + + + // Variants are also available for _s64_x4 and _f64_x4 + void svstnt1_vnum[_u64_x4](svcount_t png, uint64_t *rn, int64_t vnum, + svuint64x4_t zt); + ``` + +#### WHILEGE, WHILEGT, WHILEHI, WHILEHS, WHILELE, WHILELO, WHILELS, WHILELT + +While (resulting in predicate-as-counter). ``vl`` is expected to be 2 or 4. 
+ +``` c + // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64], + // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64] + svcount_t svwhilege_c8[_s64](int64_t rn, int64_t rm, uint64_t vl); + + + // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64], + // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64] + svcount_t svwhilegt_c8[_s64](int64_t rn, int64_t rm, uint64_t vl); + + + // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64], + // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64] + svcount_t svwhilele_c8[_s64](int64_t rn, int64_t rm, uint64_t vl); + + + // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64], + // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64] + svcount_t svwhilelt_c8[_s64](int64_t rn, int64_t rm, uint64_t vl); + ``` + +While (resulting in predicate tuple) + +``` c + // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2, + // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and + // _b64[_u64]_x2 + svboolx2_t svwhilege_b8[_s64]_x2(int64_t rn, int64_t rm); + + + // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2, + // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and + // _b64[_u64]_x2 + svboolx2_t svwhilegt_b8[_s64]_x2(int64_t rn, int64_t rm); + + + // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2, + // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and + // _b64[_u64]_x2 + svboolx2_t svwhilele_b8[_s64]_x2(int64_t rn, int64_t rm); + + + // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2, + // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and + // _b64[_u64]_x2 + svboolx2_t svwhilelt_b8[_s64]_x2(int64_t rn, int64_t rm); + ``` # M-profile Vector Extension (MVE) intrinsics From e9e34505ea41d122683764ba005caa574320f8b5 Mon Sep 17 00:00:00 2001 From: Caroline Concatto Date: Thu, 4 Apr 2024 16:13:22 +0000 Subject: [PATCH 2/4] Remove from SME2 intriniscs that are common with SVE2.1 --- main/acle.md | 541 --------------------------------------------------- 1 file changed, 541 deletions(-) diff --git a/main/acle.md b/main/acle.md index c3b26a4d..6922382d 100644 --- a/main/acle.md +++ b/main/acle.md @@ -10559,37 +10559,11 @@ Multi-vector saturating extract narrow Multi-vector saturating extract narrow and interleave ``` c - // Variants are also available for _u16[_s32_x2] and _u16[_u32_x2] - svint16_t svqcvtn_s16[_s32_x2](svint32x2_t zn) __arm_streaming_compatible; - - // Variants are also available for _u8[_s32_x4], _u8[_u32_x4], _s16[_s64_x4], // _u16[_s64_x4] and _u16[_u64_x4] svint8_t svqcvtn_s8[_s32_x4](svint32x4_t zn) __arm_streaming; ``` -#### UDOT, SDOT, FDOT (vectors) - -Multi-vector dot-product (2-way) - -``` c - // Variants are also available for _s32_s16 and _u32_u16 - svfloat32_t svdot[_f32_f16](svfloat32_t zda, svfloat16_t zn, - svfloat16_t zm) - __arm_streaming_compatible; - ``` - -#### UDOT, SDOT, FDOT (indexed) - -Multi-vector dot-product (2-way) - -``` c - // Variants are also available for _s32_s16 and _u32_u16 - svfloat32_t svdot_lane[_f32_f16](svfloat32_t zda, svfloat16_t zn, - svfloat16_t zm, uint64_t imm_idx) - __arm_streaming_compatible; - ``` - #### FDOT, BFDOT, SUDOT, USDOT, SDOT, UDOT (store into ZA, single) Multi-vector dot-product (2-way and 4-way) @@ -11305,29 +11279,6 @@ Multi-vector multiply-subtract long long (widening) __arm_streaming __arm_inout("za"); ``` -#### BFMLSLB, BFMLSLT - -BFloat16 floating-point multiply-subtract long from single-precision (top/bottom) - -``` c - svfloat32_t 
svbfmlslb[_f32](svfloat32_t zda, svbfloat16_t zn, svbfloat16_t zm) - __arm_streaming_compatible; - - - svfloat32_t svbfmlslb_lane[_f32](svfloat32_t zda, svbfloat16_t zn, - svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming_compatible; - - - svfloat32_t svbfmlslt[_f32](svfloat32_t zda, svbfloat16_t zn, svbfloat16_t zm) - __arm_streaming_compatible; - - - svfloat32_t svbfmlslt_lane[_f32](svfloat32_t zda, svbfloat16_t zn, - svbfloat16_t zm, uint64_t imm_idx) - __arm_streaming_compatible; - ``` - #### SMAX, SMIN, UMAX, UMIN, FMAX, FMIN (single) Multi-vector min/max @@ -11469,378 +11420,6 @@ Multi-vector floating-point round to integral value svfloat32x4_t svrintp[_f32_x4](svfloat32x4_t zn) __arm_streaming; ``` -#### LD1B, LD1D, LD1H, LD1W - -Contiguous load to multi-vector - -``` c - // Variants are also available for _s8 - svuint8x2_t svld1[_u8]_x2(svcount_t png, const uint8_t *rn) - __arm_streaming; - - - // Variants are also available for _s8 - svuint8x4_t svld1[_u8]_x4(svcount_t png, const uint8_t *rn) - __arm_streaming; - - - // Variants are also available for _s8 - svuint8x2_t svld1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s8 - svuint8x4_t svld1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x2_t svld1[_u16]_x2(svcount_t png, const uint16_t *rn) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x4_t svld1[_u16]_x4(svcount_t png, const uint16_t *rn) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x2_t svld1_vnum[_u16]_x2(svcount_t png, const uint16_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x4_t svld1_vnum[_u16]_x4(svcount_t png, const uint16_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x2_t svld1[_u32]_x2(svcount_t png, const uint32_t *rn) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x4_t svld1[_u32]_x4(svcount_t png, const uint32_t *rn) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x2_t svld1_vnum[_u32]_x2(svcount_t png, const uint32_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x4_t svld1_vnum[_u32]_x4(svcount_t png, const uint32_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x2_t svld1[_u64]_x2(svcount_t png, const uint64_t *rn) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x4_t svld1[_u64]_x4(svcount_t png, const uint64_t *rn) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x2_t svld1_vnum[_u64]_x2(svcount_t png, const uint64_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x4_t svld1_vnum[_u64]_x4(svcount_t png, const uint64_t *rn, - int64_t vnum) - __arm_streaming; - ``` - -#### LDNT1B, LDNT1D, LDNT1H, LDNT1W - -Contiguous non-temporal load to multi-vector - -``` c - // Variants are also available for _s8 - svuint8x2_t svldnt1[_u8]_x2(svcount_t png, const uint8_t *rn) - __arm_streaming; - - - // Variants are also available for _s8 - svuint8x4_t svldnt1[_u8]_x4(svcount_t png, const uint8_t *rn) - __arm_streaming; - - - // Variants are also available for _s8 
- svuint8x2_t svldnt1_vnum[_u8]_x2(svcount_t png, const uint8_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s8 - svuint8x4_t svldnt1_vnum[_u8]_x4(svcount_t png, const uint8_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x2_t svldnt1[_u16]_x2(svcount_t png, const uint16_t *rn) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x4_t svldnt1[_u16]_x4(svcount_t png, const uint16_t *rn) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x2_t svldnt1_vnum[_u16]_x2(svcount_t png, const uint16_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s16, _f16 and _bf16 - svuint16x4_t svldnt1_vnum[_u16]_x4(svcount_t png, const uint16_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x2_t svldnt1[_u32]_x2(svcount_t png, const uint32_t *rn) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x4_t svldnt1[_u32]_x4(svcount_t png, const uint32_t *rn) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x2_t svldnt1_vnum[_u32]_x2(svcount_t png, const uint32_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s32 and _f32 - svuint32x4_t svldnt1_vnum[_u32]_x4(svcount_t png, const uint32_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x2_t svldnt1[_u64]_x2(svcount_t png, const uint64_t *rn) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x4_t svldnt1[_u64]_x4(svcount_t png, const uint64_t *rn) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x2_t svldnt1_vnum[_u64]_x2(svcount_t png, const uint64_t *rn, - int64_t vnum) - __arm_streaming; - - - // Variants are also available for _s64 and _f64 - svuint64x4_t svldnt1_vnum[_u64]_x4(svcount_t png, const uint64_t *rn, - int64_t vnum) - __arm_streaming; - ``` - -#### ST1B, ST1D, ST1H, ST1W - -Contiguous store of multi-vector operand - -``` c - // Variants are also available for _s8_x2 - void svst1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt) - __arm_streaming; - - - // Variants are also available for _s8_x4 - void svst1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt) - __arm_streaming; - - - // Variants are also available for _s8_x2 - void svst1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum, - svuint8x2_t zt) - __arm_streaming; - - - // Variants are also available for _s8_x4 - void svst1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum, - svuint8x4_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 - void svst1[_u16_x2](svcount_t png, uint16_t *rn, svuint16x2_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 - void svst1[_u16_x4](svcount_t png, uint16_t *rn, svuint16x4_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 - void svst1_vnum[_u16_x2](svcount_t png, uint16_t *rn, int64_t vnum, - svuint16x2_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 - void svst1_vnum[_u16_x4](svcount_t png, uint16_t *rn, int64_t vnum, - svuint16x4_t zt) - __arm_streaming; - - - // Variants are also available for _s32_x2 and _f32_x2 - void svst1[_u32_x2](svcount_t png, uint32_t *rn, svuint32x2_t 
zt) - __arm_streaming; - - - // Variants are also available for _s32_x4 and _f32_x4 - void svst1[_u32_x4](svcount_t png, uint32_t *rn, svuint32x4_t zt) - __arm_streaming; - - - // Variants are also available for _s32_x2 and _f32_x2 - void svst1_vnum[_u32_x2](svcount_t png, uint32_t *rn, int64_t vnum, - svuint32x2_t zt) - __arm_streaming; - - - // Variants are also available for _s32_x4 and _f32_x4 - void svst1_vnum[_u32_x4](svcount_t png, uint32_t *rn, int64_t vnum, - svuint32x4_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x2 and _f64_x2 - void svst1[_u64_x2](svcount_t png, uint64_t *rn, svuint64x2_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x4 and _f64_x4 - void svst1[_u64_x4](svcount_t png, uint64_t *rn, svuint64x4_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x2 and _f64_x2 - void svst1_vnum[_u64_x2](svcount_t png, uint64_t *rn, int64_t vnum, - svuint64x2_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x4 and _f64_x4 - void svst1_vnum[_u64_x4](svcount_t png, uint64_t *rn, int64_t vnum, - svuint64x4_t zt) - __arm_streaming; - ``` - -#### STNT1B, STNT1D, STNT1H, STNT1W - -Contiguous non-temporal store of multi-vector operand - -``` c - // Variants are also available for _s8_x2 - void svstnt1[_u8_x2](svcount_t png, uint8_t *rn, svuint8x2_t zt) - __arm_streaming; - - - // Variants are also available for _s8_x4 - void svstnt1[_u8_x4](svcount_t png, uint8_t *rn, svuint8x4_t zt) - __arm_streaming; - - - // Variants are also available for _s8_x2 - void svstnt1_vnum[_u8_x2](svcount_t png, uint8_t *rn, int64_t vnum, - svuint8x2_t zt) - __arm_streaming; - - - // Variants are also available for _s8_x4 - void svstnt1_vnum[_u8_x4](svcount_t png, uint8_t *rn, int64_t vnum, - svuint8x4_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 - void svstnt1[_u16_x2](svcount_t png, uint16_t *rn, svuint16x2_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 - void svstnt1[_u16_x4](svcount_t png, uint16_t *rn, svuint16x4_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x2, _f16_x2 and _bf16_x2 - void svstnt1_vnum[_u16_x2](svcount_t png, uint16_t *rn, int64_t vnum, - svuint16x2_t zt) - __arm_streaming; - - - // Variants are also available for _s16_x4, _f16_x4 and _bf16_x4 - void svstnt1_vnum[_u16_x4](svcount_t png, uint16_t *rn, int64_t vnum, - svuint16x4_t zt) - __arm_streaming; - - - // Variants are also available for _s32_x2 and _f32_x2 - void svstnt1[_u32_x2](svcount_t png, uint32_t *rn, svuint32x2_t zt) - __arm_streaming; - - - // Variants are also available for _s32_x4 and _f32_x4 - void svstnt1[_u32_x4](svcount_t png, uint32_t *rn, svuint32x4_t zt) - __arm_streaming; - - - // Variants are also available for _s32_x2 and _f32_x2 - void svstnt1_vnum[_u32_x2](svcount_t png, uint32_t *rn, int64_t vnum, - svuint32x2_t zt) - __arm_streaming; - - - // Variants are also available for _s32_x4 and _f32_x4 - void svstnt1_vnum[_u32_x4](svcount_t png, uint32_t *rn, int64_t vnum, - svuint32x4_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x2 and _f64_x2 - void svstnt1[_u64_x2](svcount_t png, uint64_t *rn, svuint64x2_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x4 and _f64_x4 - void svstnt1[_u64_x4](svcount_t png, uint64_t *rn, svuint64x4_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x2 and _f64_x2 - void svstnt1_vnum[_u64_x2](svcount_t 
png, uint64_t *rn, int64_t vnum, - svuint64x2_t zt) - __arm_streaming; - - - // Variants are also available for _s64_x4 and _f64_x4 - void svstnt1_vnum[_u64_x4](svcount_t png, uint64_t *rn, int64_t vnum, - svuint64x4_t zt) - __arm_streaming; - ``` - #### LDR, STR Spill and fill of ZT0 @@ -11997,62 +11576,11 @@ Move multi-vectors to/from ZA __arm_streaming __arm_inout("za"); ``` -#### PTRUE - -Initialise predicate-as-counter to all active or all inactive. - -``` c - // Variants are also available for _c16, _c32 and _c64 - svcount_t svptrue_c8() __arm_streaming; - - - svcount_t svpfalse_c(void) __arm_streaming_compatible; -``` - - -#### PEXT - -Transform a predicate-as-counter to a predicate (pair). - -``` c - // Variants are also available for _c16, _c32 and _c64 - svbool_t svpext_lane_c8(svcount_t pnn, uint64_t imm) __arm_streaming; - - - // Variants are also available for _c16, _c32 and _c64 - svboolx2_t svpext_lane_c8_x2(svcount_t pnn, uint64_t imm) __arm_streaming; - ``` - -#### PSEL - -Predicate select between predicate value or all-false - -``` c - // Variants are also available for _c16, _c32 and _c64 - svcount_t svpsel_lane_c8(svcount_t pn, svbool_t pm, uint32_t idx) - __arm_streaming_compatible; - ``` - -#### CNTP - -Set scalar to count from predicate-as-counter. ``vl`` is expected to be 2 or 4. - -``` c - // Variants are also available for _c16, _c32 and _c64 - uint64_t svcntp_c8(svcount_t pnn, uint64_t vl) __arm_streaming; - ``` - #### UCLAMP, SCLAMP, FCLAMP Multi-vector clamp to minimum/maximum vector ``` c - // Variants are also available for _s8, _u8, _s16, _u16, _s32, _u32, _f32, - // _s64, _u64 and _f64 - svfloat16_t svclamp[_f16](svfloat16_t zd, svfloat16_t zn, svfloat16_t zm) - __arm_streaming_compatible; - - // Variants are also available for _single_s8_x2, _single_u8_x2, // _single_s16_x2, _single_u16_x2, _single_s32_x2, _single_u32_x2, // _single_f32_x2, _single_s64_x2, _single_u64_x2 and _single_f64_x2 @@ -12149,11 +11677,6 @@ Multi-vector saturating rounding shift right narrow and interleave __arm_streaming; - // Variants are also available for _u16[_u32_x2] - svint16_t svqrshrn[_n]_s16[_s32_x2](svint32x2_t zn, uint64_t imm) - __arm_streaming_compatible; - - // Variants are also available for _u16[_u64_x4] svint16_t svqrshrn[_n]_s16[_s64_x4](svint64x4_t zn, uint64_t imm) __arm_streaming; @@ -12181,10 +11704,6 @@ Multi-vector saturating rounding shift right unsigned narrow Multi-vector saturating rounding shift right unsigned narrow and interleave ``` c - svuint16_t svqrshrun[_n]_u16[_s32_x2](svint32x2_t zn, uint64_t imm) - __arm_streaming_compatible; - - // Variants are also available for _u16[_s64_x4] svuint8_t svqrshrun[_n]_u8[_s32_x4](svint32x4_t zn, uint64_t imm) __arm_streaming; @@ -12220,66 +11739,6 @@ Multi-vector signed saturating doubling multiply high svint8x4_t svqdmulh[_s8_x4](svint8x4_t zdn, svint8x4_t zm) __arm_streaming; ``` -#### WHILEGE, WHILEGT, WHILEHI, WHILEHS, WHILELE, WHILELO, WHILELS, WHILELT - -While (resulting in predicate-as-counter). ``vl`` is expected to be 2 or 4. 
-
-``` c
-  // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64],
-  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
-  svcount_t svwhilege_c8[_s64](int64_t rn, int64_t rm, uint64_t vl)
-    __arm_streaming;
-
-
-  // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64],
-  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
-  svcount_t svwhilegt_c8[_s64](int64_t rn, int64_t rm, uint64_t vl)
-    __arm_streaming;
-
-
-  // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64],
-  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
-  svcount_t svwhilele_c8[_s64](int64_t rn, int64_t rm, uint64_t vl)
-    __arm_streaming;
-
-
-  // Variants are also available for _c16[_s64], _c32[_s64] _c64[_s64],
-  // _c8[_u64], _c16[_u64], _c32[_u64] and _c64[_u64]
-  svcount_t svwhilelt_c8[_s64](int64_t rn, int64_t rm, uint64_t vl)
-    __arm_streaming;
- ```
-
-While (resulting in predicate tuple)
-
-``` c
-  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
-  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
-  // _b64[_u64]_x2
-  svboolx2_t svwhilege_b8[_s64]_x2(int64_t rn, int64_t rm)
-    __arm_streaming_compatible;
-
-
-  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
-  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
-  // _b64[_u64]_x2
-  svboolx2_t svwhilegt_b8[_s64]_x2(int64_t rn, int64_t rm)
-    __arm_streaming_compatible;
-
-
-  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
-  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
-  // _b64[_u64]_x2
-  svboolx2_t svwhilele_b8[_s64]_x2(int64_t rn, int64_t rm)
-    __arm_streaming_compatible;
-
-
-  // Variants are also available for _b16[_s64]_x2, _b32[_s64]_x2,
-  // _b64[_s64]_x2, _b8[_u64]_x2, _b16[_u64]_x2, _b32[_u64]_x2 and
-  // _b64[_u64]_x2
-  svboolx2_t svwhilelt_b8[_s64]_x2(int64_t rn, int64_t rm)
-    __arm_streaming_compatible;
- ```
-
 #### SUNPK, UUNPK
 
 Multi-vector pack/unpack
 
From ab72e2b65202fcf1731c6ed664b7fbb07941e9e7 Mon Sep 17 00:00:00 2001
From: Caroline Concatto
Date: Fri, 5 Apr 2024 13:46:53 +0000
Subject: [PATCH 3/4] Address review comments

---
 main/acle.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/main/acle.md b/main/acle.md
index 6922382d..ecc078e5 100644
--- a/main/acle.md
+++ b/main/acle.md
@@ -8776,8 +8776,8 @@ The functions in this section are defined by the header file
 Some instructions overlap with the SME and SME2 architecture extensions and
 are additionally available in Streaming SVE mode when `__ARM_FEATURE_SME`
 is non-zero or `__ARM_FEATURE_SME2` is non-zero.
-For convenience, these the intrinsics for these instructions are listed in
-the following section.
+For convenience, the intrinsics fo these instructions are listed in the
+ following section.
 
 #### Multi-vector predicates
 
@@ -11835,16 +11835,16 @@ are named after. All of the functions have external linkage.
 
 ### SVE2.1 and SME2 instruction intrinsics
 
+The functions in this section are defined by either the header file
+[`<arm_sve.h>`](#arm_sve.h) or [`<arm_sme.h>`](#arm_sme.h)
+when `__ARM_FEATURE_SVE2p1` or `__ARM_FEATURE_SME2` is defined, respectively.
+
 These intrinsics can only be called from non-streaming code if
 `__ARM_FEATURE_SVE2p1` is defined. They can only be called from streaming code
-if the appropriate SME feature macro is defined (see next paragraph).
+if the appropriate SME feature macro is defined (see previous paragraph).
 They can only be called from streaming-compatible code if they could be called
 from both non-streaming code and streaming code.
-
-The functions in this section are defined by either the header file
-[`<arm_sve.h>`](#arm_sve.h) or [`<arm_sme.h>`](#arm_sme.h)
-when `__ARM_FEATURE_SVE2p1` or `__ARM_FEATURE_SME2` is defined, respectively.
-
 Most functions in this section are SME2 or SVE2.1. However, some are also
 available in SME. For convenience, the ones available in SME are tagged
 with `[SME]`.

From 19c0a2b1b42528b46b78c3bc3d0c17685b117c08 Mon Sep 17 00:00:00 2001
From: Caroline Concatto
Date: Fri, 12 Apr 2024 11:01:14 +0000
Subject: [PATCH 4/4] Fix typo

---
 main/acle.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/main/acle.md b/main/acle.md
index ecc078e5..5396919b 100644
--- a/main/acle.md
+++ b/main/acle.md
@@ -8776,7 +8776,7 @@ The functions in this section are defined by the header file
 Some instructions overlap with the SME and SME2 architecture extensions and
 are additionally available in Streaming SVE mode when `__ARM_FEATURE_SME`
 is non-zero or `__ARM_FEATURE_SME2` is non-zero.
-For convenience, the intrinsics fo these instructions are listed in the
+For convenience, the intrinsics for these instructions are listed in the
  following section.
 
 #### Multi-vector predicates