A64 SIMD Instruction List: SVE Instructions

This is inspired by and based on the x86/x64 SIMD Instruction List by Daytime.

This is not an official reference, and may contain mistakes. It is intended to make it easier to find instructions, and to provide an alternative perspective. While writing SVE code, please refer to the Arm® Exploration Tools, Arm® ARM, or Arm® Intrinsics Reference.

Merging and zeroing predication is typically omitted from the diagrams, but it is shown in operations like BRKN and LD1RQB that use the /Z syntax but have unusual semantics.
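As a rough illustration of the two ordinary predication modes mentioned above, the following is a scalar C sketch, not SVE code. The names `LANES`, `add_merging` and `add_zeroing` are hypothetical, one `bool` per lane stands in for the governing predicate, and the fixed lane count is an assumption made only for the example (real vector lengths are implementation-defined).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative lane count only; this is an assumption, not a real SVE vector length. */
enum { LANES = 4 };

/* Merging predication (/M): inactive lanes keep the value the destination already held. */
static void add_merging(uint32_t dst[LANES], const bool pg[LANES],
                        const uint32_t a[LANES], const uint32_t b[LANES]) {
    for (size_t i = 0; i < LANES; i++) {
        if (pg[i]) {
            dst[i] = a[i] + b[i];
        }
    }
}

/* Zeroing predication (/Z): inactive lanes of the destination are set to zero. */
static void add_zeroing(uint32_t dst[LANES], const bool pg[LANES],
                        const uint32_t a[LANES], const uint32_t b[LANES]) {
    for (size_t i = 0; i < LANES; i++) {
        dst[i] = pg[i] ? a[i] + b[i] : 0;
    }
}
```

Operations such as BRKN and LD1RQB do not follow this simple per-lane model, which is why their diagrams show the predicate explicitly.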

This is an ongoing project: dark-red links are missing full descriptions, and bright-red links are also missing diagrams and instead link to the documentation in the Exploration Tools.

Report mistakes or send feedback.

Target

Note: the target filter does not support filtering by vector length, so some unavailable operations may still appear available after a preset is selected.

Warning: the target filter allows contradictory and invalid configurations.

Filter options: SVE Version, SME Version, Mode, Extensions, Presets

Move

128-bit 64-bit 32-bit 16-bit 8-bit
zip
unzip
transpose
broadcast
svdup_lane[_{s,u,f}64]
svdup_lane[_{s,u,f}32]
svdup_lane[_{s,u,f,bf}16]
reverse vector
svrev[_{s,u,f}64]
svrev[_{s,u,f}32]
svrev[_{s,u,f,bf}16]
reverse within elements
extract
compact
table lookup (shuffle)
splice
extract last active element
broadcast last active element
extract element after last active
broadcast element after last active
insert
make linear sequence
select
move to predicate
move from predicate

Conversions

from \ to Integer Floating-Point
64-bit 32-bit 16-bit 8-bit double single half BFloat16
Integer 64-bit
32-bit
16-bit
8-bit
Floating-Point double
single
half

Data Processing

Arithmetic Operations

Integer Floating-Point
64-bit 32-bit 16-bit 8-bit double single half BFloat16
add
add (half-width element)
add (double-width result)
add (narrowing, high part)
average (halving add)
add reduction
pairwise add
add pairs to double-width accumulator
add with carry
add with left shift / extend
svadrb[_u64base]_[{s,u}64]offset
svadrb[_u32base]_[{s,u}32]offset
sub
sub (half-width element)
sub (double-width result)
halving sub
sub (narrowing, high part)
sub with carry
mul
mul (high)
mul (double-width result)
div
neg
abs
clamp
min
pairwise min
min reduction
max
pairwise max
max reduction
fused multiply add / sub
negated fused multiply add / sub
fused multiply add / sub (double-width result)
matrix multiply-add
dot product (multiply-add)
absolute difference
absolute difference (double-width result)
add absolute difference
add absolute difference (double-width accumulator)
multiply nth power of 2
round
square root
reciprocal square root
reciprocal
trigonometric acceleration
exponential acceleration
normalization
log base 2 (integer)
doubling mul high
doubling mul and add high
doubling mul and sub high
doubling mul (double-width result)
doubling mul and add (double-width result)
doubling mul and sub (double-width result)

Complex Arithmetic

Integer Floating-Point
64-bit 32-bit 16-bit 8-bit double single half
complex add
complex multiply-add
complex doubling mul and add high
complex dot-product

Bitwise Operations

64-bit 32-bit 16-bit 8-bit
and
svand[_{s,u}{8,16,32,64}]_x
or
svorr[_{s,u}{8,16,32,64}]_x
xor
sveor[_{s,u}{8,16,32,64}]_x
not
and not
svbic[_{s,u}{8,16,32,64}]_x
3-way xor
and not, xor
bitwise select
and reduction
or reduction
xor reduction

Bit Shift / Rotate

64-bit 32-bit 16-bit 8-bit
shift left
shift right logical
shift right arithmetic
shift right arithmetic, rounding towards zero
bidirectional shifts
narrowing right shift
interleaving narrowing right shift
widening shift left
shift right and add
shift right and insert
shift left and insert
xor and rotate

Other Logical Operations

64-bit 32-bit 16-bit 8-bit
logical not
count leading zeros
count leading sign bits
count non-zero bits
reverse bits
bit deposit
bit extract
bit group
polynomial multiply
widening polynomial multiply
svpmullb[_u64]
svpmullb_pair[_u32]
svpmullt[_u64]
svpmullt_pair[_u32]
svpmullb[_u16]
svpmullb_pair[_u8]
svpmullt[_u16]
svpmullt_pair[_u8]

String

64-bit 32-bit 16-bit 8-bit
find matching elements
find non-matching elements
count matching elements (within 128-bit segments)
count matching elements (prefix-inclusive)

Predication

Compare

Integer Floating-Point
64-bit 32-bit 16-bit 8-bit double single half
compare for ==
compare for !=
compare for <
compare for >
compare for ≤
compare for ≥
compare for unordered
compare absolute value for >
compare absolute value for ≥

Predicate Operations

64-bit 32-bit 16-bit 8-bit
set predicate register
zero predicate register
zip
unzip
transpose
reverse
unpack
extract predicate from predicate-as-counter

Predicate Bitwise Operations

64-bit 32-bit 16-bit 8-bit
test
svptest_any
svptest_first
svptest_last
and
or
xor
and not
or not
not and
not or
select

Predicate Loop Control Operations

64-bit 32-bit 16-bit 8-bit
while < (signed)
while > (signed)
while ≤ (signed)
while ≥ (signed)
while < (unsigned)
while > (unsigned)
while ≤ (unsigned)
while ≥ (unsigned)
while no read-after-write conflict
svwhilerw[_{s,u,f,bf}16]
while no write-after-read/write conflict
svwhilewr[_{s,u,f,bf}16]
break after
break after (propagating)
break before
break before (propagating)
propagate break
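To show how the "while <" family is typically used to drive a loop, here is a scalar C sketch rather than SVE code. `LANES`, `while_lt` and `add_one` are hypothetical names, the lane count is fixed only for illustration, and the model assumes `start + i` stays well away from the limits of `int64_t`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative lane count only; this is an assumption, not a real SVE vector length. */
enum { LANES = 4 };

/* Model of "while < (signed)": lane i is active while (start + i) < limit.
   Assumes start + i cannot overflow int64_t. */
static void while_lt(bool pg[LANES], int64_t start, int64_t limit) {
    for (size_t i = 0; i < LANES; i++) {
        pg[i] = (start + (int64_t)i) < limit;
    }
}

/* Add 1 to each of the first n elements, LANES at a time, with no scalar tail loop:
   the predicate for the final iteration simply has fewer active lanes. */
static void add_one(uint32_t *data, int64_t n) {
    bool pg[LANES];
    for (int64_t i = 0; i < n; i += LANES) {
        while_lt(pg, i, n);
        for (size_t lane = 0; lane < LANES; lane++) {
            if (pg[lane]) {
                data[i + (int64_t)lane] += 1;
            }
        }
    }
}
```

The point of the sketch is that the last, partial iteration needs no special handling because inactive lanes are never read or written.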

Predicate Iteration Operations

64-bit 32-bit 16-bit 8-bit
find first active element
find next active element

Other Predicate Operations

64-bit 32-bit 16-bit 8-bit
and with indexed bit
count predicate
increment by predicate count
svqincp[_n_s64]_b64
svqincp[_n_u32]_b64
svqincp[_n_u64]_b64
svqincp[_n_s64]_b32
svqincp[_n_u32]_b32
svqincp[_n_u64]_b32
svqincp[_n_s64]_b16
svqincp[_n_u32]_b16
svqincp[_n_u64]_b16
decrement by predicate count
svqdecp[_n_s64]_b64
svqdecp[_n_u32]_b64
svqdecp[_n_u64]_b64
svqdecp[_n_s64]_b32
svqdecp[_n_u32]_b32
svqdecp[_n_u64]_b32
svqdecp[_n_s64]_b16
svqdecp[_n_u32]_b16
svqdecp[_n_u64]_b16

Predicate Constraint Operations

64-bit 32-bit 16-bit 8-bit
count predicate constraint
increment by predicate constraint count
decrement vector by predicate constraint count

Loads and Stores

Contiguous Loads

128-bit 64-bit 32-bit 16-bit 8-bit
load (unpredicated)
load (predicated)
load (predicated by counter)
load and deinterleave
load and broadcast
load and replicate 128-bit segment
load and replicate 256-bit segment

Gathers

register lane size
128-bit 64-bit 32-bit
memory element size
128-bit
64-bit
svld1_gather[_u64base]_offset_{s,u,f}64
svld1_gather_[{s,u}64]offset[_{s,u,f}64]
32-bit
svld1uw_gather[_u64base]_offset_{s,u}64
svld1uw_gather_[{s,u}64]offset_{s,u}64
svld1sw_gather[_u64base]_offset_{s,u}64
svld1sw_gather_[{s,u}64]offset_{s,u}64
svld1_gather[_u32base]_offset_{s,u,f}32
16-bit
svld1uh_gather[_u64base]_offset_{s,u}64
svld1uh_gather_[{s,u}64]offset_{s,u}64
svld1sh_gather[_u64base]_offset_{s,u}64
svld1sh_gather_[{s,u}64]offset_{s,u}64
svld1uh_gather[_u32base]_offset_{s,u}32
svld1sh_gather[_u32base]_offset_{s,u}32
8-bit
svld1ub_gather[_u64base]_offset_{s,u}64
svld1ub_gather_[{s,u}64]offset_{s,u}64
svld1sb_gather[_u64base]_offset_{s,u}64
svld1sb_gather_[{s,u}64]offset_{s,u}64
svld1ub_gather[_u32base]_offset_{s,u}32
svld1sb_gather[_u32base]_offset_{s,u}32

Contiguous Stores

128-bit 64-bit 32-bit 16-bit 8-bit
store (unpredicated)
store (predicated)
store (predicated by counter)
store and interleave

Scatters

register lane size
128-bit 64-bit 32-bit
memory element size
128-bit
64-bit
svst1_scatter[_u64base]_offset[_{s,u,f}64]
svst1_scatter_[{s,u}64]offset[_{s,u,f}64]
32-bit
svst1w_scatter[_u64base]_offset[_{s,u}64]
svst1w_scatter_[{s,u}64]offset[_{s,u}64]
svst1_scatter[_u32base]_offset[_{s,u,f}32]
16-bit
svst1h_scatter[_u64base]_offset[_{s,u}64]
svst1h_scatter_[{s,u}64]offset[_{s,u}64]
svst1h_scatter[_u32base]_offset[_{s,u}32]
8-bit
svst1b_scatter[_u64base]_offset[_{s,u}64]
svst1b_scatter_[{s,u}64]offset[_{s,u}64]
svst1b_scatter[_u32base]_offset[_{s,u}32]

First/Non-Faulting Loads

64-bit 32-bit 16-bit 8-bit
load first-fault
load non-fault
gather first-fault
svldff1ub_gather[_u64base]_offset_{s,u}64
svldff1uh_gather[_u64base]_offset_{s,u}64
svldff1uw_gather[_u64base]_offset_{s,u}64
svldff1_gather[_u64base]_offset_{s,u,f}64
svldff1ub_gather_[{s,u}64]offset_{s,u}64
svldff1sb_gather_[{s,u}64]offset_{s,u}64
svldff1uh_gather_[{s,u}64]offset_{s,u}64
svldff1sh_gather_[{s,u}64]offset_{s,u}64
svldff1uw_gather_[{s,u}64]offset_{s,u}64
svldff1sw_gather_[{s,u}64]offset_{s,u}64
svldff1_gather_[{s,u}64]offset[_{s,u,f}64]
svldff1ub_gather[_u32base]_offset_{s,u}32
svldff1uh_gather[_u32base]_offset_{s,u}32
svldff1_gather[_u32base]_offset_{s,u,f}32
read first-fault register
write first-fault register
svsetffr

Other Memory Operations

64-bit 32-bit 16-bit 8-bit
load non-temporal
store non-temporal
load non-temporal (predicated by counter)
store non-temporal (predicated by counter)
gather non-temporal
svldnt1ub_gather_[{s,u}64]offset_{s,u}64
svldnt1ub_gather[_u64base]_offset_{s,u}64
svldnt1sb_gather_[{s,u}64]offset_{s,u}64
svldnt1sb_gather[_u64base]_offset_{s,u}64
svldnt1uh_gather_[{s,u}64]offset_{s,u}64
svldnt1uh_gather[_u64base]_offset_{s,u}64
svldnt1sh_gather_[{s,u}64]offset_{s,u}64
svldnt1sh_gather[_u64base]_offset_{s,u}64
svldnt1uw_gather_[{s,u}64]offset_{s,u}64
svldnt1uw_gather[_u64base]_offset_{s,u}64
svldnt1sw_gather_[{s,u}64]offset_{s,u}64
svldnt1sw_gather[_u64base]_offset_{s,u}64
svldnt1_gather_[{s,u}64]offset[_{s,u,f}64]
svldnt1_gather[_u64base]_offset_{s,u,f}64
svldnt1ub_gather_[u32]offset_{s,u}32
svldnt1ub_gather[_u32base]_offset_{s,u}32
svldnt1sb_gather_[u32]offset_{s,u}32
svldnt1sb_gather[_u32base]_offset_{s,u}32
svldnt1uh_gather_[u32]offset_{s,u}32
svldnt1uh_gather[_u32base]_offset_{s,u}32
svldnt1sh_gather_[u32]offset_{s,u}32
svldnt1sh_gather[_u32base]_offset_{s,u}32
svldnt1_gather_[u32]offset[_{s,u,f}32]
svldnt1_gather[_u32base]_offset_{s,u,f}32
scatter non-temporal
svstnt1b_scatter_[{s,u}64]offset[_{s,u}64]
svstnt1b_scatter[_u64base]_offset[_{s,u}64]
svstnt1h_scatter_[{s,u}64]offset[_{s,u}64]
svstnt1h_scatter[_u64base]_offset[_{s,u}64]
svstnt1w_scatter_[{s,u}64]offset[_{s,u}64]
svstnt1w_scatter[_u64base]_offset[_{s,u}64]
svstnt1_scatter_[{s,u}64]offset[_{s,u,f}64]
svstnt1_scatter[_u64base]_offset[_{s,u,f}64]
svstnt1b_scatter_[u32]offset[_{s,u}32]
svstnt1b_scatter[_u32base]_offset[_{s,u}32]
svstnt1h_scatter_[u32]offset[_{s,u}32]
svstnt1h_scatter[_u32base]_offset[_{s,u}32]
svstnt1_scatter_[u32]offset[_{s,u,f}32]
svstnt1_scatter[_u32base]_offset[_{s,u,f}32]
prefetch (gather)
prefetch (contiguous)

Others

AES Operations

Perform a single round of AES encryption
Perform a single round of AES decryption
Perform a single round of the AES "mix columns" transformation
Perform a single round of the AES "inverse mix columns" transformation

SM4 Operations

Perform four rounds of SM4 encryption
Derive four rounds of SM4 key values

SHA3 Operations

Rotate 64-bit values left by 1 bit, then xor
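The per-lane arithmetic of this operation can be sketched in portable scalar C as below. The helper names `rol64_by_1` and `rax1_lane` are hypothetical, and the operand order (the second operand is the one rotated) is an assumption that should be checked against the Arm documentation.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by one bit. */
static uint64_t rol64_by_1(uint64_t x) {
    return (x << 1) | (x >> 63);
}

/* One 64-bit lane: xor the first operand with the rotated second operand.
   Assumption: b is the rotated operand. */
static uint64_t rax1_lane(uint64_t a, uint64_t b) {
    return a ^ rol64_by_1(b);
}
```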

Scalar Operations

Add multiple of vector length in bytes
Add multiple of predicate length in bytes
Get multiple of predicate length in bytes
Compare and terminate loop

Move Prefix

Move operations that may only be used as prefixes to certain instructions

Created by Dougall Johnson, 2023.
Arm is a registered trademark of Arm Limited (or its subsidiaries) in some places.