## Scalable vector multimedia optimisations RISC-V V and ARM SVE2 extensions introduction

Rémi Denis-Courmont

Remlab Tmi

Ixelles, Belgium, 4th February 2023

# Outline



- 2 From fixed-sized to variable-length
- 3 ARM Scalable Vector Extension



| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| ●○        | 0000    | 00000           |         | 000000 | oo  |
|           |         |                 |         |        |     |

## Attendees advisory

#### Disclaimer

# The opinions expressed therein solely represent the personal views of the author.

| Forewords | History | Variable length | ARM SVE | <b>RVV</b> | End |
|-----------|---------|-----------------|---------|------------|-----|
| ●O        | 0000    | 00000           |         | 000000     | 00  |
|           |         |                 |         |            |     |

## Attendees advisory

### Disclaimer

# The opinions expressed therein solely represent the personal views of the author.

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- I speak fast.
- I do not articulate well.

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| ●O        | 0000    | 00000           |         | 000000 | 00  |
|           |         |                 |         |        |     |

## Attendees advisory

### Disclaimer

The opinions expressed therein solely represent the personal views of the author.

◆□▶ ◆□▶ ◆□▶ ◆□▶ □ ○ ○ ○

- I speak fast.
- I do not articulate well.

If you did not understand...

Do interrupt me if needed!

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| ○●        | 0000    | 00000           |         | 000000 | oo  |
| Who am    | n I?    |                 |         |        |     |

## • 16th FOSDEM attendance (since 2004)...

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| ○●        | 0000    | 00000           |         | 000000 | 00  |
| Who ar    | n I?    |                 |         |        |     |

◆□▶ ◆□▶ ◆臣▶ ◆臣▶ □臣 ○のへ⊙

- 16th FOSDEM attendance (since 2004)...
- 1st FOSDEM presentation!
- Not relevant to this presentation.

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | ●000    | 00000           | 00000   | 000000 | 00  |
| Outline   |         |                 |         |        |     |

▲□▶ ▲□▶ ▲ 三▶ ▲ 三▶ 三三 - のへぐ



- 2 From fixed-sized to variable-length
- ③ ARM Scalable Vector Extension
- A RISC-V Vectors

| Forewords | History<br>0●00 | Variable length<br>00000 | ARM SVE | <b>RVV</b><br>000000 | End<br>00 |
|-----------|-----------------|--------------------------|---------|----------------------|-----------|
|           |                 |                          |         |                      |           |

# What is this?

You may know if older than me.





## Planet of Death

You may know if my age.



◆□▶ ◆□▶ ◆三▶ ◆三▶ 三三 - のへで



• x86

• 64 bits: MMX (1997)





▲□▶ ▲圖▶ ▲匡▶ ▲匡▶ ― 匡 … のへで

## Single Instruction Multiple Data

- x86
  - 64 bits: MMX (1997)
  - 128 bits: SSE (1999)



▲□▶ ▲圖▶ ▲匡▶ ▲匡▶ ― 匡 … のへで

## Single Instruction Multiple Data

- x86
  - 64 bits: MMX (1997)
  - 128 bits: SSE (1999), SSE2 (2000)



- x86
  - 64 bits: MMX (1997)
  - 128 bits: SSE (1999), SSE2 (2000)... AVX (2008)

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

• 256 bits: AVX2 (2011)



#### • x86

- 64 bits: MMX (1997)
- 128 bits: SSE (1999), SSE2 (2000)... AVX (2008)

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

- 256 bits: AVX2 (2011)
- 512 bits: AVX-512 (2013 2017)



#### • x86

- 64 bits: MMX (1997)
- 128 bits: SSE (1999), SSE2 (2000)... AVX (2008)

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

- 256 bits: AVX2 (2011)
- 512 bits: AVX-512 (2013 2017)
- ARM
  - 32 bits: ARMv6 SIMD (2002)



#### • x86

- 64 bits: MMX (1997)
- 128 bits: SSE (1999), SSE2 (2000)... AVX (2008)
- 256 bits: AVX2 (2011)
- 512 bits: AVX-512 (2013 2017)
- ARM
  - 32 bits: ARMv6 SIMD (2002)
  - 128 bits: ARMv7 AdvSIMD, a.k.a. NEON (2005)
  - 128 bits: ARMv8 A64 AdvSIMD, also a.k.a. NEON (2012)

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- RISC-V
  - ENOSYS

Need to rewrite assembler every time.

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | ●0000           |         | 000000 | 00  |
| Outline   |         |                 |         |        |     |



2 From fixed-sized to variable-length

3 ARM Scalable Vector Extension

## 4 RISC-V Vectors

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | ○●○○○           |         | 000000 | 00  |
| Vector    | length  |                 |         |        |     |

▲□▶ ▲□▶ ▲ 三▶ ▲ 三▶ 三 のへぐ

#### Dear CPU, what is your vector length?

## csrr t0, vlenb /\* Vector LENgth in Bytes \*/

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | ⊙●○○○           |         | 000000 | 00  |
| Vector    | length  |                 |         |        |     |

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

#### Dear CPU, what is your vector length?

csrr t0, vlenb /\* Vector LENgth in Bytes \*/

Dear CPU, how many elements can you process?

csrr t0, vlenb

slri t0, t0, #2 /\* 32-bit elements \*/

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 0●000           |         | 000000 | 00  |
| Vector    | length  |                 |         |        |     |

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

#### Dear CPU, what is your vector length?

csrr t0, vlenb /\* Vector LENgth in Bytes \*/

#### Dear CPU, how many elements can you process?

csrr t0, vlenb
slri t0, t0, #2 /\* 32-bit elements \*/

- Write main loop.
- Onroll main loop.
- Oeal with edges.

That is how Clang vectorisation does it...

| Forewords<br>00 | History<br>0000 | Variable length<br>00●00 | ARM SVE | RVV<br>000000 | End<br>00 |
|-----------------|-----------------|--------------------------|---------|---------------|-----------|
| Vector          | length          |                          |         |               |           |
| Possible an     | iswers          |                          |         |               |           |

(ロ)、(型)、(E)、(E)、 E) の(()

• A power of two!

<sup>1</sup>except *embedded* RISC-V

| Forewords<br>00 | History<br>0000 | Variable length<br>00●00 | ARM SVE | RVV<br>000000 | End<br>00 |
|-----------------|-----------------|--------------------------|---------|---------------|-----------|
| Vector          | length          |                          |         |               |           |
| Possible an     | SWORS           |                          |         |               |           |

- A power of two!
- 128 bits: guaranteed minimum<sup>1</sup>.
- 256, 512 bits: silicon designs announced, yet to ship.
- 1024 bits, even 4096 proposed in (RISC-V) simulations.
- 65536 bits: syntactic maximum (RISC-V).



- Not *completely* new concept
- Essential to variable vector length programming model

▲□▶ ▲圖▶ ▲匡▶ ▲匡▶ ― 匡 … のへで

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 000€0           |         | 000000 | 00  |
| Predica   | ation   |                 |         |        |     |

- Not *completely* new concept
- Essential to variable vector length programming model
- Vector of boolean
- Selects loaded/modified/stored elements

| ARM | lv9 exampl | e             |
|-----|------------|---------------|
|     | MOV        | x10, xzr      |
|     | В          | 2f            |
| 1:  |            |               |
|     | •••        |               |
| 2:  | WHILELT    | p0.s, x10, x0 |
|     | B.FIRST    | 1b            |

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 0000●           |         | 000000 | 00  |
| Unrollir  | ng      |                 |         |        |     |

▲□▶ ▲□▶ ▲ 三▶ ▲ 三▶ 三三 - のへぐ

- Ill fit with predication
- Vector processing  $\neq$  SIMD
- Just don't unroll...

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 0000●           |         | 000000 | 00  |
| Unrollir  | ng      |                 |         |        |     |

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- III fit with predication
- Vector processing  $\neq$  SIMD
- Just don't unroll...
- ARM: "SVE streaming mode"
  - Higher latency
  - Larger vectors (potentially)
  - Higher throughput
- No over-alignment required! Yay!

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| Outline   |         |                 |         |        |     |

▲□▶ ▲□▶ ▲ 三▶ ▲ 三▶ 三三 - のへぐ



- 2 From fixed-sized to variable-length
- 3 ARM Scalable Vector Extension

## 4 RISC-V Vectors

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| SVE       |         |                 |         |        |     |

## • Original SVE pretty useless for multimedia.



| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| SVE       |         |                 |         |        |     |

▲□▶ ▲□▶ ▲ 三▶ ▲ 三▶ 三 のへぐ

- Original SVE pretty useless for multimedia.
- SVE2 copies most NEON mnemonics.
- Just insert the predicate register operand!
- Famous last words.

| Forewords | History<br>0000 | Variable length<br>00000 | ARM SVE | RVV<br>000000 | End<br>00 |
|-----------|-----------------|--------------------------|---------|---------------|-----------|
| SVE       |                 |                          |         |               |           |

Pick:

I of 10 WHILEXX instruction: WHILELT, WHILELO, ...

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

- a predicate register,
- **3** the element size: B, H, S or D.
- a branch condition: B.FIRST, B.LAST...

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| SVE       |         |                 |         |        |     |

#### Pick:

I of 10 WHILEXX instruction: WHILELT, WHILELO, ...

▲□▶ ▲□▶ ▲□▶ ▲□▶ ■ ●の00

- a predicate register,
- the element size: B, H, S or D.
- a branch condition: B.FIRST, B.LAST...
  - Remaining elements  $\rightarrow$  Predicate register
  - $\bullet$  Predicate register  $\rightarrow$  Condition flags
  - $\bullet$  Subtracted count  $\rightarrow$  Output GP register

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| SVE       |         |                 |         |        |     |

#### Pick:

I of 10 WHILEXX instruction: WHILELT, WHILELO, ...

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- a predicate register,
- the element size: B, H, S or D.
- a branch condition: B.FIRST, B.LAST...
  - Remaining elements  $\rightarrow$  Predicate register
  - $\bullet$  Predicate register  $\rightarrow$  Condition flags
  - $\bullet$  Subtracted count  $\rightarrow$  Output GP register

Stop pretending AArch64 is a RISC.



▲ロ▶ ▲周▶ ▲ヨ▶ ▲ヨ▶ ヨ のなべ

## Processor feature detection

It would be too easy without it.

- Preprocessor: defined(\_\_ARM\_FEATURE\_SVE2)
- Bare metal: ID\_AA64\*\_EL1 register fields



## Processor feature detection

It would be too easy without it.

- Preprocessor: defined(\_\_ARM\_FEATURE\_SVE2)
- Bare metal: ID\_AA64\*\_EL1 register fields
- Linux: bits from AT\_HWCAP2 auxillary vector entry
  - HWCAP2\_SVE2 is probably what you want
  - HWCAP2\_SVEPMULL
  - HWCAP2\_SVEBITPERM
  - HWCAP2\_SVE2P1

#### Examples

#include <sys/auxv.h>
(getauxval(AT\_HWCAP2) & HWCAP2\_SVE2)



## Processor feature detection

It would be too easy without it.

- Preprocessor: defined(\_\_ARM\_FEATURE\_SVE2)
- Bare metal: ID\_AA64\*\_EL1 register fields
- Linux: bits from AT\_HWCAP2 auxillary vector entry
  - HWCAP2\_SVE2 is probably what you want
  - HWCAP2\_SVEPMULL
  - HWCAP2\_SVEBITPERM
  - HWCAP2\_SVE2P1

#### Examples

#include <sys/auxv.h>
(getauxval(AT\_HWCAP2) & HWCAP2\_SVE2)

• Other OSes: lol



SpecificationsSVE (2016)...





## Specifications

• SVE (2016)... explicitly not intended for multimedia payloads

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

- SVE2 (2019)
- SME / Scalable Matrix Extension (2021)
- Streaming SVE

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| Availat   | oility  |                 |         |        |     |

## Specifications

• SVE (2016)... explicitly not intended for multimedia payloads

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- SVE2 (2019)
- SME / Scalable Matrix Extension (2021)
- Streaming SVE
- Hardware
  - Cortex-X2, Cortex-A510, Cortex-A710
  - Arm DynamlQ-110 cluster (2022)

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| Availat   | oility  |                 |         |        |     |

## Specifications

• SVE (2016)... explicitly not intended for multimedia payloads

- SVE2 (2019)
- SME / Scalable Matrix Extension (2021)
- Streaming SVE
- Hardware
  - Cortex-X2, Cortex-A510, Cortex-A710
  - Arm DynamlQ-110 cluster (2022)
  - Samsung Exynos 2200
  - Qualcomm SM8450 Snapdragon 8 Gen 1

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | ●00000 | 00  |
| Outline   |         |                 |         |        |     |

▲□▶ ▲□▶ ▲ 三▶ ▲ 三▶ 三三 - のへぐ



- 2 From fixed-sized to variable-length
- 3 ARM Scalable Vector Extension



| Forewords | History<br>0000 | Variable length<br>00000 | ARM SVE | RVV<br>0€0000 | End<br>00 |
|-----------|-----------------|--------------------------|---------|---------------|-----------|
|           |                 |                          |         |               |           |

# Predication

Not sure if simpler or more intricate

### Vector configuration

vsetvli t0, a4, e16, m1, ta, ma

- a4 = available elements (input)
- Output operand: t0 = vector length (output)

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- Element size: e16  $\leftrightarrow$  16 bits
- Group size:  $\texttt{m1} \leftrightarrow \texttt{1}$  vector  $\Leftrightarrow$  no grouping
- Tail mode: ta agnostic ⇔ don't care
- Mask mode: ma agnostic ⇔ don't care

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | 00  |
| Registe   | ers     |                 |         |        |     |

- Prefer greatest power-of-two multiple-numbered vectors
   proving and commentation require aligned numbers
  - $\succeq$  grouping and segmentation require aligned numbers

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- FP registers  $\neq$  Vectors

#### Warning

Mind the FP calling conventions!



 Segmented loads & stores up to 8 structures (ARM can do up to 4 only)

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- GP register-strided loads & stores
- ... including negative strides.

#### Example

# Load a *column* of 16-bit samples # at [a0] with pitch a4 in vector v8. vlse16.v v8, (a0), a4



 Segmented loads & stores up to 8 structures (ARM can do up to 4 only)

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- GP register-strided loads & stores
- ... including negative strides.

#### Example

# Load a column of 16-bit samples # at [a0] with pitch a4 in vector v8. vlse16.v v8, (a0), a4

● But... no vector↔vector transpose/zip



# Processor feature detection

#### • Preprocessor:

- Element size:  $\__riscv_v_elen_fp = 32 \text{ or } 64$
- \_\_riscv\_vector  $\Rightarrow$  *elen*  $\geq$  64 bits
- Vector length: \_\_riscv\_zvl{32,64,128,...}b

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

- \_\_riscv\_vector  $\Rightarrow$  VL  $\geq$  128 bits
- Hardware:



# Processor feature detection

#### • Preprocessor:

- Element size:  $\__riscv_v_elen_fp = 32 \text{ or } 64$
- \_\_riscv\_vector  $\Rightarrow$  *elen*  $\geq$  64 bits
- Vector length: \_\_riscv\_zvl{32,64,128,...}b
- \_\_riscv\_vector  $\Rightarrow$  VL  $\geq$  128 bits
- Hardware: DeviceTree cpu node property
- Linux: bit 21 from AT\_HWCAP auxillary vector entry

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

#### Examples

#include <sys/auxv.h>
(getauxval(AT\_HWCAP) & (1U << ('V' - 'A')))</pre>



- Specifications
  - RISC-V "V" Vector extension version 1.0 (ratified 2021)

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

• Not integrated in RISC-V unprivileged specificaton yet



- Specifications
  - RISC-V "V" Vector extension version 1.0 (ratified 2021)

▲ロ ▶ ▲周 ▶ ▲ 国 ▶ ▲ 国 ▶ ● の Q @

- Not integrated in RISC-V unprivileged specificaton yet
- Hardware
  - Open-source designs exist (but...)
  - T-Head (Alibaba): draft version 0.7.1 only so far
  - SiFive: several IPs announced, not sold yet
  - Andes: AX45, not sold yet

| Forewords | History    | Variable length | ARM SVE | RVV    | End |
|-----------|------------|-----------------|---------|--------|-----|
| 00        | 0000       | 00000           | 00000   | 000000 | ●○  |
| Furthe    | r referenc | es              |         |        |     |

▲□▶ ▲□▶ ▲□▶ ▲□▶ □ のQで

- Arm Architecture Reference Manual, ARMv8-A
- Arm SVE supplement
- Arm SME supplement
- RISC-V Vector extension version 1.0.
- FFmpeg source code.

| Forewords | History | Variable length | ARM SVE | RVV    | End |
|-----------|---------|-----------------|---------|--------|-----|
| 00        | 0000    | 00000           |         | 000000 | ⊙●  |
|           |         |                 |         |        |     |

# Any questions?

▲□▶ ▲□▶ ▲ 三▶ ▲ 三▶ 三三 - のへぐ