# hXDP: Efficient Software Packet Processing on FPGA NICs

Marco Spaziani Brunella

Principal Hardware Engineer @ Axbryd



@marcoSpazianiB



# Background

- Network packet processing is ubiquitous





- CPU performance issues
  - Slowdown of Moore's law and Dennard scaling
  - Need to save CPU cycles for things that cannot be done elsewhere
- Welcome to network accelerators!
  - One size doesn't fit all





Figure labels: one accelerator is much smaller and uses 20W; the other takes space and uses >200W.



# Why FPGAs?

- Increasing dep…
- Machine Learning
- 5G radio access

Works shown on the slide:

- **The Microsoft Catapult Project**, Derek Chiou (Partner Hardware Architect, Microsoft; Research Professor, University of Texas at Austin): "All new Microsoft Azure and Bing servers are being deployed with an FPGA […]"
- **FPGA for 5G: Re-configurable Hardware for Next Generation Communication**, Vinay Chamola, Sambit Patra, Neeraj Kumar, and Mohsen Guizani
- **Serving DNNs in Real Time at Datacenter Scale with Project Brainwave**, Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, … (Microsoft Corporation): "Project Brainwave, Microsoft's principal infrastructure for AI serving in real time, accelerates deep neural network (DNN) inferencing in major services such as Bing's intelligent search features […]"

# The problem with FPGA-based NICs

#### **Network Function Logic**

Figure: network function logic (here, a TCP congestion-control state machine) has to be written as HDL code (port and signal declarations), simulated, and synthesized for a board such as the NetFPGA SUME.

### **Programming them is Hard!**



# Making programming easier

Figure labels: Code, Simulation, Synthesis.

- ClickNP [Sigcomm '16], Emu [ATC '17]
  - Expressive
  - Require hardware expertise
- P4 [CCR '14], Domino [Sigcomm '16], FlowBlaze [NSDI '19]
  - NF logic focused
  - Exotic/limited programming model

All the approaches assume that a significant portion of the FPGA is dedicated to networking tasks, consuming a significant amount of HW resources.



# Our approach

1. Take the eBPF infrastructure
   - Packet filter implemented in Linux Kernel 4.18+
   - RISC-inspired in-kernel virtual machine that executes eBPF bytecode
   - In-kernel "Maps" and "Helper Functions"
2. Re-create the same infrastructure on the FPGA
   - VLIW core to execute optimized eBPF bytecode
   - Hardware-based Maps and Helper Functions
3. "Offload" the eBPF execution to the FPGA
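To make "eBPF bytecode" concrete, the following is a minimal sketch that decodes the fixed 8-byte eBPF instruction layout (one opcode byte, 4-bit destination and source registers, a 16-bit offset and a 32-bit immediate). The `Insn` tuple and `decode_insn` function are illustrative names chosen here; they are not part of hXDP or of the kernel.

```python
import struct
from collections import namedtuple

# Illustrative container; field meaning follows the kernel's struct bpf_insn.
Insn = namedtuple("Insn", "opcode dst src off imm")

def decode_insn(raw: bytes) -> Insn:
    """Decode one little-endian 8-byte eBPF instruction."""
    if len(raw) != 8:
        raise ValueError("an eBPF instruction is exactly 8 bytes")
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    # Low nibble of the register byte is dst, high nibble is src.
    return Insn(opcode, regs & 0x0F, regs >> 4, off, imm)

mov = decode_insn(bytes.fromhex("b700000001000000"))    # r0 = 1
exit_ = decode_insn(bytes.fromhex("9500000000000000"))  # exit
```

Both the in-kernel virtual machine and a hardware core have to perform this decoding step before executing an instruction.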





# eXpress DataPath

- One of the many eBPF hooks
  - At the earliest point in the stack
- Avoids kernel bypass
- CPU load scales with traffic load
- Transparent to the host





# XDP program life-cycle





# hardware eXpress DataPath







# Challenges

- hXDP resource occupancy must be small
  - Minimize HW resource requirements
  - Allow designers to fit different accelerators on the FPGA

- hXDP performance must be comparable to that of an x86 CPU
  - Be as fast as a server-grade CPU core
  - FPGAs are clocked at a 5x-10x lower frequency than server CPUs



# Challenge: make it small!

- We assume the FPGA is used for other accelerators
  - Figure labels: Host Machine, CPU, Cache Memory, RAM, PCIe Bus, FPGA NIC, Data-Preprocessing Accelerator

- hXDP Design Principles
  - Keep hardware simple
  - Adapt ISA to simplify HW design and gain performance
  - Move the ILP extraction complexity to the compiler/optimizer



# hXDP resources utilization



NetFPGA Virtex-7 Die

| COMPONENT        | LOGIC       | REGISTERS  | BRAM        |
|------------------|-------------|------------|-------------|
| PIQ              | 215, 0.05%  | 58, <0.01% | 6.5, 0.44%  |
| APS              | 9k, 2.09%   | 10k, 1.24% | 4, 0.27%    |
| Sephirot         | 27k, 6.35%  | 4k, 0.51%  | -           |
| INSTR MEM        | -           | -          | 7.7, 0.51%  |
| STACK            | 1k, 0.24%   | 136, 0.02% | 16, 1.09%   |
| HF SUBSYSTEM     | 339, 0.08%  | 150, 0.02% | -           |
| MAPS SUBSYSTEM   | 5.8k, 1.35% | 2.5k, 0.3% | 16, 1.09%   |
| TOTAL            | 42k, 9.91%  | 18k, 2.09% | 50, 3.40%   |
| W/ REFERENCE NIC | 80k, 18.53% | 63k, 7.3%  | 214, 14.63% |

Table 1

Closed timing @156.25MHz on a NetFPGA-SUME



# Challenge: make it fast!





CPU:

- Clock frequency: 2-4 GHz
- Hardware-enhanced ILP extraction
- Deep pipeline stages
- Specialized iterative execution

FPGA:

- Not suited for any complex ILP hardware\*
- Short pipeline stages
- Killer app: parallel execution



# Filling the gap

- Execute eBPF bytecode in a specialized VLIW CPU

All the complexity of code parallelization is pushed to "compile" time.

To illustrate code optimizations, we will use a simple eBPF UDP tracker program


- Code Optimization
  - eBPF Instruction Set Architecture extension
  - Pruning of unnecessary instructions
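As a rough illustration of what pushing parallelization to compile time can look like, the sketch below greedily packs instructions into VLIW bundles while respecting register dependencies. It is a simplified stand-in, not the hXDP compiler: the `(dst, srcs)` instruction tuples, the `bundle` function and the in-order greedy policy are assumptions made for this example, and the default of 4 lanes is taken from the lane count mentioned in the many-core slide.

```python
LANES = 4  # assumed lane count for this sketch

def bundle(instrs, lanes=LANES):
    """Greedy in-order packing of (dst, srcs) instructions into VLIW bundles.

    An instruction joins the current bundle only if it neither reads nor
    writes a register already written in that bundle, and does not
    overwrite a register that the bundle reads.
    """
    bundles, current, written, read = [], [], set(), set()
    for dst, srcs in instrs:
        conflict = (dst in written or dst in read
                    or any(s in written for s in srcs))
        if current and (conflict or len(current) == lanes):
            bundles.append(current)          # close the bundle, start a new one
            current, written, read = [], set(), set()
        current.append((dst, srcs))
        written.add(dst)
        read.update(srcs)
    if current:
        bundles.append(current)
    return bundles

program = [
    ("r1", ("r6",)),       # r1 = r6
    ("r2", ("r7",)),       # r2 = r7
    ("r3", ("r1", "r2")),  # r3 = r1 + r2 (depends on both moves)
    ("r4", ("r8",)),       # r4 = r8
]
schedule = bundle(program)
```

Here the two independent moves share one bundle, while the dependent add has to wait for the next one. A real scheduler would also reorder instructions and handle memory and control dependencies.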




# Optimizing eBPF: zeroing



### eBPF Bytecode



Unnecessary on hardware  $\rightarrow$  we can provide zeroed memory
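A minimal sketch of such a pruning pass, assuming the hardware hands every program a zeroed stack: stores of the constant 0 to a stack slot that the program has not written yet can be dropped. The tuple-based instruction format (`"st"` for store-immediate, `"stx"` for store-register, `r10` as the frame pointer) and the function name `drop_zeroing` are hypothetical, and the sketch ignores access sizes, overlapping slots and copies of the stack pointer.

```python
def drop_zeroing(instrs):
    """Remove `*(r10 + off) = 0` stores that hit still-untouched stack slots."""
    dirty = set()  # stack offsets the program has already written
    kept = []
    for ins in instrs:
        op = ins[0]
        if op in ("st", "stx") and ins[1] == "r10":
            off = ins[2]
            if op == "st" and ins[3] == 0 and off not in dirty:
                continue  # slot is already zero, so the store is redundant
            dirty.add(off)
        kept.append(ins)
    return kept

prog = [
    ("st", "r10", -8, 0),       # zeroing, removable
    ("st", "r10", -16, 0),      # zeroing, removable
    ("stx", "r10", -8, "r1"),   # real data written to the slot
    ("st", "r10", -8, 0),       # clears live data, must stay
    ("exit",),
]
pruned = drop_zeroing(prog)
```

The last zero store is kept because the slot was overwritten in between, so removing it would change the program's behavior.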



# Optimizing eBPF: 3-operands instructions



Trivial to do in hardware and to recognize at compile time
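One way to picture the compile-time recognition is a peephole pass that fuses a register move followed by a two-operand ALU instruction on the same destination into a single three-operand instruction. The tuple format, the `ALU_OPS` set and the `add3`-style mnemonics below are invented for this sketch, and it assumes no branch targets the second instruction of a fused pair.

```python
ALU_OPS = {"add", "sub", "and", "or", "xor"}

def fuse_three_operand(instrs):
    """Fuse `mov d, a; op d, b` into `op3 d, a, b`."""
    out, i = [], 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (cur[0] == "mov" and nxt is not None
                and nxt[0] in ALU_OPS and nxt[1] == cur[1]):
            dst, a = cur[1], cur[2]
            # After the mov, dst holds a, so `op d, d` means `op a, a`.
            b = a if nxt[2] == dst else nxt[2]
            out.append((nxt[0] + "3", dst, a, b))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

prog = [
    ("mov", "r4", "r2"),   # r4 = r2
    ("add", "r4", "r3"),   # r4 += r3
    ("mov", "r0", "r4"),   # r0 = r4
]
fused = fuse_three_operand(prog)
```

The first two instructions collapse into one `add3 r4, r2, r3`, saving an instruction slot per fused pair.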



# Optimizing eBPF: Boundary Checks





# Extending eBPF: 6B load/store

| Opcode constant | Value (`std_logic_vector(7 downto 0)`) | Hex  |
|-----------------|----------------------------------------|------|
| LDX48_OPC       | "01011001"                             | 0x59 |
| LDXW_OPC        | "01100001"                             | 0x61 |
| LDXH_OPC        | "01101001"                             | 0x69 |
| LDXB_OPC        | "01110001"                             | 0x71 |
| LDXDW_OPC       | "01111001"                             | 0x79 |
| STX48_OPC       | "01010010"                             | 0x52 |
| ST48_OPC        | "01011010"                             | 0x5a |

6 Bytes loads & stores
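The sketch below cross-checks the binary and hexadecimal opcode values listed on this slide and, for the four standard eBPF loads, splits the opcode byte into its class, size and mode fields (3, 2 and 3 bits in the standard eBPF load/store encoding). The dictionary and helper names are illustrative; the fields of the 48-bit opcodes are not interpreted, since the slide does not explain how hXDP assigns them.

```python
# Opcode values copied from the slide: (binary string, hex value).
HXDP_OPCODES = {
    "LDX48": ("01011001", 0x59),
    "LDXW":  ("01100001", 0x61),
    "LDXH":  ("01101001", 0x69),
    "LDXB":  ("01110001", 0x71),
    "LDXDW": ("01111001", 0x79),
    "STX48": ("01010010", 0x52),
    "ST48":  ("01011010", 0x5A),
}

# Standard eBPF size field values (bits 3-4 of the opcode byte).
SIZE_NAMES = {0x00: "W", 0x08: "H", 0x10: "B", 0x18: "DW"}

def fields(opcode):
    """Split a load/store opcode byte into (class, size, mode)."""
    return opcode & 0x07, opcode & 0x18, opcode & 0xE0

consistent = all(int(bits, 2) == value for bits, value in HXDP_OPCODES.values())

decoded = {}
for name in ("LDXW", "LDXH", "LDXB", "LDXDW"):
    cls, size, mode = fields(HXDP_OPCODES[name][1])
    decoded[name] = (cls, SIZE_NAMES[size], mode)
```

For the standard loads, the class is `0x01` (LDX) and the mode is `0x60` (MEM); only the size field differs between the four opcodes.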



# Optimizing eBPF: exit compression

C source: `EOP: return XDP_DROP;`

| Instr. | Bytecode                   | Disassembly |
|--------|----------------------------|-------------|
| 70:    | bf 60 00 00 00 00 00 00 00 | r0 = 1      |
| 71:    | 95 00 00 00 00 00 00 00    | exit        |

Define per-action exit



exit\_drop
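Exit compression can be sketched as a peephole pass that replaces the pair "load an XDP action code into `r0`, then `exit`" with one per-action exit instruction. The action codes are the standard XDP return values; the tuple format, the function name and every mnemonic other than `exit_drop` are assumptions made by analogy for this example.

```python
# Standard XDP return codes mapped to illustrative mnemonic suffixes.
XDP_ACTIONS = {0: "aborted", 1: "drop", 2: "pass", 3: "tx", 4: "redirect"}

def compress_exit(instrs):
    """Replace `r0 = <action>; exit` with a single per-action exit."""
    out, i = [], 0
    while i < len(instrs):
        cur = instrs[i]
        nxt = instrs[i + 1] if i + 1 < len(instrs) else None
        if (cur[0] == "mov_imm" and cur[1] == "r0"
                and cur[2] in XDP_ACTIONS and nxt == ("exit",)):
            out.append(("exit_" + XDP_ACTIONS[cur[2]],))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out

prog = [
    ("mov_imm", "r6", 5),   # unrelated instruction, kept as-is
    ("mov_imm", "r0", 1),   # r0 = XDP_DROP
    ("exit",),
]
compressed = compress_exit(prog)
```

Each compressed exit saves one instruction on every return path of the program.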



### Impact of Code optimizations on original eBPF bytecode




$$\%\ gain = 100 \times \frac{\#_{original\ instr} - \#_{optimized\ instr}}{\#_{optimized\ instr}}$$
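A short worked example of the gain formula above, using made-up instruction counts (60 original, 48 optimized) that are not taken from the talk's measurements:

```python
def pct_gain(n_original, n_optimized):
    """Percentage gain as defined on the slide (relative to the optimized count)."""
    return 100 * (n_original - n_optimized) / n_optimized

gain = pct_gain(60, 48)  # hypothetical counts: 100 * 12 / 48
```

Because the denominator is the optimized instruction count, removing 12 of 60 instructions counts as a 25% gain rather than a 20% reduction.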

# Overall gain: optimization + ILP



# Performance evaluation: Microbenchmarks



Baseline tput measurements for basic XDP programs



Forwarding tput when calling a helper function



Impact of map accesses on forwarding tput



Packet generation: ~60 Mpps (64B UDP packets)

# Performance evaluation: Linux XDP programs



hXDP@156.25MHz has comparable performance to an x86@2.1GHz for programs that live entirely in the NIC



# Performance evaluation: Real-world applications



### hXDP@156.25MHz outperforms an x86@2.1GHz



Packet generation: ~60 Mpps (64B UDP packets)

# Performance evaluation: Latency Measurements





# What's new since Nov-2020?

- Swapped the platform
  - We're actively developing on a Xilinx Alveo U50



- Benefits
  - Bumped from 156.25MHz to 250MHz
  - Exploiting new URAMs on the Virtex UltraScale+ for bigger maps
  - Backing URAMs with High Bandwidth Memory for huge maps 🙂
  - Tighter host interaction thanks to Corundum's driver!




# Many-core design?



- Unidirectional NAT use-case
  - Avoids shared Maps
- From 4 to 2 lanes per CPU
  - Working on DLP rather than ILP
- Closed the design with 4 CPUs
- Trying to have 8 🙂
- With 4 cores, we're at 5.6x

# Conclusion

- hardware eXpress DataPath
  - eBPF infrastructure on FPGA NICs
- Benefits
  - Executes unmodified eBPF programs
  - Low Hardware resources
  - Frees up CPU cores with similar performance at 10x lower latency



# Next Steps

- Compiler
  - Reorder memory access instructions to improve ILP
- Hardware Parser
  - Offload large sections of eBPF programs to dedicated HW block
- Huge Maps
  - Billions of entries  $\rightarrow$  Transparent hierarchy of all the memory resources on the FPGA
- ASIC
  - Fixed functionalities (e.g. Sephirot)  $\rightarrow$  put them into custom silicon







# Axbryd Building a lighter world

