Notes: AccelNet

January 14, 2025

Microsoft’s AccelNet has been one of my favorite papers. This might be because I spent so much time with it before giving a class presentation, but it also checks all the boxes: a unique insight that cuts through the cost/perf gradient (check), perf gains (check), hyperscale/industry deployment (big check).

It describes their process of pushing VM networking overhead down to the HW: offloading per-VM software switching (VFP) to a bump-in-the-wire FPGA (the Azure SmartNIC). Previously, host-based SDN technologies were responsible for “all virtual networking features,” and as that complexity swells, so does the number of CPU cycles spent on host networking.

They observe that although networking speeds have scaled 40x in the past few years (along with virtual networking complexity), CPU performance has not seen the same growth (this dissonance is what makes the end of Moore’s Law scaling so difficult). SR-IOV was a proposed solution, but it bypasses the entire virtual networking stack. The goal, then, was to find a solution that reduces CPU consumption, maintains HW line rate, and enforces SDN policies.

I’ll record some of my highlights and notes here for reference, but I think this paper is really powerful in its ability to be read through many lenses (as is perhaps typical of industry papers). The economics behind Azure are really fascinating, as are the technical challenges and logistics of such a hyperscale deployment.

Notes

Host SDN networking stacks

  • “increases latency and latency variability” - this is an important point that keeps being alluded to in many papers. Host-based virtualization and (more pointedly, in my case) measurement tools introduce so much variance (tcpdump, etc.). Pushing to the HW is super nice (when it isn’t prohibitive)
  • network I/O to and from physical device performed in host software partition of the hypervisor
    • host has a vSwitch (virtual switch) that implements SDN policies
    • hypervisor copies packet into VM-visible buffer, allowing VM’s OS stack to continue network processing
    • obviously not great compared to a non-virtualized implementation, but necessary for enforcing SDN policies (a rough sketch of this software path follows this list)
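
Not the paper’s code, just a rough mental model of that baseline path: the hypervisor’s vSwitch applies SDN policy to each packet in software and then copies it into a VM-visible buffer, and all of that costs host CPU cycles per packet. The function and buffer names below are made up for illustration.

    # Conceptual model of the baseline software path (hypothetical names, not VFP's API).
    # Every packet crosses the hypervisor's vSwitch in software before the VM sees it.
    from collections import deque

    def apply_sdn_policy(packet):
        # Stand-in for VFP processing: ACLs, NAT, metering, encap/decap, etc.
        # In reality this is layers of match-action rules, each costing CPU cycles.
        return packet

    vm_visible_buffer = deque()  # stand-in for the buffer the VM's OS stack reads from

    def host_receive(packet):
        processed = apply_sdn_policy(packet)   # per-packet host CPU work
        vm_visible_buffer.append(processed)    # copy into VM-visible memory
        # the VM's own network stack continues processing from here

    host_receive(b"\x00" * 64)  # every single packet pays this software tax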

Why FPGAs?

  • ASICs might be faster, but the ‘ASIC cycle’ is too long - and they are not programmable enough
  • As AccelNet was deployed on more and more servers, larger teams began working with the technology; more “hands in the pot” means more features to push, bugs to fix, etc. Programmability and the ability to iterate quickly are super important and worth the speed hit from not using ASICs
    • pre-fab validation only goes so far; business specs are always changing and being able to iterate on the fly is super important
  • embedded CPU cores may offer a richer instruction set but “do not provide scalable performance, especially on single network flows”

SR-IOV - Single Root I/O Virtualization

  • allow direct access to NIC hardware from VM (shared PCIe HW)
  • host connects to Physical Function (PF) and each VM has its own Virtual Function (VF)
  • bypasses host SDN stack (VFP), so NIC has to implement all SDN policies
    • policies change rapidly -> need a solution that “provide[s] software-like programmability while providing hardware-like performance”

Generic Flow Tables (GFT)

  • designed in 2013-2014
  • makes VFP compatible with SR-IOV; the first packet of each flow still incurs CPU overhead, and the table itself still needs to be offloaded to HW
  • single large table that has an entry for every active network flow on host (match-action)
  • GFT flows are defined based on VFP unified flows (UF) and header transposition (HT) action
    • UF: matches unique source and destination (L2/3/4 tuple)
    • HT: how header fields are to be mutated
  • on the first packet in a flow, the packet is sent to VFP on the host to get the corresponding SDN policy; after offloading this compiled policy to the GFT, the remaining packets in the flow are processed on-NIC (a sketch of this split follows this list)
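
A minimal sketch of how I read the GFT slow-path/fast-path split, assuming made-up interfaces (vfp_compile_policy, the dict-based table) rather than anything from the paper:

    # Hypothetical sketch of GFT's slow-path/fast-path split; the types and the
    # vfp_compile_policy() function are my placeholders, not the paper's interfaces.
    from dataclasses import dataclass, field

    # Unified Flow (UF) key: a unique source/destination L2/L3/L4 tuple, e.g.
    # (src_mac, dst_mac, src_ip, dst_ip, proto, src_port, dst_port).

    @dataclass
    class HeaderTransposition:
        # Header Transposition (HT): which header fields get rewritten, and to what.
        rewrites: dict = field(default_factory=dict)

    gft = {}  # one match-action entry per active flow on the host

    def vfp_compile_policy(uf_key):
        # Stand-in for VFP walking its full SDN policy stack in software and
        # compiling the result down to a single per-flow action.
        return HeaderTransposition(rewrites={"dst_ip": "10.0.0.5"})

    def process_packet(uf_key, payload):
        action = gft.get(uf_key)
        if action is None:                       # first packet of the flow: exception
            action = vfp_compile_policy(uf_key)  # expensive software work, done once
            gft[uf_key] = action                 # offload the compiled policy
        return action                            # later packets: handled on-NIC

    flow = ("m1", "m2", "1.2.3.4", "5.6.7.8", "tcp", 1234, 80)
    process_packet(flow, b"SYN")   # slow path through VFP
    process_packet(flow, b"data")  # fast path, GFT hit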

So, AccelNet’s goal is to offload GFT. SR-IOV is promising, but by itself it bypasses the VFP entirely. GFT solves this problem, but needs to be pushed to HW. They define the requirements for the system to inform their HW decision:

  • Don’t burn host CPU cores
  • Maintain host SDN programmability of VFP
    • Offloading every rule to hardware would constrain SDN policy or require hardware logic to be updated every time a new rule was created
    • Even for short flows, there are at least 7-10 packets (including handshakes), so offloading everything after the first packet is still a significant speedup
  • Bench against SR-IOV “upper bound”/“ideal”
  • High single-connection performance
  • Single CPU core cannot achieve peak bandwidth at 40Gb and higher
    • You can, however, break single connections into multiple parallel connections, using multiple threads to spread load across multiple cores, but this requires substantial changes to customer applications; even apps that handle multiple connections don’t necessarily scale well because flows are often bursty.
    • AccelNet wants near-peak bandwidths w/o parallelizing network processing

As the paper notes, at the time of writing a physical core (2 hyperthreads) sold for $0.10-0.11/hr, a maximum potential revenue of around $900/yr and $4500 over the lifetime of a server (servers typically last 3 to 5 years in their datacenters). Even considering that some fraction of cores are unsold at any time and that clouds typically offer customers a discount for committed capacity purchases, using even one physical core for host networking is quite expensive compared to dedicated hardware.
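
The arithmetic behind those figures is easy to check (my own back-of-the-envelope, assuming the 5-year end of the stated server lifetime):

    # Back-of-the-envelope check of the core-economics numbers quoted above.
    hours_per_year = 24 * 365            # 8760
    low, high = 0.10, 0.11               # $/hr for a physical core (2 hyperthreads)
    per_year = (low * hours_per_year, high * hours_per_year)
    per_lifetime = (per_year[0] * 5, per_year[1] * 5)   # assume a 5-year server life
    print(per_year)       # (876.0, 963.6)   -> "around $900/yr"
    print(per_lifetime)   # (4380.0, 4818.0) -> "~$4500 over the server's lifetime"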

SR-IOV is “all or nothing”: if the SmartNIC can’t handle an SDN feature on chip, traffic would have to go back to the host SDN stack, losing all of the benefits of offloading the work

The HW gradient can be understood, coarsely, as: ASICs, multi-core SoCs, FPGAs, and the host CPU.

  • ASIC vendors often add embedded CPU cores to handle new functionality, but this often becomes a bottleneck; also updates come from the manufacturer, so can be very slow
  • Multicore SoC-based NICs use a “sea of embedded CPU cores” to process packets
    • trades some performance to provide much better programmability
    • At speeds of 40GbE and above, the number of cores needed increases significantly (the chip needs extra logic to spread packets across the cores, so it doesn’t just scale linearly)
    • significantly higher latency and variability for packet processing vs ASICs
    • stateful flows are typically mapped to one core/thread, so individual network flow performance doesn’t improve
  • FPGAs ‘balance the performance of ASICs with the programmability of SOC NICs’
    • Very importantly, AccelNet had the benefit of observing another Microsoft project (Catapult) which deployed networked FPGAs and had good results
    • also used some of their work to help deploy at hyperscale

Quick qs

  • Why not DPDK? It reduces the cost of software packet processing significantly, but not enough to even beat out the SoC options.
    • However, they do support DPDK inside the VMs to directly access the SmartNIC; the setup below provides failover to prevent outages during maintenance/upgrade windows
    • the SR-IOV VF is not exposed to the VM as its primary network interface (binding directly to the VF would break flows whenever the VF is revoked for servicing)
    • when VF is up, Hyper-V Network Virtual Service Consumer (NetVSC) marks the VF as its slave (“transparent bonding”)
    • for DPDK apps, they use a fail-safe PMD (Poll Mode Driver) which acts as a bond between the VF PMD and a PMD on the synthetic interface (a toy model of this bonding follows this list)
    • failsafe PMD exposes all of the DPDK APIs so no impact to programmability, just a drop in performance when the VF is down
    • this is harder for RDMA, so they currently let RDMA connections close and fail over to TCP (using the failover path defined above)
  • At MS scale, NRE (non-recurring engineering) costs are amortized and costs become dominated by the price of silicon
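
To make the “transparent bonding” idea concrete, here’s a toy model (deliberately not the real DPDK fail-safe PMD or NetVSC APIs): the application binds to one stable interface, which uses the accelerated VF path whenever it exists and silently falls back to the synthetic software path while the VF is revoked for servicing.

    # Toy model of transparent bonding / fail-safe behavior (not real DPDK or NetVSC APIs).
    from typing import Optional

    class SyntheticPath:
        """Always-present software path through the hypervisor (slower but stable)."""
        def send(self, pkt):
            return "sent via synthetic NIC (host software path)"

    class VFPath:
        """SR-IOV Virtual Function path; may disappear during servicing/upgrades."""
        def send(self, pkt):
            return "sent via VF (hardware fast path)"

    class BondedPort:
        """What the guest application binds to; it never sees the VF come and go."""
        def __init__(self):
            self.synthetic = SyntheticPath()
            self.vf: Optional[VFPath] = VFPath()   # present when the VF is up

        def revoke_vf(self):     # e.g. host is serviced / FPGA is reprogrammed
            self.vf = None

        def restore_vf(self):
            self.vf = VFPath()

        def send(self, pkt):
            path = self.vf if self.vf is not None else self.synthetic
            return path.send(pkt)

    port = BondedPort()
    print(port.send(b"hi"))   # hardware fast path
    port.revoke_vf()          # maintenance window: flows keep working, just slower
    print(port.send(b"hi"))   # software path
    port.restore_vf()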

The AccelNet implementation is to “augment the current NIC functionality with an FPGA” (building a fully custom FPGA NIC was deemed too much work)

  • logically interact with system as a bump-in-the-wire between NIC and ToR switch (a “filter” on the network)
  • also connected to the CPUs (over PCIe); RDMA (Remote Direct Memory Access) is supported as well
  • 2nd generation has onboard NIC (but same logical architecture)
  • keep control plane (VFP) in host, offload data plane processing to FPGA smartNIC
  • a GFT Lightweight Filter (LWF) driver makes the FPGA/NIC pair appear as a single NIC with both SR-IOV and GFT support
    • When GFT doesn’t contain matching rule for a packet, offload hardware will send the packet to the software layer as an Exception Packet (common for 1st packet in flow)
    • FPGA overrides VLAN id tag and forwards to hypervisor’s vPort
    • When the FPGA detects a terminated flow (e.g. TCP FIN/RST), it duplicates the packet and sends a copy to the hypervisor vPort, which deletes the rule from the flow table (sketched below)
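
And here is my own illustrative reading of the exception/termination handling above; the destinations and flag handling are simplified guesses at the behavior, not the hardware interfaces:

    # Illustrative model of the FPGA's per-packet decision path (not the real datapath).
    flow_table = {}   # UF key -> header transposition action, as in the GFT sketch above

    def fpga_process(uf_key, tcp_flags):
        decisions = []
        action = flow_table.get(uf_key)
        if action is None:
            # No matching GFT rule: mark the packet as an Exception Packet (the notes
            # above describe overwriting the VLAN tag) and send it to the hypervisor vPort.
            decisions.append("exception -> hypervisor vPort (VFP will install a rule)")
        else:
            decisions.append("apply HT %s -> deliver to the VM via its VF" % action)
            if tcp_flags & {"FIN", "RST"}:
                # Flow is terminating: duplicate the packet to the hypervisor vPort so
                # software can delete the corresponding rule from the flow table.
                decisions.append("duplicate -> hypervisor vPort (delete flow rule)")
        return decisions

    key = ("1.2.3.4", "5.6.7.8", "tcp", 1234, 80)
    print(fpga_process(key, set()))            # first packet: exception path
    flow_table[key] = {"dst_ip": "10.0.0.5"}   # VFP offloads the compiled policy
    print(fpga_process(key, {"ACK"}))          # steady state: handled entirely on-NIC
    print(fpga_process(key, {"FIN", "ACK"}))   # fast path plus a cleanup notification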

GFT offload - 2 deeply pipelined packet processing units

  • store and forward packet buffer
  • parser
    • “currently supports parsing and acting on up to 3 groups of L2/L3/L4 headers” (9 headers total)
    • seems like very little compared to OpenFlow’s 41 or so match fields; I wonder why this is
  • flow lookup and match
  • L1 cache (SRAM) on chip and L2 cache in DRAM (see the sketch after this list)
    • L1 is direct mapped cache, 2048 flows
    • L2 is 8-way associative cache with support for O(1M) flows
  • flow action
    • action block uses microcode to specify behavior of actions (reconfigurable)
    • doesn’t specify how the match-action processing works (the way previous papers did)
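
To make the cache organization concrete, a simplified lookup model (the 2048-entry direct-mapped L1 and ~1M-flow 8-way L2 sizes come from the notes above; the index hashing, promotion, and eviction details are my guesses):

    # Simplified two-level flow cache: direct-mapped L1 (on-chip SRAM, 2048 entries)
    # backed by an 8-way set-associative L2 (DRAM, on the order of 1M flows).
    # Hashing, L1 promotion, and eviction policy here are guesses, not the paper's.
    L1_SLOTS = 2048
    L2_WAYS = 8
    L2_SETS = (1024 * 1024) // L2_WAYS   # ~1M flows / 8 ways per set

    l1 = [None] * L1_SLOTS               # each slot holds one (key, action) or None
    l2 = [[] for _ in range(L2_SETS)]    # each set holds up to 8 (key, action) entries

    def lookup(key):
        h = hash(key)
        slot = h % L1_SLOTS
        if l1[slot] is not None and l1[slot][0] == key:   # L1 hit: one compare
            return l1[slot][1]
        for k, action in l2[h % L2_SETS]:                 # L2 hit: search the 8 ways
            if k == key:
                l1[slot] = (k, action)                    # promote into L1
                return action
        return None                                       # miss -> exception path

    def insert(key, action):
        h = hash(key)
        l1[h % L1_SLOTS] = (key, action)   # direct-mapped: the new entry evicts the old
        ways = l2[h % L2_SETS]
        if len(ways) >= L2_WAYS:
            ways.pop(0)                    # simple FIFO eviction within the set
        ways.append((key, action))

    insert(("10.0.0.1", "10.0.0.2", 6, 1234, 80), {"dst_ip": "192.0.2.1"})
    print(lookup(("10.0.0.1", "10.0.0.2", 6, 1234, 80)))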

Thoughts

In Firestone’s NSDI presentation, he describes CPUs as “temporal compute” and FPGAs as “spatial compute”. FPGAs are pipeline-parallel: instructions sit in hardware (circuitry) and you “stream” data through it.

A common question is: “aren’t FPGAs much bigger than ASICs?” This makes me think of the RMT paper, which demonstrated that a programmable data plane chip would not be that much larger or more expensive than a fixed-function ASIC. Overall, though, the answer is “not really” (in practice). Much of the chip is already transceivers, memory (SRAM), etc. In fact, ASICs have been including more reconfigurable logic (to provide more programmability) and FPGAs include more custom logic, so the two have been converging to a degree. So, are they a little bit larger than ASICs? Yes, but the reconfigurability was valued more.

The publication of this paper coincides with the Spectre and Meltdown attacks (although it was obviously in the works since 2013-2014). They note that the necessary page table mitigations negatively impacted CPU I/O, but since they were largely bypassing the host, customer networking speeds were largely unaffected.

Another lens for reading the paper is an organizational one. The AccelNet team references Project Catapult a few times and how it was a pretty critical player in AccelNet’s success. This reminded me of Bob Colwell’s The Pentium Chronicles and the interplay between the P5 (Santa Clara) and P6 (Oregon) teams. The Intel case is particularly interesting because the effort was partitioned along the same objective, whereas the Project Catapult/AccelNet divide seems more orthogonal. Regardless, in both cases this organizational split emphasizes the importance of inter-team communication, where processes (like FPGA deployment/programming) are significantly accelerated by the other team’s previous experience.

This is clearly a massive project and, given what’s at stake (Azure’s bottom line), this makes a lot of sense. I like how it brings together so many changes: SDN and HW development, internal/external tooling, changes to the Linux kernel, etc. They code in Verilog, so they can switch the underlying HW implementation at will, and they found the process not so bad. The pushdown to HW introduces a more pronounced HW/SW divide (along the SDN stack), but they also found co-orchestrating software and hardware across Azure subteams to be successful.

Azure Accelerated Networking: SmartNICs in the Public Cloud

Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg, Microsoft

Published: 2018

https://www.usenix.org/conference/nsdi18/presentation/firestone