
Ultra Ethernet Specification Update


As the Ultra Ethernet Consortium (UEC) speeds toward a 1.0 specification, many details have solidified. We are now ready to share more about what the initial release of the groundbreaking Ultra Ethernet standard will include.

Ultra Ethernet encompasses more than a single feature or function. While the Ultra Ethernet Transport (UET) protocol is central to the UEC’s work, Ultra Ethernet includes tuning across many layers to enhance both AI and high-performance computing (HPC) workloads. This blog will review some of the new advances in the following areas:

  • Software 
  • Transport 
  • Congestion Control
  • In-Network Collectives (INCs)
  • Security
  • Link Layer

Ultra Ethernet Software APIs

The software for UET is based on the libfabric v2.0 APIs, extended to support the new UET protocol. To ensure a comprehensive definition and a smooth transition from specification to products, UEC is developing UET reference code running over standard network interface cards (NICs). This code validates correct network operation and demonstrates that UET is well matched to the libfabric APIs.

Several operations are supported at the libfabric layer, including the familiar Send, Write, and Read. “Rendezvous” variants of these operations are provided, where the sender and receiver exchange buffer information, ensuring buffers are available on the receiver before the sender transmits.
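
At the libfabric layer, a minimal C sketch of these three operations might look like the following. It assumes an already-initialized endpoint, a registered local buffer and descriptor, a resolved peer address, and remote RMA parameters (address and key) exchanged out of band; reaping completions from a completion queue is omitted.

    #include <rdma/fabric.h>
    #include <rdma/fi_endpoint.h>
    #include <rdma/fi_rma.h>

    int post_core_ops(struct fid_ep *ep, void *buf, size_t len, void *desc,
                      fi_addr_t peer, uint64_t raddr, uint64_t rkey)
    {
        /* Two-sided Send: matched against a receive posted by the peer. */
        ssize_t rc = fi_send(ep, buf, len, desc, peer, NULL);
        if (rc)
            return (int)rc;

        /* One-sided Write: places data directly into remote memory. */
        rc = fi_write(ep, buf, len, desc, peer, raddr, rkey, NULL);
        if (rc)
            return (int)rc;

        /* One-sided Read: pulls data directly from remote memory. */
        rc = fi_read(ep, buf, len, desc, peer, raddr, rkey, NULL);
        return (int)rc;
    }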

Both tagged and untagged operations are supported. Untagged operations use familiar send/receive semantics where buffers are consumed sequentially. Tagged operations allow communication partners to use a “tag” that names a buffer without requiring the pre-exchange of address information typical with read and write semantics.
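
As a rough illustration of tagged matching, the libfabric calls look like this; the tag value is an application choice, endpoint setup is again assumed, and an ignore mask of 0 requests an exact tag match.

    #include <rdma/fabric.h>
    #include <rdma/fi_tagged.h>

    #define BUF_TAG 0x42ULL  /* application-defined tag naming the buffer */

    /* Receiver: post a buffer named by a tag; no pre-exchange of address
     * information of the kind needed for Read/Write semantics is required. */
    int post_tagged_recv(struct fid_ep *ep, void *buf, size_t len, void *desc)
    {
        return (int)fi_trecv(ep, buf, len, desc, FI_ADDR_UNSPEC,
                             BUF_TAG, 0, NULL);
    }

    /* Sender: the tag routes the message to the matching receive buffer. */
    int post_tagged_send(struct fid_ep *ep, const void *buf, size_t len,
                         void *desc, fi_addr_t peer)
    {
        return (int)fi_tsend(ep, buf, len, desc, peer, BUF_TAG, NULL);
    }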

UET introduces a new type of Send operation, Deferrable Send: an optimistic rendezvous that assumes a buffer is available at the destination, avoiding the round-trip delay of a buffer-availability check. If no buffer is available, the receiver can resume the send operation once one becomes available, reducing reliance on sender timers.
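
The specification defines the actual messages; the sketch below only illustrates the deferral flow described above, with hypothetical type and function names throughout.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical bookkeeping for one deferred operation. */
    struct deferred_send { uint64_t initiator; uint64_t msg_id; size_t len; };

    static void notify_sender(const struct deferred_send *ds, int resume)
    {
        (void)ds; (void)resume;  /* wire I/O elided in this sketch */
    }

    /* A Deferrable Send arrived, but no matching buffer is posted. */
    void on_deferrable_send(struct deferred_send *ds)
    {
        /* Rather than dropping the message and leaning on a sender
         * timer, record the request and tell the sender to pause. */
        notify_sender(ds, 0);
    }

    /* The application later posts a matching buffer. */
    void on_buffer_posted(struct deferred_send *ds)
    {
        notify_sender(ds, 1);  /* receiver resumes the send operation */
    }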

UET

UET embraces Remote Direct Memory Access (RDMA) mechanisms due to their superior performance, providing hardware-centric placement directly from network to host or accelerator memory, known as direct data placement. This bypasses the operating system kernel, ensuring the lowest latency and highest performance. UET takes a clean-slate approach in its RDMA transport protocol to meet the demanding needs of evolving AI and HPC workloads.

UET eliminates connection establishment latency and minimizes persistent connection state. It employs an ephemeral connection approach: no handshake is required to establish a connection before transmission, and connection state can be discarded at the end of a transaction. Peers in UET can begin transmission, establish connection state while communicating, and then discard that state when the transaction completes. As a result, connection state is not required for every possible communicating peer; a pool of connection resources can be shared among all active endpoints.
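
The resource-pooling idea can be sketched with a toy allocator: state is created on first use and released at transaction end, so the table scales with active transactions rather than with every possible peer. The data structures here are illustrative, not from the specification.

    #include <stdint.h>
    #include <string.h>

    #define POOL_SIZE 1024  /* sized for active transactions, not all peers */

    struct conn_state { uint64_t peer; uint64_t next_psn; int in_use; };
    static struct conn_state pool[POOL_SIZE];

    /* Find or create state for a peer; no handshake is involved. */
    struct conn_state *conn_acquire(uint64_t peer)
    {
        struct conn_state *free_slot = NULL;
        for (int i = 0; i < POOL_SIZE; i++) {
            if (pool[i].in_use && pool[i].peer == peer)
                return &pool[i];              /* transaction in flight */
            if (!pool[i].in_use && !free_slot)
                free_slot = &pool[i];
        }
        if (free_slot)                        /* state established on */
            *free_slot = (struct conn_state){ .peer = peer, .in_use = 1 };
        return free_slot;                     /* NULL: pool exhausted */
    }

    /* End of transaction: the slot immediately becomes reusable. */
    void conn_release(struct conn_state *c)
    {
        memset(c, 0, sizeof *c);
    }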

Eliminating connection establishment time and reducing connection state are crucial for several reasons:

  1. Scalability: The need to store connection state in the fast path limits the size of a cluster.
  2. Latency: Legacy protocols that require a connection handshake before data transmission add latency, which particularly hurts AI inference and HPC workloads.
  3. Cost: Maintaining large state tables in hardware is expensive in terms of silicon area and power.

UET is designed to operate well over best-effort networks that may drop packets due to congestion or link errors. Operation over such networks is critical for modern data centers, since it aids scalability and avoids the need to configure and operate a lossless network.

However, UET also performs well over lossless networks, where Priority Flow Control (PFC) ensures that a packet is transmitted to the link partner only when free buffer space is known to be available. To avoid network deadlocks in lossless networks, and to prioritize latency-sensitive control traffic in both lossless and best-effort networks, UET employs two Traffic Classes (TCs). The UET reliability scheme carefully maps packets to TCs and queues to avoid deadlocks between requests and responses in a lossless environment.
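
The deadlock-avoidance idea can be made concrete with an illustrative (not normative) mapping: keeping requests and responses in separate traffic classes breaks the cyclic buffer dependency that PFC pauses could otherwise create.

    /* Illustrative TC selection; the spec defines the real mapping. */
    enum uet_tc { TC_REQUEST = 0, TC_RESPONSE = 1 };
    enum pkt_kind { PKT_REQUEST, PKT_RESPONSE, PKT_CONTROL };

    enum uet_tc select_tc(enum pkt_kind k)
    {
        switch (k) {
        case PKT_RESPONSE:
        case PKT_CONTROL:
            return TC_RESPONSE;  /* acks and latency-sensitive control */
        default:
            return TC_REQUEST;   /* bulk request traffic */
        }
    }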

To enable optimized implementations, Ultra Ethernet offers three different profiles:

  1. AI Base: Targets core AI use cases, including distributed inference and training.
  2. AI Full: Adds functionality for AI use cases, such as Deferrable Send, exact-matching tagged sends, and extended atomic operations.
  3. HPC: Includes advanced tagging semantics, more atomic operations, and full rendezvous support for HPC workloads beyond AI.

These profiles identify subsets of the full functionality defined in the UET specification and are designed to address different use cases with distinct functionality and implementation trade-offs.

UET Congestion Control

UET has added an optimized load-balancing mechanism that overcomes the limitations of equal-cost multipath (ECMP) routing. UET senders spray packets across many paths to the destination, avoiding the ECMP flow-collision problem by loading links much more evenly. Senders then use real-time network congestion information, obtained in a lightweight manner, to optimize spraying, reduce network queues, and achieve high network utilization. If one path is determined to be congested, packets are immediately shifted to another, non-congested path. Congestion information from the complete network is utilized, including data from the endpoints, transit nodes in the fabric, and the last hop, which is particularly impacted by incast events. Incast can occur, for example, when extremely high-bandwidth flows associated with accelerators create transient congestion during collective operations, and when AI workloads create bulk storage traffic.
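
A toy version of the spraying decision, with hypothetical names: paths are visited round-robin, and any path currently flagged as congested by feedback (e.g., RTT or ECN signals) is skipped.

    #include <stdint.h>

    #define NPATHS 16

    static uint8_t congested[NPATHS];  /* set from congestion feedback */
    static unsigned next_path;

    /* Choose the path for the next sprayed packet. */
    unsigned pick_path(void)
    {
        for (unsigned tried = 0; tried < NPATHS; tried++) {
            unsigned p = next_path++ % NPATHS;
            if (!congested[p])
                return p;              /* shift load to a clear path */
        }
        return next_path++ % NPATHS;   /* all congested: degrade evenly */
    }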

UET also includes two newly designed congestion-control approaches. The first is a powerful new sender-based congestion-control scheme that builds upon multiple existing innovative ideas and uses techniques optimized for AI and HPC workloads. The result is an adaptive window-based scheme in which senders coordinate and avoid overloading the receiver and network by adjusting their window size based on measured round-trip time (RTT), ECN markings, and packet loss.
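
In the AIMD spirit the scheme preserves, a window update might be sketched as follows; the constants and structure here are hypothetical, not the spec's algorithm.

    /* Illustrative window-based congestion control reacting to RTT,
     * ECN marks, and loss, per the signals named above. */
    struct cc_state { double window; double target_rtt; };

    void on_ack(struct cc_state *cc, double rtt, int ecn_marked, int lost)
    {
        if (lost || ecn_marked || rtt > cc->target_rtt)
            cc->window *= 0.5;   /* multiplicative decrease under congestion */
        else
            cc->window += 1.0;   /* additive increase when the path is clear */
        if (cc->window < 1.0)
            cc->window = 1.0;    /* always permit some progress */
    }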

The second, optional, congestion-control mechanism is a novel receiver-based scheme in which a sender requests permission to send and the receiving endpoint grants credits, allowing the sender to transmit in a way that does not overwhelm the receiver. This scheme will likely be especially effective at mitigating incast events. This scheme also enables optimistic transmission where limited credits are pre-allocated, and transmission can start immediately, before receiving permission from the receiver.
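
A minimal sketch of the credit-grant side, with hypothetical names: the receiver never grants more than the buffer space it actually has, and a small optimistic allowance lets senders start before the first grant arrives.

    #include <stdint.h>

    #define OPTIMISTIC_BYTES 4096  /* a sender may transmit this much unasked */

    struct receiver { uint64_t free_bytes; };

    /* Handle a sender's request to send "requested" bytes. */
    uint64_t grant_credit(struct receiver *r, uint64_t requested)
    {
        uint64_t grant = requested < r->free_bytes ? requested : r->free_bytes;
        r->free_bytes -= grant;  /* never promise buffers twice */
        return grant;            /* sender transmits up to this amount */
    }

    /* Buffers drained by the application become grantable again. */
    void credit_return(struct receiver *r, uint64_t bytes)
    {
        r->free_bytes += bytes;
    }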

Both congestion-control mechanisms optimize network performance for the unique characteristics of AI and HPC workloads and provide the following differentiating characteristics of UET:

  • The ability to ramp from “zero” to “wire rate” instantaneously and back off very rapidly in the face of congestion (preserving the well-known and widely deployed additive-increase/multiplicative-decrease scheme, adapted for AI and HPC workloads).
  • Optimized operation in low-latency, short-RTT environments with very high-speed links, which are common in AI and HPC networks, where transfers may last only a small handful of RTTs. This necessitates congestion management that is highly responsive without overreacting.
  • Efficient management of multi-path transmission, with techniques to avoid specific paths that may be congested.
  • Ability to leverage optional “packet trimming” in network switches, which allows packets to be truncated rather than dropped in the fabric during congestion events. Because packet spraying causes reordering, loss becomes harder to detect. Packet trimming gives the receiver and the sender an early, explicit indication of congestion, allowing immediate loss recovery in spite of reordering, and is a critical feature in the low-RTT environments where UET is designed to operate (see the sketch after this list).
  • Congestion control mechanisms designed to address congestion in the network and also enable precise flow control within the host. This host-based congestion can occur, for example, within the receiver’s memory subsystem or on an internal host bus such as PCIe between an accelerator and a NIC.
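
To illustrate why trimming helps, consider the receiver's reaction when a trimmed header arrives instead of a silent drop: loss becomes an explicit event even though spraying reorders packets. All names here are hypothetical.

    #include <stdint.h>

    struct pkt { uint64_t msg_id; uint32_t seq; int trimmed; };

    static void request_retransmit(uint64_t msg_id, uint32_t seq)
    {
        (void)msg_id; (void)seq;  /* wire I/O elided in this sketch */
    }

    /* Called for every arriving packet, full or trimmed. */
    void on_packet(const struct pkt *p)
    {
        if (p->trimmed) {
            /* The payload was cut by a congested switch. Recover
             * immediately instead of waiting for a timeout or trying
             * to infer loss from out-of-order arrival. */
            request_retransmit(p->msg_id, p->seq);
            return;
        }
        /* ... deliver the payload to the matching buffer ... */
    }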

INCs

Another exciting feature of the upcoming Ultra Ethernet specification is UET’s support for INCs, sometimes known as “switch offload,” where network collective operations are offloaded or hardware-accelerated in a switch. This is the first such standard over Ethernet!

Switch vendors, accelerator/GPU designers, system vendors, and hyperscale network operators jointly designed UET’s INC support as a standard protocol for accelerator endpoints to communicate with neighboring network devices that actively participate in the distributed computation. This protocol is lightweight and suitable for hardware implementation, recognizing the constraints associated with implementation in a network switch operating at hundreds of terabits per second with hundreds of network interfaces.
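
The essence of an in-network reduction can be sketched as follows: the switch combines contributions from the collective's member ports and forwards a single result, instead of endpoints exchanging data all-to-all. The structure and names are illustrative only, not the UET INC protocol.

    #include <stdint.h>

    struct reduce_slot {
        int64_t  acc;       /* running combination (here, a sum) */
        uint32_t arrived;   /* contributions received so far */
        uint32_t expected;  /* members participating in this collective */
    };

    /* Called per arriving contribution; returns 1 when the reduction
     * is complete and the result should be forwarded or replicated. */
    int inc_contribute(struct reduce_slot *s, int64_t value)
    {
        s->acc += value;
        return ++s->arrived == s->expected;
    }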

UET Security

Ultra Ethernet has integrated security from the outset, incorporating authentication, authorization, and confidentiality within UET. It leverages ideas from IPsec and PSP*, such as the use of AES-GCM, key derivation functions, and replay protection. However, UET security differs from existing security protocols in a few key ways:

  • It supports efficient group keying, which allows many endpoints that are part of a single “security domain” (common for endpoints participating in an AI or HPC job) to communicate securely while optimizing the state that must be maintained (see the sketch after this list).
  • It supports independent client–server keys to allow for efficient scaling of server resources across many trust domains.
  • This is all done while preserving UET’s ephemeral connection architecture, with its benefits described above.
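
The payoff of group keying can be sketched as key derivation from a single security-domain secret: N endpoints in one job keep one domain key rather than N×N pairwise associations. The kdf() below is a placeholder standing in for a real key-derivation function of the kind UET borrows from IPsec/PSP; it is illustrative only and not secure.

    #include <stdint.h>

    /* Placeholder mixing only; a real KDF (e.g., as in PSP) goes here. */
    static void kdf(const uint8_t master[32], uint64_t peer_id, uint8_t out[32])
    {
        for (int i = 0; i < 32; i++)
            out[i] = master[i] ^ (uint8_t)(peer_id >> (8 * (i % 8)));
    }

    /* One domain key plus a peer identity yields a per-peer traffic key,
     * so key state need not be stored for every possible pair. */
    void derive_peer_key(const uint8_t domain_key[32], uint64_t peer_id,
                         uint8_t peer_key[32])
    {
        kdf(domain_key, peer_id, peer_key);
    }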

New Link Layer Features

At the link layer, UET introduces a standard for Link Layer Retry (LLR). AI servers employ extremely high bandwidth density over many links, and AI/HPC workloads perform parallel processing, where any underperforming link slows the whole group. LLR is an important tool for improving the reliability of marginal links, whose degraded behavior may result from transient physical-layer disruption, intermittently failing components, or faulty wiring. This helps avoid a large impact on overall tail latency in a distributed computation that may involve many tens or hundreds of thousands of accelerators.

LLR is a hop-by-hop technology negotiated by the Link Layer Discovery Protocol (LLDP) that allows packets to be retransmitted in the case of loss between two link partners. On an LLR link, each packet is held at the sender in a buffer until the receiver acknowledges its receipt. UEC-initiated LLDP extensions support the discovery and configuration of LLR parameters.
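
A hop-level retry buffer in the spirit of LLR might be sketched like this: every transmitted frame is held until the link partner acknowledges it, and a negative acknowledgment triggers retransmission on that link alone. Names and structure are hypothetical.

    #include <stdint.h>
    #include <string.h>

    #define WINDOW    256
    #define MAX_FRAME 1500

    struct held_frame { uint8_t data[MAX_FRAME]; uint16_t len; int valid; };
    static struct held_frame retry_buf[WINDOW];

    /* Keep a copy of each frame until the link partner acks it. */
    void llr_transmit(uint32_t seq, const void *data, uint16_t len)
    {
        struct held_frame *f = &retry_buf[seq % WINDOW];
        memcpy(f->data, data, len);
        f->len = len;
        f->valid = 1;
        /* ... frame goes on the wire carrying sequence number "seq" ... */
    }

    void llr_on_ack(uint32_t seq)
    {
        retry_buf[seq % WINDOW].valid = 0;  /* safe to release the copy */
    }

    void llr_on_nack(uint32_t seq)
    {
        struct held_frame *f = &retry_buf[seq % WINDOW];
        if (f->valid) {
            /* ... retransmit f->data (f->len bytes) on this link only ... */
        }
    }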

Watch This Space!

Drawing upon the innovation and expertise of its 90+ consortium members, UEC is aggressively working to finalize the initial specification and drive Ethernet forward as the premier networking technology for AI and HPC workloads. Further extensions and new developments in storage, performance and debugging, compliance, and management are underway. Stay tuned for further updates!

*The PSP Security Protocol details are at https://github.com/google/psp.