Accelerating AI with Open Standards: UEC’s Expanding Vision

In June, the Ultra Ethernet Consortium (UEC) released its 1.0 specification. This document is the culmination of thousands of hours of work toward enhancing Ethernet for the needs of AI and HPC workloads. It defines a new network stack, from the application through the libfabric API down to the new Ultra Ethernet Transport (UET) and optimizations at the link and physical (PHY) layers. The specification enables a breadth of hardware and software optimizations designed from the ground up for performant, next-generation AI and HPC deployments. With the hardware specification complete, ongoing work encompasses a wide range of tasks necessary to ensure a robust software ecosystem: everything from standardizing configuration APIs to coordinating the necessary Linux kernel work.
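
Applications reach the UET stack through the libfabric API. As a rough illustration of what that looks like from software, here is a minimal C sketch that queries libfabric for a reliable-datagram endpoint; the provider name "uet" and the capability choices are assumptions made for this example, not names taken from the specification.

```c
/* Minimal sketch: query libfabric for a reliable-datagram (RDM) endpoint.
 * The provider name "uet" is hypothetical; actual provider naming is up to
 * individual vendor implementations. Build with: cc demo.c -lfabric */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;

    hints->ep_attr->type = FI_EP_RDM;               /* reliable datagram endpoint */
    hints->caps = FI_MSG | FI_RMA;                  /* messaging and RMA */
    hints->fabric_attr->prov_name = strdup("uet");  /* hypothetical provider name */

    int ret = fi_getinfo(FI_VERSION(1, 20), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
    } else {
        for (struct fi_info *cur = info; cur; cur = cur->next)
            printf("provider: %s, fabric: %s\n",
                   cur->fabric_attr->prov_name, cur->fabric_attr->name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return ret;
}
```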

Beyond 1.0, AI and HPC workloads are rapidly evolving, and we continue to work toward standardizing the best ideas developed in the field. This rapid evolution is readily apparent in the traffic characteristics of AI and HPC workloads, where congestion patterns change faster than the underlying hardware. Addressing these rapid changes requires flexible congestion management mechanisms. To this end, we're standardizing Programmable Congestion Management (PCM), enabling anyone to implement a new congestion control algorithm in a standard language to address changing workloads. That algorithm will be usable on any NIC that supports UE PCM. To further improve congestion control, the UEC is also standardizing Congestion Signaling (CSIG). This enhancement allows packets to carry high-fidelity information about network congestion, enabling the transport protocol to react more quickly and accurately to dynamic conditions.
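
The PCM language and its runtime interfaces are being defined by the UEC, so the sketch below is only a plain-C illustration of the general shape of such an algorithm: an event handler that reacts to per-packet congestion feedback (the kind of high-fidelity signal CSIG allows packets to carry) with a simple additive-increase, multiplicative-decrease policy. All type, field, and function names here are hypothetical.

```c
/* Illustrative only: the general shape of an event-driven congestion control
 * algorithm of the kind PCM aims to make portable across NICs. All names are
 * hypothetical; the real UE PCM language and runtime are defined by the UEC. */
#include <stdint.h>
#include <stdio.h>

struct cc_state {
    uint32_t rate_mbps;      /* current sending rate */
    uint32_t min_rate_mbps;  /* floor */
    uint32_t max_rate_mbps;  /* line rate */
};

/* React to per-packet feedback. signal_level models a congestion signal in
 * [0, 255], such as the information CSIG lets packets carry. */
static void on_ack(struct cc_state *s, int congested, uint32_t signal_level)
{
    if (congested) {
        /* Multiplicative decrease, scaled by how congested the path reports itself. */
        uint32_t cut = (uint32_t)((uint64_t)s->rate_mbps * signal_level / 512);
        s->rate_mbps = (s->rate_mbps > s->min_rate_mbps + cut)
                           ? s->rate_mbps - cut : s->min_rate_mbps;
    } else {
        /* Additive increase: probe for spare bandwidth. */
        s->rate_mbps += 100;
        if (s->rate_mbps > s->max_rate_mbps)
            s->rate_mbps = s->max_rate_mbps;
    }
}

int main(void)
{
    struct cc_state s = { .rate_mbps = 400000, .min_rate_mbps = 1000,
                          .max_rate_mbps = 800000 };
    on_ack(&s, 1, 200);   /* heavy congestion reported */
    on_ack(&s, 0, 0);     /* clean ACK, probe upward */
    printf("rate after two events: %u Mb/s\n", s.rate_mbps);
    return 0;
}
```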

A second major focus of our ongoing work is improving the performance of small messages. The UET 1.0 protocol is designed to support forward-looking scales of a million hosts, all coordinating as part of a single job. To achieve this, a basic UET packet carries 104 bytes of headers. That is only a 2.5% overhead for the 4096-byte packets targeted by many workloads, but it is significant relative to smaller transactions (e.g., 256-byte transfers). UEC is pursuing optimizations across all layers of the stack, including a reduced-size forwarding header, aiming to cut this overhead in half for optimized deployments. Smaller packet overhead translates into better efficiency in workloads with small payloads, whether in HPC or in local scale-up networks.
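
A quick back-of-the-envelope calculation makes these figures concrete. The 104-byte header comes from the paragraph above; the 52-byte "reduced" header is purely an assumption used to illustrate what cutting the overhead roughly in half would mean.

```c
/* Header overhead measured as header bytes relative to payload bytes.
 * 104 B is the basic UET header size cited above; 52 B is an assumed
 * reduced size illustrating the "cut in half" goal. */
#include <stdio.h>

static double overhead_pct(double header_bytes, double payload_bytes)
{
    return 100.0 * header_bytes / payload_bytes;
}

int main(void)
{
    const double full_hdr = 104.0, reduced_hdr = 52.0;
    const double payloads[] = { 4096.0, 256.0 };

    for (int i = 0; i < 2; i++)
        printf("%4.0f-byte payload: %5.1f%% overhead (104 B), %5.1f%% (52 B)\n",
               payloads[i],
               overhead_pct(full_hdr, payloads[i]),
               overhead_pct(reduced_hdr, payloads[i]));
    return 0;   /* prints roughly 2.5% / 1.3% at 4096 B and 40.6% / 20.3% at 256 B */
}
```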

To improve the performance of scale-up networks, we're also working on an optimized transport layer. Scale-up networks are simpler than scale-out networks but have unique traffic management requirements, so work is in progress on a scale-up-focused transport that leverages the capabilities of the robust Ultra Ethernet Transport defined in the 1.0 specification.

Finally, we continue to work on standardizing In-Network Collectives (INC) for Ethernet networks. By taking the reduction operation that is common to AI and HPC workloads out of hosts and placing it in the network, INC can double the performance of all-reduce. Latency-sensitive operations such as barriers or small all-gathers also see substantial performance gains. These optimizations can result in significant improvements in certain AI and HPC use cases.
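
The rough factor of two comes from how much traffic each host must put on the wire. As a generic illustration (not a description of the UEC INC design), the sketch below compares per-host bytes sent by a conventional ring all-reduce, about 2(N-1)/N times the buffer size, with an in-network reduction, where each host sends its buffer once and receives the reduced result once.

```c
/* Per-host bytes sent for a ring all-reduce vs. an in-network reduction.
 * Generic model for illustration only; buffer size and host counts are
 * arbitrary and not tied to any UEC INC design detail. */
#include <stdio.h>

int main(void)
{
    const double buffer_mb = 1024.0;          /* per-host reduction buffer */
    const int hosts[] = { 8, 64, 1024 };

    for (int i = 0; i < 3; i++) {
        int n = hosts[i];
        /* Ring all-reduce: reduce-scatter plus all-gather, 2*(n-1)/n of the buffer. */
        double ring_mb = 2.0 * (n - 1) / n * buffer_mb;
        /* In-network reduction: each host sends its buffer once. */
        double inc_mb = buffer_mb;
        printf("%4d hosts: ring %6.0f MB sent/host, in-network %6.0f MB (%.2fx)\n",
               n, ring_mb, inc_mb, ring_mb / inc_mb);
    }
    return 0;
}
```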

We hope to wrap up many of these technologies in an updated specification release in the first half of next year.

Beyond these exciting technologies, we are collaborating with multiple other standards bodies to make our vision for Ethernet a reality. We're working with SNIA™ and NVM Express™ to improve the performance of storage applications, and we're coordinating with OCP to standardize our Switch Abstraction Interface (SAI) and Redfish configuration models, along with other networking technologies, such as CSIG, that have applications outside of AI/HPC networks.

We are very excited about the future of our work and look forward to continuing to improve Ethernet as the dominant platform for AI and HPC networks.

If you haven’t already, we encourage you to read our 1.0 specification here. Additionally, we invite you to join the UEC, contribute your ideas, and help define the future of networking for AI and HPC.