Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Effectively balancing traffic in datacenter networks is a crucial operational goal. Most existing load balancing approaches are handcrafted to the structure of the network and/or network workloads. Thus, new load balancing strategies are required if the underlying network conditions change, e.g., due to hard or grey failures, network topology evolution, or workload shifts. While we can theoretically derive the optimal load balancing strategy by solving an optimization problem given certain traffic and topology conditions, these problems take too much time to solve and makes the derived solution stale to deploy. In this paper, we describe a load balancing scheme Learned Load Balancing (LLB), which is a general approach to finding an optimal load balancing strategy for a given network topology and workload, and is fast enough in practice to deploy the inferred strategies. LLB uses deep supervised learning techniques to learn how to handle different traffic patterns and topology changes, and adapts to any failures in the underlying network. LLB leverages emerging trends in network telemetry, programmable switching, and “smart” NICs. Our experiments show that LLB performs well under failures and can be expanded to more complex, multi-layered network topologies. We also prototype neural network inference on smartNICs to demonstrate the workability of LLB.more » « lessFree, publicly-accessible full text available July 1, 2025
-
The wide adoption of the emerging SmartNIC technology creates new opportunities to offload application-level computation into the networking layer, which frees the burden of host CPUs, leading to performance improvement. Shuffle, the all-to-all data exchange process, is a critical building block for network communication in distributed data-intensive applications and can potentially benefit from SmartNICs. In this paper, we develop SmartShuffle, which accelerates the data-intensive application's shuffle process by offloading various computation tasks into the SmartNIC devices. SmartShuffle supports offloading both low-level network functions, including data partitioning and network transport, and high-level computation tasks, including filtering, aggregation, and sorting. SmartShuffle adopts a coordinated offload architecture to make sender-side and receiver-side SmartNICs jointly contribute to the benefits of shuffle computation offload. SmartShuffle carefully manages the tight and time-varying computation and memory constraints on the device. We propose a liquid offloading approach, which dynamically migrates operators between the host CPU and the SmartNIC at runtime such that resources in both devices are fully utilized. We prototype SmartShuffle on the Stingray SoC SmartNICs and plug it into Spark. Our evaluation shows that SmartShuffle improves host CPU efficiency and I/O efficiency with lower job completion time. SmartShuffle outperforms Spark, and Spark RDMA by up to 40% on TPC-H.more » « less