

This content will become publicly available on June 3, 2026

Title: Distributed Speed Scaling in Large-Scale Service Systems
Smart Servers, Smarter Speed Scaling: A Decentralized Algorithm for Data Center Efficiency

A team of researchers from Georgia Tech and the University of Minnesota has introduced a new algorithm designed to optimize energy use in large-scale data centers. As detailed in their paper “Distributed Rate Scaling in Large-Scale Service Systems,” the team developed a decentralized method that allows each server to adjust its processing speed autonomously, without communication between servers or knowledge of system-wide traffic. The algorithm uses idle time as a local signal to guide processing speed, ensuring that all servers converge toward a globally optimal rate. This addresses a critical issue in modern computing infrastructure: balancing energy efficiency with performance under uncertainty and at scale. The authors demonstrate that their approach not only stabilizes the system but also achieves asymptotic optimality as the number of servers increases. The work is poised to significantly reduce energy consumption in data centers, which are projected to account for up to 8% of U.S. electricity use by 2030.
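The summary above gives only the high-level idea of the algorithm. Purely as an illustration, the toy simulation below shows how an idle-time signal can drive a fully local speed update: each server measures its own idle fraction over an epoch and nudges its speed with a diminishing step, with no communication and no knowledge of the total arrival rate. Everything here is an assumption for the sketch rather than the authors' formulation: the single-server loss model, the even split of traffic across servers, the target idle fraction, and all parameter values.

```python
import random

# Toy illustration only (not the paper's algorithm): each server observes its
# own idle fraction over an epoch and adjusts its speed locally, with no
# communication between servers and no knowledge of the total arrival rate.

NUM_SERVERS = 50              # assumed system size
TOTAL_ARRIVAL_RATE = 40.0     # assumed total arrival rate, unknown to the servers
TARGET_IDLE = 0.2             # assumed target idle fraction (stands in for the power/QoS trade-off)
EPOCH = 500.0                 # simulated seconds per adaptation epoch
NUM_EPOCHS = 50

def idle_fraction(speed: float, local_rate: float, horizon: float) -> float:
    """Simulate one single-server loss system for `horizon` seconds; return its idle fraction."""
    t, busy_until, idle = 0.0, 0.0, 0.0
    while True:
        t += random.expovariate(local_rate)          # next local arrival
        if t >= horizon:
            idle += max(0.0, horizon - busy_until)   # idle tail at the end of the epoch
            return idle / horizon
        if t >= busy_until:                          # server is free: accept the job
            idle += t - busy_until
            busy_until = t + random.expovariate(speed)
        # otherwise the server is busy and the arriving job is lost (loss model)

speeds = [1.0] * NUM_SERVERS
for epoch in range(1, NUM_EPOCHS + 1):
    step = 1.0 / epoch                               # diminishing, Robbins-Monro style step size
    for i in range(NUM_SERVERS):
        f = idle_fraction(speeds[i], TOTAL_ARRIVAL_RATE / NUM_SERVERS, EPOCH)
        # Purely local update: speed up when idling less than the target, slow down otherwise.
        speeds[i] = max(0.1, speeds[i] - step * (f - TARGET_IDLE))

print(f"mean speed after {NUM_EPOCHS} epochs: {sum(speeds) / NUM_SERVERS:.3f}")
```

The paper's convergence and asymptotic-optimality guarantees rest on stochastic approximation arguments that this sketch does not reproduce; it only conveys why a locally observed idle-time signal is enough for each server to act on its own.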
Award ID(s):
2113027 2240982
PAR ID:
10634973
Author(s) / Creator(s):
; ;
Publisher / Repository:
INFORMS
Date Published:
Journal Name:
Operations Research
ISSN:
0030-364X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider a large-scale parallel-server loss system with an unknown arrival rate, where each server is able to adjust its processing speed. The objective is to minimize the system cost, which consists of a power cost to maintain the servers' processing speeds and a quality of service cost depending on the tasks' processing times, among others. We draw on ideas from stochastic approximation to design a novel speed scaling algorithm and prove that the servers' processing speeds converge to the globally asymptotically optimum value. Curiously, the algorithm is fully distributed and does not require any communication between servers. Apart from the algorithm design, a key contribution of our approach lies in demonstrating how concepts from the stochastic approximation literature can be leveraged to effectively tackle learning problems in large-scale, distributed systems. En route, we also analyze the performance of a fully heterogeneous parallel-server loss system, where each server has a distinct processing speed, which might be of independent interest. 
  2. Gibbons, P; Pekhimenko, G; De_Sa, C (Ed.)
    Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They may also be inelastic in their resource management. This is particularly exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual, heavyweight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism, while minimizing aggregation time and resource consumption. Our preliminary experimental results show that LIFL achieves significant improvement in resource efficiency and aggregation speed for supporting FL at scale, compared to existing serverful and serverless FL systems. (A generic sketch of the hierarchical aggregation step appears after this list.)
  3. Escalating application demand and the end of Dennard scaling have put energy management at the center of cloud operations. Because of the huge cost and long lead time of provisioning new data centers, operators want to squeeze as much use out of existing data centers as possible, often limited by power provisioning fixed at the time of construction. Workload demand spikes and the inherent variability of renewable energy, as well as increased power unreliability from extreme weather events and natural disasters, make the data center power management problem even more challenging. We believe it is time to build a power control plane to provide fine-grained observability and control over data center power to operators. Our goal is to help make data centers substantially more elastic with respect to dynamic changes in energy sources and application needs, while still providing good performance to applications. There are many use cases for cloud power control, including increased power oversubscription and use of green energy, resilience to power failures, large-scale power demand response, and improved energy efficiency.
  4. The rapid growth in data center workloads and the increasing complexity of modern applications have created significant tension between computational performance and thermal management. Traditional air-cooling systems, while widely adopted, are reaching their limits in handling the rising thermal footprints and higher rack power densities of next-generation servers, often resulting in thermal throttling and decreased efficiency and emphasizing the need for more efficient cooling solutions. Direct-to-chip liquid cooling with cold plates has emerged as a promising solution, providing efficient heat dissipation for high-performance servers. However, challenges remain, such as ensuring system stability under varying thermal loads and optimizing integration with existing infrastructure. This study provides a comprehensive experimental investigation of the critical steps and tests necessary for commissioning coolant distribution units (CDUs) in direct-to-chip liquid-cooled data centers. It examines the hydraulic, thermal, and energy aspects, establishing the groundwork for Liquid-to-Air (L2A) CDU data centers. A CDU’s performance was evaluated under different conditions. First, its maximum cooling capacity was measured to be as high as 89.9 kW at an approach temperature difference (ATD) of 18.3 °C, with a heat exchanger effectiveness of 0.83. Then, to assess the cooling performance and stability of the CDU, a low-power test and a transient thermohydraulic test were conducted. The results showed instability in the supply fluid temperature (SFT) caused by oscillation in fan speed at low thermal loads. Despite this, heat removal rates remained constant across varying supply air temperatures (SATs), and a partial power usage effectiveness (PPUE) of 1.042 was achieved at 100% heat load (86 kW) under different SATs. This research sets a foundation for improving L2A CDU performance and offers practical insights for overcoming current cooling limitations in data centers. (A back-of-the-envelope check of the reported figures appears after this list.)
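Item 2 above refers to hierarchical aggregation of model updates. LIFL's actual API and its eBPF/shared-memory machinery are not reproduced in this record; the sketch below is only a generic, two-level FedAvg-style reduction, with hypothetical names (`weighted_average`, `hierarchical_aggregate`) and an assumed topology, to illustrate what the hierarchical aggregation step computes: local aggregators average the updates of their assigned clients, and a root merges the partial results weighted by sample counts.

```python
from typing import List, Tuple

# An update is a (model-update vector, number of local training samples) pair.
Update = Tuple[List[float], int]

def weighted_average(updates: List[Update]) -> Update:
    """Combine updates by sample-count-weighted averaging (FedAvg-style)."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    merged = [sum(vec[i] * n for vec, n in updates) / total for i in range(dim)]
    return merged, total

def hierarchical_aggregate(groups: List[List[Update]]) -> List[float]:
    """Two-level aggregation: each group is reduced locally, then the root merges the partials."""
    partials = [weighted_average(group) for group in groups]  # local aggregators, parallelizable
    final, _ = weighted_average(partials)
    return final

# Toy usage: three local aggregators, each holding a few client updates.
groups = [
    [([1.0, 2.0], 10), ([3.0, 4.0], 30)],
    [([0.0, 0.0], 20)],
    [([2.0, 2.0], 40)],
]
print(hierarchical_aggregate(groups))   # identical to a flat weighted average over all clients
```

Because the partial aggregates carry their sample counts, the two-level result equals the flat weighted average; the hierarchy exists to parallelize the reduction and reduce communication at scale, which is the step LIFL accelerates with shared-memory processing and locality-aware placement.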
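Item 4's reported CDU figures can be sanity-checked against textbook definitions. The paper's exact conventions are not given in this record, so the short script below assumes the common definitions PPUE = (heat load + cooling power) / heat load and effectiveness = actual heat transfer / maximum transferable heat; the implied numbers are back-of-the-envelope only.

```python
# Back-of-the-envelope check of the reported CDU figures, under assumed
# textbook definitions (not necessarily those used in the paper).

def cooling_overhead_from_ppue(ppue: float, it_load_kw: float) -> float:
    """Assume PPUE = (IT load + cooling power) / IT load and solve for cooling power."""
    return (ppue - 1.0) * it_load_kw

def max_transferable_heat(q_actual_kw: float, effectiveness: float) -> float:
    """Assume effectiveness = actual heat transfer / maximum transferable heat."""
    return q_actual_kw / effectiveness

# Reported operating point: PPUE of 1.042 at an 86 kW heat load.
print(f"implied CDU fan and pump power: {cooling_overhead_from_ppue(1.042, 86.0):.1f} kW")  # ~3.6 kW

# Reported maximum: 89.9 kW removed at a heat exchanger effectiveness of 0.83.
print(f"implied maximum transferable heat: {max_transferable_heat(89.9, 0.83):.0f} kW")     # ~108 kW
```

Under these assumed definitions, a PPUE of 1.042 at 86 kW corresponds to roughly 3.6 kW of cooling overhead, which is the sense in which the abstract presents the L2A CDU as an efficient alternative to air cooling.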