Title: Latency-Aware Dynamic Server and Cooling Capacity Provisioner for Data Centers
Data center operators generally overprovision IT and cooling capacities to address unexpected utilization increases that can violate service quality commitments. This results in energy wastage. To reduce this wastage, we introduce HCP (Holistic Capacity Provisioner), a service-latency-aware management system for dynamically provisioning the server and cooling capacity. Short-term load prediction is used to adjust the online server capacity to concentrate the workload onto the smallest possible set of online servers. Idling servers are completely turned off based on a separate long-term utilization predictor. HCP targets data centers that use chilled air cooling and varies the cooling provided commensurately, using adjustable aperture tiles and speed control of the blower fans in the air handler. An HCP prototype supporting server heterogeneity is evaluated with real-world workload traces/requests and realizes up to 32% total energy savings while limiting the 99th-percentile and average latency increases to at most 6.67% and 3.24%, respectively, against a baseline system where all servers are kept online.
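The two-timescale provisioning idea described in the abstract can be illustrated with a minimal sketch. This is not HCP's actual algorithm: the function names, the 20% headroom, the requests-per-second capacity model, and the assumption of identical servers are all hypothetical and chosen only for illustration.

```python
def servers_needed(load_rps: int, capacity_rps: int, headroom_pct: int = 20) -> int:
    """Smallest server count covering the load plus a safety headroom.

    Integer arithmetic is used so the ceiling is exact.
    All parameters are illustrative assumptions, not values from the paper.
    """
    required = load_rps * (100 + headroom_pct)
    per_server = capacity_rps * 100
    return max(1, -(-required // per_server))  # ceiling division, at least 1


def plan_capacity(short_term_load: int, long_term_peak: int,
                  capacity_rps: int, total_servers: int) -> dict:
    """Split servers into online, standby (powered but idle), and off.

    The short-term prediction sizes the online set; the long-term
    prediction decides how many idle servers may be switched off.
    """
    online = min(total_servers, servers_needed(short_term_load, capacity_rps))
    powered = min(total_servers,
                  max(online, servers_needed(long_term_peak, capacity_rps)))
    return {
        "online": online,
        "standby": powered - online,
        "off": total_servers - powered,
    }


plan = plan_capacity(short_term_load=900, long_term_peak=2000,
                     capacity_rps=100, total_servers=40)
print(plan)
```

In this sketch the cooling side is left out; a fuller model would scale airflow with the number of powered servers, as the abstract describes.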
Award ID(s):
1738793
NSF-PAR ID:
10338837
Journal Name:
SoCC '21, Seattle, WA
Page Range / eLocation ID:
335 to 349
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With the rapid development of the Internet of Things (IoT), computational workloads are gradually moving toward the internet edge for low latency. Due to significant workload fluctuations, edge data centers built in distributed locations suffer from resource underutilization and require capacity underprovisioning to avoid wasting capital investment. The workload fluctuations, however, also make edge data centers more suitable for battery-assisted power management to counter the performance impact due to underprovisioning. In particular, the workload fluctuations allow the battery to be frequently recharged and made available for temporary capacity boosts. However, using batteries can overload the data center cooling system, which is designed with a capacity matching that of the power system. In this paper, we design a novel power management solution, DeepPM, that exploits the UPS battery and cold air inside the edge data center as energy storage to boost the performance. DeepPM uses deep reinforcement learning (DRL) to learn the data center thermal behavior online in a model-free manner and uses it on-the-fly to determine power allocation for optimum latency performance without overheating the data center. Our evaluation shows that DeepPM can improve latency performance by more than 50% compared to a power capping baseline while the server inlet temperature remains within safe operating limits (e.g., 32°C).
  2. Modern Information Technology (IT) servers are typically assumed to operate in quiescent conditions with almost zero static pressure differentials between inlet and exhaust. However, when operating in a data center containment system, the IT equipment thermal status is a strong function of the non-homogeneous environment of the air space, IT utilization workloads, and the overall facility cooling system design. To implement a dynamic and interfaced cooling solution, the interdependencies of variabilities between the chassis, rack, and room level must be determined. In this paper, the effect of positive as well as negative static pressure differential between inlet and outlet of servers on thermal performance, fan control schemes, the direction of air flow through the servers, and fan energy consumption within a server is observed at the chassis level. In this study, a web server with internal air-flow paths segregated into two separate streams, each having a dedicated fan or group of fans within the chassis, is operated over a range of static pressure differentials across the server. Experiments were conducted to observe the steady-state temperatures of CPUs and fan power consumption. Furthermore, the server fan speed control scheme's transient response to a typical peak in IT computational workload while operating at negative pressure differentials across the server is reported. The effects of the internal air flow paths within the chassis are studied through experimental testing and simulations for flow visualization. The results indicate that at higher positive differential pressures across the server, increasing server fan speeds will have minimal impact on the cooling of the system. In contrast, at lower, negative differential pressures, server fan power becomes strongly dependent on the operating pressure differential.
    More importantly, it is shown that an imbalance of flow impedances in internal airflow paths and fan control logic can trigger recirculation of exhaust air within the server. For accurate prediction of airflow in cases where a negative pressure differential exists, this study proposes an extended fan performance curve instead of a regular fan performance curve to be applied as a fan boundary condition for Computational Fluid Dynamics simulations.
  3. Over the past few years, energy consumption by IT equipment in data centers has risen steadily. Thus, the need to minimize the environmental impact of data centers by optimizing energy consumption and material use is increasing. In 2011, the Open Compute Project was started, aimed at sharing specifications and best practices with the community for highly energy-efficient and economical data centers. The first Open Compute server was the 'Freedom' server. It was a vanity-free design, completely custom designed using a minimum number of components, and was deployed in a data center in Prineville, Oregon. Within the first few months of operation, a considerable amount of energy and cost savings was observed. Since then, progressive generations of Open Compute servers have been introduced. Initially, the servers used for compute purposes mainly had a 2-socket architecture. In 2015, the Yosemite Open Compute Server was introduced, which was suited for higher compute capacity. Yosemite has a system-on-a-chip architecture with four CPUs per sled, providing a significant improvement in performance per watt over the previous generations. This study mainly focuses on air flow optimization in the Yosemite platform to improve its overall cooling performance. Commercially available CFD tools have made it possible to do the thermal modeling of these servers and predict their efficiency. A detailed server model is generated using a CFD tool and optimized to improve the air flow characteristics in the server. The thermal model of the improved design is compared to the existing design to show the impact of air flow optimization on flow rates and flow speeds, which in turn affect CPU die temperatures and cooling power consumption and thus the overall cooling performance of the Yosemite platform.
    Emphasis is placed on effective utilization of fans in the server as compared to the original design and on improving air flow characteristics inside the server via improved ducting.
  4. In typical data centers, the servers and IT equipment are cooled by air and almost half of total IT power is dedicated to cooling. Hybrid cooling is a combined cooling technology with both air and water, where the main heat-generating components are cooled by water or water-based coolants and the rest of the components are cooled by air supplied by a CRAC or CRAH. Retrofitting air-cooled servers with cold plates and pumps offers advantages for the thermal management of CPUs and other high-heat-generating components. In a typical 1U server, the CPUs were retrofitted with cold plates and the server was tested with raised coolant inlet conditions. The study showed the server can operate with maximum utilization for CPUs, DIMMs, and PCH for inlet coolant temperatures from 25–45 °C following the ASHRAE guidelines. The server was also tested for failure scenarios of the pumps and fans with reduced numbers of fans and pumps. To reduce cooling power consumption at the facility level and increase air-side economizer hours, the hybrid cooled server can be operated at raised inlet air temperatures. The trade-off in energy savings at the facility level due to raising the inlet air temperatures versus the possible increase in server fan power and component temperatures is investigated. A detailed CFD analysis with a minimum number of server fans can provide a way to find an operating range of inlet air temperature for a hybrid cooled server. Changes in the model are carried out in 6SigmaET for an individual server and compared to the experimental data to validate the model. The results from this study can be helpful in determining the room-level operating set points for data centers housing hybrid cooled server racks.
  5. As online frameworks and services grow rapidly with the evolution of web-based Artificial Intelligence (AI) applications, server rooms are being upgraded in computational capacity and size to keep up with these demands. Enterprise companies with limited-capacity server rooms struggle to keep up with these increasing computational demands. Hence, some of them end up outsourcing their servers to co-located facilities (Co-Lo) and others choose to upgrade their existing server rooms. Correspondingly, the thermal load associated with such upgrades is typically tremendous. Approximately 40% of the power consumed by datacentres is dissipated as heat. Conventional HVAC systems fail to satisfy the requirements of such server capacities. Not only do they struggle to fulfil the cooling load, but their maldistribution of cool air into the server room is a major cause of hotspot formation. To tackle this issue, Liquid-to-Air (L2A) Coolant Distribution Units (CDUs) are being used as a liquid-based cooling solution for rack-level cooling. This type of CDU provides efficient cooling for servers through liquid coolant that is distributed into cooling loops mounted on top of each server board. The generated heat is carried away by this liquid coolant back to the CDU, which then dissipates it into the surrounding air using dedicated pumps, fans, and a heat exchanger, hence the name Liquid-to-Air.

    In the present work, one of the most popular liquid cooling strategies is explored based on various scenarios. The performance of a 24-kW Liquid-to-Air (L2A) CDU is judged based on cooling effect, stability, and reliability. The study is carried out experimentally, using a test rack with three thermal test vehicles (TTVs) to investigate various operation scenarios. Both the liquid coolant and air sides of this experimental setup are equipped with the required instrumentation to monitor and analyse the tests. All test cases were run in a room with limited air conditioning to resemble the environment of upgraded server rooms with conventional AC systems. Moreover, the impact of using such a high-power-density cooling unit on the server room environment with a restricted HVAC system is also brought to light. Environmental and human comfort parameters such as noise, air velocity, and ambient temperature are measured under various operating conditions and benchmarked against their ranges for human comfort as listed in ASHRAE standards. At the end of this research, recommendations for best practice are provided along with areas of enhancement for the selected system.