Title: An Adaptive Approach for Dealing with Flow Disruption in Virtualized Water-Cooled Data Centers
The recent availability of water cooling systems that can be easily retrofitted to stock servers by replacing the heatsinks with coldplates has made it possible to use such systems for non-HPC cloud/data center servers. These cooling systems use pumps to circulate water, and the pumps are likely to fail in the long run. We present a technique to handle flow disruptions caused by pump failures in a virtualized environment. The solution uses an estimate of the residual cooling capacity left in the failed cooling system to adaptively adjust the CPU clock frequency as virtual machines are migrated off the racks affected by the failure. As the experimental results show, this minimizes the degradation of the tail latencies of served requests during the migration interval for all servers affected by the failure.
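The abstract does not give the estimation model or the control policy, so the following Python sketch is only an illustration of the general idea, not the authors' method. It assumes a lumped thermal model in which the stagnant coolant can absorb heat until it reaches a temperature limit, and a hypothetical cubic CPU power model. The names `LoopState`, `residual_capacity_joules`, `cpu_power_watts`, and `pick_frequency`, as well as all numeric parameters, are assumptions introduced for this example.

```python
from dataclasses import dataclass

# Approximate specific heat of liquid water, in J/(kg*K).
WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0


@dataclass
class LoopState:
    """Hypothetical snapshot of a cooling loop after its pump has failed."""
    coolant_mass_kg: float      # stagnant water still in thermal contact (assumed known)
    coolant_temp_c: float       # current coolant temperature
    max_coolant_temp_c: float   # highest coolant temperature considered safe (assumed)


def residual_capacity_joules(state: LoopState) -> float:
    """Heat the stagnant coolant can still absorb before reaching its limit."""
    headroom_k = max(0.0, state.max_coolant_temp_c - state.coolant_temp_c)
    return state.coolant_mass_kg * WATER_SPECIFIC_HEAT_J_PER_KG_K * headroom_k


def cpu_power_watts(freq_ghz: float, static_w: float = 20.0,
                    dynamic_coeff: float = 5.0) -> float:
    """Assumed power model: static power plus a term cubic in frequency."""
    return static_w + dynamic_coeff * freq_ghz ** 3


def pick_frequency(state: LoopState, migration_time_left_s: float,
                   available_freqs_ghz: list[float]) -> float:
    """Pick the highest frequency whose power fits the remaining heat budget.

    The budget spreads the residual capacity evenly over the time still
    needed to migrate the virtual machines away. If no frequency fits,
    fall back to the lowest available one.
    """
    budget_w = residual_capacity_joules(state) / max(migration_time_left_s, 1e-9)
    feasible = [f for f in available_freqs_ghz if cpu_power_watts(f) <= budget_w]
    return max(feasible) if feasible else min(available_freqs_ghz)
```

Re-evaluating `pick_frequency` periodically during migration would let the chosen frequency track the shrinking migration interval and the rising coolant temperature. A real controller would also need measured rather than assumed thermal parameters.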
Award ID(s):
1738793
PAR ID:
10162330
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
2019 IEEE 12th International Conference on Cloud Computing (CLOUD)
Page Range / eLocation ID:
296 to 300
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The recent availability of warm water cooling systems that can be easily retrofitted to stock servers by replacing the heatsinks with coldplates has made it possible to use such cooling for non-HPC cloud/data center servers. These cooling systems use internal pumps in rack-level heat exchangers as well as external pumps, all of which can fail. We present a systematic study of the pump failures that disrupt flow in the cooling system, and we propose and experimentally evaluate techniques for reducing service disruptions during failures while avoiding damage to the servers where water cooling has failed.
  2. In typical data centers, the servers and IT equipment are cooled by air, and almost half of total IT power is dedicated to cooling. Hybrid cooling is a combined cooling technology that uses both air and water: the main heat-generating components are cooled by water or water-based coolants, and the rest of the components are cooled by air supplied by a CRAC or CRAH. Retrofitting air-cooled servers with cold plates and pumps offers an advantage in the thermal management of CPUs and other high heat-generating components. In a typical 1U server, the CPUs were retrofitted with cold plates and the server was tested with raised coolant inlet conditions. The study showed that the server can operate at maximum utilization for CPUs, DIMMs, and PCH for inlet coolant temperatures from 25–45 °C, following the ASHRAE guidelines. The server was also tested for pump and fan failure scenarios with reduced numbers of fans and pumps. To reduce cooling power consumption at the facility level and increase air-side economizer hours, the hybrid cooled server can be operated at raised inlet air temperatures. The trade-off between energy savings at the facility level due to raising the inlet air temperatures and the possible increase in server fan power and component temperatures is investigated. A detailed CFD analysis with a minimum number of server fans can provide a way to find an operating range of inlet air temperature for a hybrid cooled server. Changes in the model are carried out in 6SigmaET for an individual server and compared to the experimental data to validate the model. The results from this study can be helpful in determining the room-level operating set points for data centers housing hybrid cooled server racks.
  3. Data center availability plays a vital role, and during a cooling system failure the inlet temperature of the IT equipment increases rapidly until it reaches a threshold beyond which the equipment starts throttling or shuts down because of overheating. Hence, it is especially important to understand failures and their effects. This study presents an experimental investigation and analysis of a facility-level cooling system failure scenario in which a chilled water interruption is introduced to the data center. Quantitative instrumentation tools, including wireless temperature and pressure sensors, were used to measure the discrete air inlet temperature and the pressure differential through the cold aisle enclosure, respectively. In addition, Intelligent Platform Management Interface (IPMI) and cooling system data during failure and recovery are reported. Furthermore, the IT equipment performance and response for open and contained environments were simulated and compared. Finally, an experiment-based analysis of the Ride Through Time (RTT) of servers during chilled water interruption of the cooling infrastructure is presented as well. The results showed that for all three classes of servers tested during the cooling failure, cold aisle containment (CAC) helped keep the servers cooler for longer. The containment provided a barrier between the hot and cold air streams and caused a slight negative pressure to build up, which allowed the servers to pull cold air from the underfloor plenum. In addition, the results show that the effect of CAC on IT equipment performance and response can vary, depending on the servers' airflow and generation and hence on the types of servers deployed in the cold aisle enclosure. Moreover, it was shown that, compared to the discrete sensors, the IPMI inlet temperature sensors underestimate the RTT by 42% and 12% for the CAC and open cases, respectively.
  4. Modern data centers operate at high power to support increased power density, maintenance, and cooling, and they account for almost 2 percent (70 billion kilowatt-hours) of the total energy consumption in the US. IT components and cooling systems make up the major portion of this energy consumption. Although data centers are designed to perform efficiently, cooling the high-density components is still a challenge. Alternative methods to improve cooling efficiency have therefore become a driver for reducing cooling cost. Because liquid coolants have a higher specific heat capacity, density, and thermal conductivity than air, liquid cooling is more efficient, and hybrid cooling can bring the advantage of liquid cooling to the high heat-generating components in traditional air-cooled servers. In this experiment, a 1U server is equipped with cold plates to cool the CPUs while the rest of the components are cooled by fans. In this study, a predictive fan and pump failure analysis is performed, which also helps to explore the options for redundancy and to reduce the cooling cost by improving cooling efficiency. Redundancy requires knowledge of planned and unplanned system failures. As the main heat-generating components are cooled by liquid, warm water cooling can be employed to observe the effects of raised inlet conditions in a hybrid cooled server with failure scenarios. The ASHRAE guidance class W4 for liquid cooling is chosen for our experiment, which operates in a range from 25 °C to 45 °C. The experiments are conducted separately for the pump and fan failure scenarios. Computational loads of idle, 10%, 30%, 50%, 70%, and 98% are applied while powering only one pump, and the miniature dry cooler fans are controlled externally to maintain a constant coolant inlet temperature. As the rest of the components, such as DIMMs and PCH, are cooled by air, maximum memory utilization is applied while reducing the number of fans in each case for the fan failure scenario.
The component temperatures and power consumption are recorded in each case for performance analysis.
  5. The most common approach to air cooling of data centers involves the pressurization of the plenum beneath the raised floor and delivery of air flow to racks via perforated floor tiles. This cooling approach is thermodynamically inefficient due in large part to the pressure losses through the tiles. Furthermore, it is difficult to control flow at the aisle and rack level since the flow source is centralized rather than distributed. Distributed cooling systems are more closely coupled to the heat generating racks. In overhead cooling systems, one can distribute flow to distinct aisles by placing the air mover and water cooled heat exchanger directly above an aisle. Two arrangements are possible: (i.) placing the air mover and heat exchanger above the cold aisle and forcing downward flow of cooled air into the cold aisle (Overhead Downward Flow (ODF)), or (ii.) placing the air mover and heat exchanger above the hot aisle and forcing heated air upwards from the hot aisle through the water cooled heat exchanger (Overhead Upward Flow (OUF)). This study focuses on the steady and transient behavior of overhead cooling systems in both ODF and OUF configurations and compares their cooling effectiveness and energy efficiency. The flow and heat transfer inside the servers and heat exchangers are modeled using physics based approaches that result in differential equation based mathematical descriptions. These models are programmed in the MATLAB™ language and embedded within a CFD computational environment (using the commercial code FLUENT™) that computes the steady or instantaneous airflow distribution. The complete computational model is able to simulate the complete flow and thermal field in the airside, the instantaneous temperatures within and pressure drops through the servers, and the instantaneous temperatures within and pressure drops through the overhead cooling system. 
Instantaneous overall energy consumption (1st Law) and exergy destruction (2nd Law) were used to quantify overall energy efficiency and to identify inefficiencies within the two systems. The server cooling effectiveness, based on an effectiveness-NTU model for the servers, was used to assess the cooling effectiveness of the two overhead cooling approaches.
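Item 5 mentions an effectiveness-NTU model for the servers without giving its form. As a rough illustration only, the Python sketch below uses the textbook relation for a fluid stream passing over a surface at uniform temperature, where the effectiveness is 1 − exp(−NTU) and NTU = UA / (ṁ·c_p). Treating a server this way, and every numeric value in the usage note, is an assumption made for this example rather than something taken from the cited study.

```python
import math

# Approximate specific heat of air at room conditions, in J/(kg*K).
AIR_SPECIFIC_HEAT_J_PER_KG_K = 1005.0


def ntu(ua_w_per_k: float, mass_flow_kg_per_s: float,
        cp_j_per_kg_k: float = AIR_SPECIFIC_HEAT_J_PER_KG_K) -> float:
    """Number of transfer units: UA divided by the air stream's heat capacity rate."""
    return ua_w_per_k / (mass_flow_kg_per_s * cp_j_per_kg_k)


def effectiveness(ntu_value: float) -> float:
    """Effectiveness for a stream exchanging heat with a uniform-temperature surface."""
    return 1.0 - math.exp(-ntu_value)


def outlet_temp_c(inlet_temp_c: float, surface_temp_c: float, eff: float) -> float:
    """Air outlet temperature implied by the effectiveness definition."""
    return inlet_temp_c + eff * (surface_temp_c - inlet_temp_c)
```

For example, with a hypothetical UA of 20.1 W/K and an air mass flow of 0.02 kg/s, `ntu` returns about 1.0 and `effectiveness` about 0.63, so air entering at 25 °C past a 65 °C surface would leave at roughly 50 °C under these assumptions.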