Title: Failure Analysis of Direct Liquid Cooling System in Data Centers
In this paper, the impact of direct liquid cooling (DLC) system failure on IT equipment is studied experimentally. The main factors anticipated to affect the IT equipment response during failure are CPU utilization, coolant set point temperature (SPT), and server type. These factors are varied experimentally, and the IT equipment response is studied in terms of chip temperature and power, CPU utilization, and total server power. It was found that failure of the cooling system is hazardous and can lead to data center shutdown in less than a minute. Additionally, the CPU frequency throttling mechanism was found to be vital to understanding the changes in chip temperature, power, and utilization. Other mechanisms associated with high temperatures, such as increased leakage power and fan speed changes, were also observed. Finally, possible remedies are proposed to reduce both the probability and the consequences of cooling system failure.
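The sub-minute shutdown window and the throttling behavior described above can be illustrated with a simple lumped-capacitance model: once coolant flow stops, chip temperature climbs toward the throttle trip point, firmware sheds frequency, and total power (including a temperature-dependent leakage term) settles near the throttle ceiling. The sketch below is illustrative only; every parameter is an assumption, not a value from the paper.

```python
# Minimal lumped-capacitance sketch of chip temperature and frequency
# throttling after a coolant-flow failure. All parameters are
# illustrative assumptions, not measurements from the paper.

T_AMB = 45.0        # degC, air around the cold plate after flow stops (assumed)
T_THROTTLE = 95.0   # degC, throttling trip point (assumed)
C_TH = 20.0         # J/K, lumped thermal capacitance (assumed)
R_TH = 0.8          # K/W, junction-to-ambient resistance without coolant (assumed)
P_MAX = 120.0       # W, package power at full frequency (assumed)
DT = 0.1            # s, integration step

def step(temp: float, freq_scale: float) -> tuple[float, float]:
    """Advance the lumped thermal model by one time step."""
    # Dynamic power scales roughly with frequency (voltage held fixed);
    # leakage grows with temperature -- both effects noted in the paper.
    p_dyn = P_MAX * freq_scale
    p_leak = 5.0 * 1.05 ** (temp - 60.0)   # crude exponential leakage term
    power = p_dyn + p_leak
    temp += DT * (power - (temp - T_AMB) / R_TH) / C_TH
    return temp, power

temp, freq = 70.0, 1.0
for t in range(int(60 / DT)):              # one minute after failure
    temp, power = step(temp, freq)
    if temp > T_THROTTLE:                  # firmware throttles the clock
        freq = max(0.3, freq - 0.02)       # shed frequency to cap temperature
    if t % 100 == 0:
        print(f"t={t*DT:5.1f}s  T={temp:6.1f}C  f={freq:4.2f}  P={power:6.1f}W")
```

Under these assumed numbers the junction crosses the trip point within seconds, consistent with the paper's observation that an unmitigated DLC failure can force a shutdown in under a minute.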
Award ID(s):
1738793
PAR ID:
10058025
Author(s) / Creator(s):
Date Published:
Journal Name:
ASME 2017 International Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Microsystems collocated with the ASME 2017 Conference on Information Storage and Processing Systems
Page Range / eLocation ID:
V001T02A008
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
1. During the lifespan of a data center, power outages and blower cooling failures are common occurrences. Given that data centers play a vital role in modern life, it is especially important to understand these failures and their effects. A previous study [16] showed that cold aisle containment might have a negative impact on IT equipment uptime during a blower failure. This new study further analyzed the impact of containment on IT equipment uptime during a CRAH blower failure and compared IT equipment performance with and without a pressure relief mechanism implemented in the containment system. The results show that the effect of pressure relief in the containment solution on IT equipment performance and response varies with server airflow and generation, and hence with the types of servers deployed in the cold aisle enclosure. The results also showed that, compared to the discrete sensors, the IPMI inlet temperature sensors underestimate the Ride Through Time (RTT) by 32%. This means that RTT calculations based on the IPMI inlet sensors, as they exist today in these servers, may be inaccurate due to variations in the sensor readings, as discussed in a previous study [26]. Additionally, it was shown that all Dell PowerEdge 2950 servers have a similar IPMI inlet temperature reading, regardless of mounting location. As external system resistance increases during a cooling failure, these servers exhibit internal recirculation through their weaker power supply fans, which is reflected in high IPMI inlet temperature readings. For this server specifically, a pressure relief mechanism reduces the external resistance, thereby eliminating internal recirculation and resulting in lower IPMI inlet temperature readings, which in turn translates to a lower RTT. However, pressure relief showed conflicting results: the discrete sensors showed an increase in inlet temperature when pressure relief was introduced, thereby reducing the RTT. The CPU temperatures conformed to the discrete sensor data, indicating that containment helped increase the RTT of the servers during failure.
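One concrete reading of the RTT metric is the time from the cooling failure until an inlet temperature trace first crosses a shutdown threshold, which is why two sensor sets on the same server can report different RTTs. The sketch below illustrates this with fabricated traces; the threshold, sampling period, and temperatures are assumptions, not data from the study.

```python
# Hedged sketch: ride-through time (RTT) taken as the time from cooling
# failure until an inlet-temperature series first reaches a shutdown
# threshold. The traces and the 37 degC threshold are invented for
# illustration; they are not measurements from the study.
from typing import Sequence

def ride_through_time(temps: Sequence[float], threshold: float,
                      sample_period_s: float) -> float | None:
    """Return seconds until the first sample at or above threshold."""
    for i, t in enumerate(temps):
        if t >= threshold:
            return i * sample_period_s
    return None  # never crossed within the trace

# Fabricated example traces sampled every 5 s after a CRAH blower failure.
ipmi_trace     = [25, 26, 28, 31, 35, 38, 41]   # reads hot (recirculation)
discrete_trace = [25, 25, 26, 28, 31, 34, 37]   # slower rise at the inlet

rtt_ipmi = ride_through_time(ipmi_trace, 37.0, 5.0)
rtt_disc = ride_through_time(discrete_trace, 37.0, 5.0)
print(f"IPMI RTT: {rtt_ipmi}s, discrete RTT: {rtt_disc}s")
# With these numbers the IPMI sensors cross first, i.e. an RTT computed
# from IPMI readings underestimates the one from discrete sensors.
```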
2. Over the past few years, energy consumption by IT equipment in data centers has risen steadily, and with it the need to minimize the environmental impact of data centers by optimizing energy consumption and material use. In 2011, the Open Compute Project was started with the aim of sharing specifications and best practices for highly energy-efficient and economical data centers with the community. The first Open Compute server was the 'Freedom' server: a vanity-free, fully custom design built from a minimal number of components and deployed in a data center in Prineville, Oregon. Within the first few months of operation, considerable energy and cost savings were observed. Since then, successive generations of Open Compute servers have been introduced. Initially, the servers used for compute purposes mainly had a two-socket architecture. In 2015, the Yosemite Open Compute server was introduced, suited for higher compute capacity. Yosemite has a system-on-a-chip architecture with four CPUs per sled, providing a significant improvement in performance per watt over previous generations. This study focuses on airflow optimization in the Yosemite platform to improve its overall cooling performance. Commercially available CFD tools make it possible to thermally model these servers and predict their efficiency. A detailed server model is generated using a CFD tool and optimized to improve the airflow characteristics in the server. The thermal model of the improved design is compared to the existing design to show the impact of airflow optimization on flow rates and flow speeds, which in turn affect CPU die temperatures and cooling power consumption, and thus the overall cooling performance of the Yosemite platform. Emphasis is placed on more effective utilization of the server's fans compared to the original design and on improving airflow characteristics inside the server via improved ducting.
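One way to see why ducting and fan utilization dominate the cooling-power story is the fan affinity laws: airflow scales linearly with fan speed while fan power scales with its cube, so a duct improvement that meets the same die temperatures at lower fan speed saves power disproportionately. The operating-point numbers below are hypothetical, not results from the Yosemite CFD study.

```python
# Sketch of why better ducting pays off. By the fan affinity laws,
# fan power scales with the cube of fan speed, so delivering the same
# effective CPU airflow at lower speed cuts cooling power sharply.
# The baseline operating point is an assumption, not measured data.

BASE_RPM, BASE_POWER_W = 12000.0, 36.0   # assumed baseline fan operating point

def fan_power(rpm: float) -> float:
    """Fan affinity law: power ~ (speed ratio) cubed."""
    return BASE_POWER_W * (rpm / BASE_RPM) ** 3

# Suppose improved ducting reaches the same die temperatures at lower speeds.
for rpm in (12000, 10800, 9600):
    print(f"{rpm:5d} rpm -> {fan_power(rpm):5.1f} W")
# At 9600 rpm (80% speed) the fans need only ~51% of baseline fan power.
```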
3. In typical data centers, the servers and IT equipment are cooled by air, and almost half of the total IT power is dedicated to cooling. Hybrid cooling is a combined technology using both air and water: the main heat-generating components are cooled by water or water-based coolants, and the rest of the components are cooled by air supplied by a CRAC or CRAH unit. Retrofitting air-cooled servers with cold plates and pumps offers an advantage in the thermal management of CPUs and other high-heat-generating components. In a typical 1U server, the CPUs were retrofitted with cold plates, and the server was tested at raised coolant inlet conditions. The study showed the server can operate at maximum utilization for CPUs, DIMMs, and PCH with inlet coolant temperatures from 25 to 45 °C, following the ASHRAE guidelines. The server was also tested under pump and fan failure scenarios by reducing the number of fans and pumps. To reduce cooling power consumption at the facility level and increase air-side economizer hours, the hybrid cooled server can be operated at raised inlet air temperatures. The trade-off between energy savings at the facility level from raising the inlet air temperature and the possible increase in server fan power and component temperatures is investigated. A detailed CFD analysis with a minimum number of server fans can help find an operating range of inlet air temperatures for a hybrid cooled server. Changes to the model are carried out in 6SigmaET for an individual server and compared to the experimental data to validate the model. The results from this study can help determine room-level operating set points for data centers housing hybrid cooled server racks.
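The facility-level trade-off has a simple shape: raising the allowed inlet air temperature reduces chiller energy (more economizer hours) but increases server fan energy as the fans ramp. The toy model below shows how an intermediate set point can minimize the total; every coefficient is an assumption for illustration, not data from the study.

```python
# Hedged sketch of the facility-level trade-off the study investigates:
# higher inlet air temperature buys free-cooling hours (less chiller
# energy) but pushes server fans to higher speeds. All coefficients
# are invented for illustration, not data from the paper.

def chiller_energy_kwh(inlet_c: float) -> float:
    """Assumed: each extra degree of allowed inlet air adds economizer
    hours and trims annual chiller energy roughly linearly."""
    return max(0.0, 900.0 - 25.0 * (inlet_c - 25.0))

def fan_energy_kwh(inlet_c: float) -> float:
    """Assumed: server fans ramp with inlet temperature, and fan power
    grows with the cube of speed (fan affinity law)."""
    speed = 1.0 + 0.03 * max(0.0, inlet_c - 25.0)
    return 120.0 * speed ** 3

for inlet in range(25, 46, 5):
    total = chiller_energy_kwh(inlet) + fan_energy_kwh(inlet)
    print(f"inlet {inlet:2d} C: total {total:7.1f} kWh/yr per rack")
# Under these assumptions the total is minimized at an intermediate set
# point; the real optimum depends on climate and server fan behavior.
```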
4. Airside economizers lower the operating cost of data centers by reducing or eliminating mechanical cooling. However, they increase the risk of reliability degradation of information technology (IT) equipment due to contaminants. IT equipment manufacturers have tested equipment performance and guarantee the reliability of their equipment in environments within ISA 71.04-2013 severity level G1 and the ASHRAE recommended temperature-relative humidity (RH) envelope, and they require data center operators to meet all the specified conditions consistently before honoring warranties on equipment failure. To determine the reliability of electronic hardware in higher severity conditions, field data obtained from real data centers are required. In this study, a corrosion classification coupon experiment per ISA 71.04-2013 was performed to determine the severity level of a research data center (RDC) located in an industrial area of hot and humid Dallas. The temperature-RH excursions were analyzed through time series and weather data bin analysis using trend data for the duration of operation. After some period, a failure was recorded on two power distribution units (PDUs) located in the hot aisle. The damaged hardware and other hardware were evaluated, and a cumulative corrosion damage study was carried out. A hypothetical estimate of component end of life is provided to determine free air-cooling hours for the site. Not a single server operated with fresh air cooling failed, which suggests that evaporative/free air cooling is not detrimental to IT equipment reliability. This study, however, must be repeated in other geographical locations to determine whether the contamination effect is location dependent.
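For context on the coupon experiment: ISA 71.04-2013 grades airborne corrosivity from the film growth on copper (and, in the 2013 revision, silver) coupons over a 30-day exposure. The sketch below maps a normalized copper reading to a severity class using the commonly cited copper breakpoints; verify them against the standard before relying on them.

```python
# Sketch of the corrosion-coupon classification the study applies.
# The copper breakpoints below (angstroms of film growth per 30 days)
# are the commonly cited ISA 71.04-2013 values; treat them as an
# assumption and confirm against the standard itself.

COPPER_BREAKPOINTS = [        # (upper limit in angstroms/30 days, class)
    (300.0, "G1 (mild)"),
    (1000.0, "G2 (moderate)"),
    (2000.0, "G3 (harsh)"),
]

def severity_class(copper_angstroms_30d: float) -> str:
    """Map a normalized 30-day copper film thickness to a severity level."""
    for limit, label in COPPER_BREAKPOINTS:
        if copper_angstroms_30d < limit:
            return label
    return "GX (severe)"

# Example: a coupon exposed for 45 days grows 390 angstroms of film.
growth_30d = 390.0 * 30 / 45          # normalize to a 30-day exposure
print(severity_class(growth_30d))     # -> "G1 (mild)" at 260 angstroms
```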
5. About 40% of the energy used in data centers goes to cooling systems, and this share has grown significantly in recent years. Data center server racks receive indirect cooling from computer room air conditioning units (CRACs) or computer room air handler units (CRAHs), with chilled air delivered to the racks through a raised-floor plenum. This approach is inefficient because the server room is overcooled while the IT equipment is undercooled. As a result of the tremendous increase in IT power density and energy usage, the data center industry has begun to adopt thermally and energy-efficient single-phase liquid cooling solutions. The current study examines one of the most popular liquid cooling systems under numerous operating conditions. Under various secondary coolant conditions, the hydro-thermal performance of different liquid-to-air and liquid-to-liquid coolant distribution units (CDUs) will be assessed experimentally using multiple racks loaded with different numbers of thermal testing vehicles (TTVs), and the system response to changes in flow will be examined by disconnecting and reconnecting the TTV cooling loops in the racks through different sequences of transient events. Moreover, a set of rules and guidelines will be established for commissioning these units.
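The disconnect/reconnect transients the study plans to probe amount to flow redistribution among parallel loops fed from a shared CDU pump: removing loops shifts the operating point along the pump curve and raises flow through the loops that remain. The sketch below illustrates this with an invented quadratic pump curve and loop resistance; it is not a model of the actual CDUs or TTVs under test.

```python
# Hedged sketch of flow redistribution among parallel TTV loops on one
# CDU pump. Pump curve dP = A - B*Q_total^2 and per-loop resistance
# dP = R_LOOP*q^2 are invented for illustration only.
import math

A, B = 200.0, 2.0      # kPa shutoff head and pump-curve coefficient (assumed)
R_LOOP = 40.0          # kPa/(lpm^2) hydraulic resistance per loop (assumed)

def per_loop_flow(n_loops: int) -> float:
    """Per-loop flow (lpm) for n identical parallel loops.
    Solving A - B*(n*q)^2 = R_LOOP*q^2 for q."""
    return math.sqrt(A / (R_LOOP + B * n_loops ** 2))

for n in (8, 6, 4, 2):
    q = per_loop_flow(n)
    print(f"{n} loops connected: {q:4.2f} lpm each, {n*q:5.2f} lpm total")
# Disconnecting loops raises per-loop flow through the remaining TTVs --
# the kind of redistribution that commissioning guidelines would need
# to bound so no loop is starved or over-driven during transients.
```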