Title: CFD Simulation and Optimization of the Cooling of Open Compute Machine Learning “Big Sur” Server
In recent years, Artificial Intelligence and Machine Learning have grown phenomenally. These workloads involve collecting and mining data and using data sets to train computers for tasks such as image and speech recognition. Machine Learning tasks require a lot of computing power to carry out numerous calculations. Therefore, most such servers are powered by Graphics Processing Units (GPUs) instead of traditional CPUs, because GPUs provide more computational throughput per dollar spent. The Open Compute Servers forum recently introduced the state-of-the-art machine learning server “Big Sur”. A Big Sur unit consists of a 4OU (OpenU) chassis housing eight NVidia Tesla M40 GPUs and two CPUs, along with SSD storage and hot-swappable fans at the rear. Airflow management is a critical requirement in the air cooling of rack-mount servers to ensure that all components, especially critical devices such as CPUs and GPUs, receive adequate flow. In addition, component locations within the chassis play a vital role in the passage of airflow and affect the overall system resistance. In this paper, a sizeable improvement in chassis ducting is targeted to counteract the effects of air diffusion at the rear of the air flow duct in the “Big Sur” Open Compute machine learning server, in which the GPUs are located directly downstream from the CPUs. A CFD simulation of the detailed server model is performed with the objective of understanding the effect of air flow bypass on GPU die temperatures and fan power consumption. The cumulative effect was studied through simulations to quantify improvements in server fan power consumption. The reduction in acoustic noise levels caused by server fans is also discussed.
Award ID(s):
1738811
PAR ID:
10065927
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2018 17th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Modern Information Technology (IT) servers are typically assumed to operate in quiescent conditions with almost zero static pressure differential between inlet and exhaust. However, when operating in a data center containment system, the IT equipment thermal status is a strong function of the non-homogeneous environment of the air space, IT utilization workloads and the overall facility cooling system design. To implement a dynamic and interfaced cooling solution, the interdependencies of variabilities between the chassis, rack and room level must be determined. In this paper, the effect of positive as well as negative static pressure differential between inlet and outlet of servers on thermal performance, fan control schemes, the direction of air flow through the servers and fan energy consumption within a server is observed at the chassis level. In this study, a web server with internal air-flow paths segregated into two separate streams, each having a dedicated fan or group of fans within the chassis, is operated over a range of static pressure differentials across the server. Experiments were conducted to observe the steady-state temperatures of CPUs and fan power consumption. Furthermore, the transient response of the server fan speed control scheme to a typical peak in IT computational workload while operating at negative pressure differentials across the server is reported. The effects of the internal air flow paths within the chassis are studied through experimental testing and simulations for flow visualization. The results indicate that at higher positive differential pressures across the server, increasing server fan speeds will have minimal impact on the cooling of the system. By contrast, at lower, negative differential pressures, server fan power becomes strongly dependent on the operating pressure differential.
More importantly, it is shown that an imbalance of flow impedances in internal airflow paths and fan control logic can initiate recirculation of exhaust air within the server. For accurate prediction of airflow in cases where a negative pressure differential exists, this study proposes that an extended fan performance curve, instead of a regular fan performance curve, be applied as the fan boundary condition for Computational Fluid Dynamics simulations.
  2. Over the past few years, energy consumption by IT equipment in data centers has risen steadily. Thus, the need to minimize the environmental impact of data centers by optimizing energy consumption and material use is increasing. In 2011, the Open Compute Project was started with the aim of sharing specifications and best practices with the community for highly energy-efficient and economical data centers. The first Open Compute Server was the ‘Freedom’ Server. It was a vanity-free, completely custom design using a minimum number of components and was deployed in a data center in Prineville, Oregon. Within the first few months of operation, considerable energy and cost savings were observed. Since then, progressive generations of Open Compute servers have been introduced. Initially, the servers used for compute purposes mainly had a two-socket architecture. In 2015, the Yosemite Open Compute Server was introduced, which was suited for higher compute capacity. Yosemite has a system-on-a-chip architecture with four CPUs per sled, providing a significant improvement in performance per watt over the previous generations. This study mainly focuses on air flow optimization in the Yosemite platform to improve its overall cooling performance. Commercially available CFD tools have made it possible to perform thermal modeling of these servers and predict their efficiency. A detailed server model is generated using a CFD tool and optimized to improve the air flow characteristics in the server. The thermal model of the improved design is compared to the existing design to show the impact of air flow optimization on flow rates and flow speeds, which in turn affect CPU die temperatures and cooling power consumption and thus the overall cooling performance of the Yosemite platform.
Emphasis is placed on more effective utilization of the server fans compared to the original design and on improving air flow characteristics inside the server via improved ducting.
  3. In typical data centers, the servers and IT equipment are cooled by air, and almost half of total IT power is dedicated to cooling. Hybrid cooling is a combined cooling technology using both air and water, where the main heat-generating components are cooled by water or water-based coolants and the rest of the components are cooled by air supplied by a CRAC or CRAH. Retrofitting air-cooled servers with cold plates and pumps offers an advantage in the thermal management of CPUs and other high heat-generating components. In a typical 1U server, the CPUs were retrofitted with cold plates and the server was tested with raised coolant inlet conditions. The study showed the server can operate with maximum utilization for CPUs, DIMMs, and PCH for inlet coolant temperatures from 25–45 °C following the ASHRAE guidelines. The server was also tested for failure scenarios of the pumps and fans with reduced numbers of fans and pumps. To reduce cooling power consumption at the facility level and increase air-side economizer hours, the hybrid cooled server can be operated at raised inlet air temperatures. The trade-off between energy savings at the facility level due to raising the inlet air temperatures and the possible increase in server fan power and component temperatures is investigated. A detailed CFD analysis with a minimum number of server fans can provide a way to find an operating range of inlet air temperature for a hybrid cooled server. Changes in the model are carried out in 6SigmaET for an individual server and compared to the experimental data to validate the model. The results from this study can be helpful in determining the room-level operating set points for data centers housing hybrid cooled server racks.
  4. In the United States, data center facilities consume 2% of the total electricity produced, and up to 40% of that energy is used by the cooling infrastructure to cool the heat-generating components inside the facility. With recent technological advancement, power consumption has trended upward, and the resulting increase in carbon footprint is a growing concern in the industry. In air cooling, the high heat-dissipating components inside a server/hardware must receive sufficient airflow for efficient cooling, and ducting is provided to direct the air toward these components. In this study, the duct in an air-cooled server is optimized and vanes are added to improve the airflow; side vents are installed on the sides of the server chassis before the duct to bypass some of the cool air entering from the front, where the hard drives are located. Experiments were conducted on the Cisco C220 air-cooled server with the new duct and the bypass. The effects of the new duct and bypass are quantified by comparing the temperatures of components such as the Central Processing Units (CPUs) and Platform Controller Hub (PCH), and the savings in total fan power consumption. A 7.5 °C drop in temperature is observed, and savings of up to 30% in fan power consumption can be achieved with the improved design compared with the standard server.
  5. The role of data centers in modern life has expanded rapidly over the past decades. This expansion has resulted in a significant increase in the share of data centers in the world's total energy consumption. Thus, reliability and energy efficiency have become common concerns in data centers. Information technology equipment (ITE) and cooling infrastructure are the largest power consumers in data centers. The cooling power in a data center depends on the amount of heat dissipated by the ITE. Therefore, the thermal design of the ITE impacts not only the ITE power but also the infrastructure power, and has a significant role in the overall efficiency of data centers. This paper studies the impact of fan location and airflow balancing on the thermal performance and power of a server. A detailed computational fluid dynamics (CFD) model of the server is built, calibrated and validated using experimental test results. Next, the impacts of moving the fans to the rear side of the chassis on the flow rate and temperature of components are investigated. Special attention is given to controlling airflow through the power supplies.