skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Parallel Application Power and Performance Prediction Modeling Using Simulation
High performance computing (HPC) system runs compute-intensive parallel applications requiring large number of nodes. An HPC system consists of heterogeneous computer architecture nodes, including CPUs, GPUs, field programmable gate arrays (FPGAs), etc. Power capping is a method to improve parallel application performance subject to variable power constraints. In this paper, we propose a parallel application power and performance prediction simulator. We present prediction model to predict application power and performance for unknown power-capping values considering heterogeneous computing architecture. We develop a job scheduling simulator based on parallel discrete-event simulation engine. The simulator includes a power and performance prediction model, as well as a resource allocation model. Based on real-life measurements and trace data, we show the applicability of our proposed prediction model and simulator.  more » « less
Award ID(s):
2300124
PAR ID:
10410845
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2021 Winter Simulation Conference (WSC)
Page Range / eLocation ID:
1 to 12
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The imbalanced I/O load on large parallel file systems affects the parallel I/O performance of high-performance computing (HPC) applications. One of the main reasons for I/O imbalances is the lack of a global view of system-wide resource consumption. While approaches to address the problem already exist, the diversity of HPC workloads combined with different file striping patterns prevents widespread adoption of these approaches. In addition, load-balancing techniques should be transparent to client applications. To address these issues, we proposeTarazu, an end-to-end control plane where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages real-time load statistics for global data placement on distributed storage servers, while our design model employs trace-based optimization techniques to minimize latency for I/O load requests between clients and servers and to handle multiple striping patterns in files. We evaluate our proposed system on an experimental cluster for two common use cases: the synthetic I/O benchmark IOR and the scientific application I/O kernel HACC-I/O. We also use a discrete-time simulator with real HPC application traces from emerging workloads running on the Summit supercomputer to validate the effectiveness and scalability ofTarazuin large-scale storage environments. The results show improvements in load balancing and read performance of up to 33% and 43%, respectively, compared to the state-of-the-art. 
    more » « less
  2. The HPC industry is inexorably moving towards an era of extremely heterogeneous architectures, with more devices configured on any given HPC platform and potentially more kinds of devices, some of them highly specialized. Writing a separate code suitable for each target system for a given HPC application is not practical. The better solution is to use directive-based parallel programming models such as OpenMP. OpenMP provides a number of options for offloading a piece of code to devices like GPUs. To select the best option from such options during compilation, most modern compilers use analytical models to estimate the cost of executing the original code and the different offloading code variants. Building such an analytical model for compilers is a difficult task that necessitates a lot of effort on the part of a compiler engineer. Recently, machine learning techniques have been successfully applied to build cost models for a variety of compiler optimization problems. In this paper, we present COMPOFF, a cost model which uses the multi-layer perceptrons to statically estimates the Cost of OpenMP OFFloading. We used six different transformations on a parallel code of Wilson Dslash Operator to support GPU offloading, and we predicted their cost of execution on different GPUs using COMPOFF during compile time. Our results show that this model can predict offloading costs with a root mean squared error in prediction of less than 0.5 seconds. Our preliminary findings indicate that this work will make it much easier and faster for scientists and compiler developers to port legacy HPC applications that use OpenMP to new heterogeneous computing environments. 
    more » « less
  3. High-spatial-resolution satellite imagery enables transformational opportunities to observe, map, and document the micro-topographic transitions occurring in Arctic polygonal tundra at multiple spatial and temporal frequencies. Knowledge discovery through artificial intelligence, big imagery, and high-performance computing (HPC) resources is just starting to be realized in Arctic permafrost science. We have developed a novel high-performance image-analysis framework—Mapping Application for Arctic Permafrost Land Environment (MAPLE)—that enables the integration of operational-scale GeoAI capabilities into Arctic permafrost modeling. Interoperability across heterogeneous HPC systems and optimal usage of computational resources are key design goals of MAPLE. We systematically compared the performances of four different MAPLE workflow designs on two HPC systems. Our experimental results on resource utilization, total time to completion, and overhead of the candidate designs suggest that the design of an optimal workflow largely depends on the HPC system architecture and underlying service-unit accounting model. 
    more » « less
  4. null (Ed.)
    Deployed in 2015, Discovery is New Mexico State University’s commonly-available High-Performance Computing (HPC) cluster. The deployment of Discovery was initiated by Information and Communication Technologies (ICT) employees from the Systems Administration group who wanted to help researchers run their computations on a more powerful system than one they had sitting in their offices. Over the years, the cluster has grown 6 times, and as of March 2021 has 52 compute nodes, 1480 CPU cores, 17 Terabytes of RAM, 30 GPUs, and 1.8 Petabytes of usable storage. Discovery’s hardware is acquired using a combination of university funds, condo-model based funds, and grant funds, causing Discovery to be a heterogeneous system containing several CPU generations. This paper discusses our growth and administration experiences on this heterogeneous system, as well as our outreach and contribution to the HPC community. 
    more » « less
  5. —Exascale computing enables unprecedented, detailed and coupled scientific simulations which generate data on the order of tens of petabytes. Due to large data volumes, lossy compressors become indispensable as they enable better compression ratios and runtime performance than lossless compressors. Moreover, as (high-performance computing) HPC systems grow larger, they draw power on the scale of tens of megawatts. Data motion is expensive in time and energy. Therefore, optimizing compressor and data I/O power usage is an important step in reducing energy consumption to meet sustainable computing goals and stay within limited power budgets. In this paper, we explore efficient power consumption gains for the SZ and ZFP lossy compressors and data writing on a cloud HPC system while varying the CPU frequency, scientific data sets, and system architecture. Using this power consumption data, we construct a power model for lossy compression and present a tuning methodology that reduces energy overhead of lossy compressors and data writing on HPC systems by 14.3% on average. We apply our model and find 6.5 kJs, or 13%, of savings on average for 512GB I/O. Therefore, utilizing our model results in more energy efficient lossy data compression and I/O. 
    more » « less