skip to main content

Title: Prediction for distributional outcomes in high-performance computing input/output variability

Although high-performance computing (HPC) systems have been scaled to meet the exponentially growing demand for scientific computing, HPC performance variability remains a major challenge in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management. In this article, we propose a new framework to predict performance distributions. The proposed framework is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We predict the HPC I/O distribution using the proposed method for the IOzone variability data. Data analysis results show that our framework can generate accurate predictions, and outperform existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our prediction results can further be used for HPC system variability monitoring and optimization. This article has online supplementary materials.

more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series C: Applied Statistics
Medium: X Size: p. 561-580
p. 561-580
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash consistency bugs have not been extensively studied, and the corresponding bug-finding or testing tools are lacking. In this paper, we first conduct a thorough bug study on the popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers HPC I/O library, PFS, and interactions with local file systems. The study results drive our design of a scalable testing framework, named PFSCHECK. PFSCHECK is easy to use with low performance overhead, as it can automatically generate test cases for triggering potential crash-consistency bugs, and trace essential file operations with low overhead. PFSCHECK is scalable for supporting large-scale HPC clusters, as it can exploit the parallelism to facilitate the verification of persistent storage states. 
    more » « less
  2. —Exascale computing enables unprecedented, detailed and coupled scientific simulations which generate data on the order of tens of petabytes. Due to large data volumes, lossy compressors become indispensable as they enable better compression ratios and runtime performance than lossless compressors. Moreover, as (high-performance computing) HPC systems grow larger, they draw power on the scale of tens of megawatts. Data motion is expensive in time and energy. Therefore, optimizing compressor and data I/O power usage is an important step in reducing energy consumption to meet sustainable computing goals and stay within limited power budgets. In this paper, we explore efficient power consumption gains for the SZ and ZFP lossy compressors and data writing on a cloud HPC system while varying the CPU frequency, scientific data sets, and system architecture. Using this power consumption data, we construct a power model for lossy compression and present a tuning methodology that reduces energy overhead of lossy compressors and data writing on HPC systems by 14.3% on average. We apply our model and find 6.5 kJs, or 13%, of savings on average for 512GB I/O. Therefore, utilizing our model results in more energy efficient lossy data compression and I/O. 
    more » « less
  3. Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications incur new I/O and storage requirements to HPC systems with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has grabbed the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, Metadata and DL application benchmarks. Particularly, we have utilized AlexNet and ResNet-50 models for the classification of ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides a useful documentation on how to leverage BeeGFS for the emerging DL applications. 
    more » « less
  4. null (Ed.)
    In the era of data-intensive computing, large-scale applications, in both scientific and the BigData communities, demonstrate unique I/O requirements leading to a proliferation of different storage devices and software stacks, many of which have conflicting requirements. Further, new hardware technologies and system designs create a hierarchical composition that may be ideal for computational storage operations. In this article, we investigate how to support a wide variety of conflicting I/O workloads under a single storage system. We introduce the idea of a Label , a new data representation, and, we present LABIOS: a new, distributed, Label- based I/O system. LABIOS boosts I/O performance by up to 17× via asynchronous I/O, supports heterogeneous storage resources, offers storage elasticity, and promotes in situ analytics and software defined storage support via data provisioning. LABIOS demonstrates the effectiveness of storage bridging to support the convergence of HPC and BigData workloads on a single platform. 
    more » « less
  5. Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance.In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with a GPFS storage and the latter with a Lustre parallel file system. 
    more » « less