Title: Multiobjective optimization of the variability of the high-performance LINPACK solver.
Variability in the execution time of computing tasks can cause load imbalance in high-performance computing (HPC) systems. When configuring system- and application-level parameters, engineers traditionally seek configurations that maximize mean computational throughput. In an HPC setting, however, high-throughput configurations that do not account for performance variability can result in poor load balancing. To determine the effects of performance variance on computationally expensive numerical simulations, we apply multiobjective optimization to the High-Performance LINPACK solver, maximizing the mean and minimizing the standard deviation of computational throughput on the High-Performance LINPACK benchmark. We show that specific configurations of the solver can control variability at a small sacrifice in mean throughput. We also identify configurations that achieve relatively high mean throughput but exhibit high throughput variability.
Award ID(s):
1838271
PAR ID:
10294428
Author(s) / Creator(s):
Date Published:
Journal Name:
WSC '20: Proceedings of the Winter Simulation Conference
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
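
The tradeoff described in the abstract above, maximizing mean throughput while minimizing its standard deviation, is a Pareto problem. The sketch below is only an illustration of that idea on made-up benchmark numbers, not the authors' tuning pipeline: it marks the configurations for which no other configuration is both faster on average and less variable.

```python
import numpy as np

def pareto_front(mean_tput, std_tput):
    """Indices of configurations on the Pareto front: maximize mean
    throughput while minimizing its standard deviation."""
    n = len(mean_tput)
    efficient = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            # j dominates i if it is at least as good on both objectives
            # and strictly better on at least one of them.
            if (mean_tput[j] >= mean_tput[i] and std_tput[j] <= std_tput[i]
                    and (mean_tput[j] > mean_tput[i] or std_tput[j] < std_tput[i])):
                efficient[i] = False
                break
    return np.flatnonzero(efficient)

# Hypothetical HPL throughput (GFLOPS) for 20 configurations, 10 runs each.
rng = np.random.default_rng(0)
samples = rng.normal(loc=100.0, scale=5.0, size=(20, 10))
front = pareto_front(samples.mean(axis=1), samples.std(axis=1))
print("Pareto-efficient configurations:", front)
```

Configurations off this front can be discarded outright; those on it expose the mean-variance tradeoff the paper explores, including the high-mean, high-variance corner the abstract warns about.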
More Like this
  1. Although high-performance computing (HPC) systems have been scaled to meet the exponentially growing demand for scientific computing, HPC performance variability remains a major challenge in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management. In this article, we propose a new framework to predict performance distributions. The proposed framework is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We predict the HPC I/O distribution using the proposed method for the IOzone variability data. Data analysis results show that our framework generates accurate predictions and outperforms existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our prediction results can further be used for HPC system variability monitoring and optimization. This article has online supplementary materials.
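As a rough sketch of the idea behind this framework, using a stock scikit-learn Gaussian process and a post-hoc monotone projection in place of the authors' built-in monotonicity constraint (the data and kernel here are assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical I/O throughput samples for one system configuration.
rng = np.random.default_rng(1)
throughput = rng.gamma(shape=5.0, scale=50.0, size=200)

# Empirical CDF evaluated on a grid of throughput levels.
grid = np.linspace(throughput.min(), throughput.max(), 50)
ecdf = np.searchsorted(np.sort(throughput), grid, side="right") / len(throughput)

# Plain GP fit to the empirical CDF; the paper's model builds the
# monotonicity constraint into the GP itself rather than fixing it up after.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(grid.reshape(-1, 1), ecdf)
pred = gp.predict(grid.reshape(-1, 1))

# Post-hoc monotone projection: a cumulative max makes the curve
# nondecreasing, and clipping keeps it a valid CDF in [0, 1].
pred = np.clip(np.maximum.accumulate(pred), 0.0, 1.0)

# A scalar summary, e.g. the mean, follows by integrating 1 - CDF.
print("predicted mean throughput:",
      grid[0] + np.sum(1.0 - pred) * (grid[1] - grid[0]))
```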
  2. Recent advances in virtualization technologies used in cloud computing offer performance that closely approaches bare-metal levels. Combined with specialized instance types and high-speed networking services for cluster computing, cloud platforms have become a compelling option for high-performance computing (HPC). However, most current batch job schedulers in HPC systems are designed for homogeneous clusters and make decisions based on limited information about jobs and system status. Scientists typically submit computational jobs to these schedulers with a requested runtime that is often over- or under-estimated. More accurate runtime predictions can help schedulers make better decisions and reduce job turnaround times. They can also support decisions about migrating jobs to the cloud to avoid long queue wait times in HPC systems. In this study, we design neural network models to predict the runtime and resource utilization of jobs on integrated cloud and HPC systems. We developed two monitoring strategies to collect job and system resource utilization data using a workload management system and a cloud monitoring service. We evaluated our models on two Department of Energy (DOE) HPC systems and Amazon Web Services (AWS). Our results show that we can predict the runtime of a job with 31–41% mean absolute percentage error (MAPE), 14–17 seconds mean absolute error (MAE), and a 0.99 R-squared (R²) score. An MAE of less than a minute corresponds to 100% accuracy, since the requested time for batch jobs is always specified in hours and/or minutes.
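A minimal sketch of this style of runtime predictor follows; the features, architecture, and synthetic data are placeholders rather than the authors' models, which were trained on real DOE and AWS monitoring data:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical job records: requested_time_s, n_nodes, n_tasks, input_gb.
rng = np.random.default_rng(2)
X = rng.uniform([600, 1, 1, 0.1], [86400, 64, 512, 500], size=(5000, 4))
# Synthetic "true" runtime loosely tied to the request and the job scale.
y = 0.6 * X[:, 0] + 30.0 * X[:, 3] / X[:, 1] + rng.normal(0, 60, size=5000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print(f"MAPE: {mean_absolute_percentage_error(y_te, pred):.2%}")
print(f"MAE:  {mean_absolute_error(y_te, pred):.1f} s")
print(f"R^2:  {r2_score(y_te, pred):.3f}")
```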
  3. With the advent of modern-day high-throughput technologies, the bottleneck in biological discovery has shifted from the cost of doing experiments to that of analyzing results. clubber is our automated cluster-load-balancing system developed for optimizing these "big data" analyses. Its plug-and-play framework encourages re-use of existing solutions for bioinformatics problems. clubber's goals are to reduce computation times and to facilitate use of cluster computing. The first goal is achieved by automating the balance of parallel submissions across available high-performance computing (HPC) resources. Notably, the latter can be added on demand, including cloud-based resources, and/or featuring heterogeneous environments. The second goal of making HPCs user-friendly is facilitated by an interactive web interface and a RESTful API, allowing for job monitoring and result retrieval. We used clubber to speed up our pipeline for annotating molecular functionality of metagenomes. Here, we analyzed the Deepwater Horizon oil-spill study data to quantitatively show that the beach sands have not yet entirely recovered. Further, our analysis of the CAMI-challenge data revealed that microbiome taxonomic shifts do not necessarily correlate with functional shifts. These examples (21 metagenomes processed in 172 min) clearly illustrate the importance of clubber in the everyday computational biology environment.
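clubber's actual plug-and-play framework and RESTful API are not reproduced here; the toy dispatcher below only illustrates the core load-balancing idea, assigning the most expensive jobs first to whichever resource is currently least loaded. The cluster names, speed factors, and job costs are all invented:

```python
import heapq

def balance_jobs(jobs, clusters):
    """Greedy least-loaded dispatch: the most expensive jobs are assigned
    first, each to the cluster with the smallest projected load, where
    load is scaled by that cluster's relative speed."""
    heap = [(0.0, name) for name in clusters]   # (projected_load, cluster)
    heapq.heapify(heap)
    assignment = {}
    for job, cost in sorted(jobs.items(), key=lambda kv: -kv[1]):
        load, name = heapq.heappop(heap)
        assignment[job] = name
        heapq.heappush(heap, (load + cost / clusters[name], name))
    return assignment

# 21 analysis tasks of varying cost across three resources, one of them
# a cloud pool added on demand. All names and numbers are made up.
jobs = {f"metagenome_{i:02d}": 1.0 + (i % 5) for i in range(21)}
clusters = {"local_hpc": 1.0, "campus_cluster": 0.7, "cloud_pool": 1.5}
print(balance_jobs(jobs, clusters))
```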
  4. In this article, we present a four-layer distributed simulation system and its adaptation to the Material Point Method (MPM). The system is built upon a performance-portable C++ programming model targeting major high-performance computing (HPC) platforms. A key ingredient of our system is a hierarchical block-tile-cell sparse grid data structure that is distributable to an arbitrary number of Message Passing Interface (MPI) ranks. We additionally propose strategies for efficient dynamic load-balance optimization to maximize the efficiency of MPI tasks. Our simulation pipeline can easily switch among backend programming models, including OpenMP and CUDA, and can be effortlessly dispatched onto supercomputers and the cloud. Finally, we construct benchmark experiments and ablation studies on supercomputers and consumer workstations in a local network to evaluate the scalability and load-balancing criteria. We demonstrate massively parallel, highly scalable, gigascale-resolution MPM simulations of up to 1.01 billion particles in less than 323.25 seconds per frame on 8 OpenSSH-connected workstations.
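The article's hierarchical block-tile-cell structure is a C++/MPI design; the sketch below shows just the addressing idea in Python, with illustrative block and tile sizes and a deliberately naive static rank assignment in place of the paper's dynamic load balancing:

```python
# Hierarchical block-tile-cell addressing: a world-space cell coordinate
# splits into a block key (the unit distributed across MPI ranks), a tile
# offset inside the block, and a cell offset inside the tile.
BLOCK = 64  # cells per block edge (illustrative)
TILE = 8    # cells per tile edge (illustrative; must divide BLOCK)

def decompose(ix, iy, iz):
    block = (ix // BLOCK, iy // BLOCK, iz // BLOCK)
    tile = ((ix % BLOCK) // TILE, (iy % BLOCK) // TILE, (iz % BLOCK) // TILE)
    cell = (ix % TILE, iy % TILE, iz % TILE)
    return block, tile, cell

def owner_rank(block, n_ranks):
    # Naive static assignment of blocks to ranks; a real system would
    # rebalance dynamically as particles migrate between blocks.
    return hash(block) % n_ranks

block, tile, cell = decompose(1000, 12, 77)
print("block", block, "tile", tile, "cell", cell,
      "-> rank", owner_rank(block, 8))
```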
  5. Background and Objectives: The variability and biases in the real-world performance benchmarking of deep learning (DL) models for medical imaging compromise their trustworthiness for real-world deployment. The common approach of holding out a single fixed test set fails to quantify the variance in the estimation of test performance metrics. This study introduces NACHOS (Nested and Automated Cross-validation and Hyperparameter Optimization using Supercomputing) to reduce and quantify the variance of test performance metrics of DL models. Methods: NACHOS integrates Nested Cross-Validation (NCV) and Automated Hyperparameter Optimization (AHPO) within a parallelized high-performance computing (HPC) framework. NACHOS was demonstrated on a chest X-ray repository and an Optical Coherence Tomography (OCT) dataset under multiple data-partitioning schemes. Beyond performance estimation, DACHOS (Deployment with Automated Cross-validation and Hyperparameter Optimization using Supercomputing) is introduced to leverage AHPO and cross-validation to build the final model on the full dataset, improving expected deployment performance. Results: The findings underscore the importance of NCV in quantifying and reducing estimation variance, AHPO in optimizing hyperparameters consistently across test folds, and HPC in ensuring computational feasibility. Conclusions: By integrating these methodologies, NACHOS and DACHOS provide a scalable, reproducible, and trustworthy framework for DL model evaluation and deployment in medical imaging. To maximize public availability, the full open-source codebase is provided at https://github.com/thepanlab/NACHOS.
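Nested cross-validation with an inner hyperparameter search, the statistical core of NACHOS apart from its HPC parallelization, can be sketched with standard scikit-learn tools; the estimator, parameter grid, and data below are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Inner loop: automated hyperparameter optimization on each training fold.
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
)

# Outer loop: every test fold scores a model whose hyperparameters were
# tuned without seeing it, yielding a distribution of test metrics
# rather than a single held-out number.
outer = KFold(n_splits=5, shuffle=True, random_state=2)
scores = cross_val_score(inner, X, y, cv=outer)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Each outer fold scores a model tuned without seeing that fold, so the spread of `scores` is exactly the estimation variance that a single fixed test set would hide.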