

Title: Application kernels: HPC resources performance monitoring and variance analysis
Award ID(s):
1445806
PAR ID:
10403089
Author(s) / Creator(s):
Date Published:
Journal Name:
Concurrency and Computation: Practice and Experience
Volume:
27
Issue:
17
ISSN:
1532-0626
Page Range / eLocation ID:
5238 to 5260
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions that are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
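The maximum-interval idea described in the abstract above can be illustrated with a short, hypothetical mpi4py sketch: each rank times only its own workload interval, and a single MAX reduction per step recovers the slowest interval, so no per-rank runtime distributions need to be collected or stored. The function do_workload_interval and the step structure are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of measuring the maximum distributed workload
# interval with MPI. Each rank contributes one scalar per step, so the
# measurement overhead stays small even at large process counts.
from mpi4py import MPI
import time

def do_workload_interval(rank):
    # Placeholder for one unit of the application's distributed work;
    # the sleep merely simulates rank-to-rank runtime variation.
    time.sleep(0.01 * (1 + rank % 3))

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

t0 = MPI.Wtime()
do_workload_interval(rank)
local_interval = MPI.Wtime() - t0

# The MAX reduction models the cost seen at the next synchronization
# point: progress is gated by the slowest rank's interval.
max_interval = comm.allreduce(local_interval, op=MPI.MAX)

if rank == 0:
    print(f"slowest workload interval this step: {max_interval:.4f} s")
```

Run under mpirun (for example, `mpirun -n 64 python max_interval.py`), this records one maximum per step; repeating it over many steps yields the sequence of interval maxima that the abstract's modeling approach reasons about.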
  2. High-performance computing (HPC) resources are used in a wide range of scientific and engineering calculations. These resources have high initial and running costs, so their optimal performance is crucial. There are a number of strategies to ensure that optimal state. One of them is continuous performance monitoring, in which a set of applications with fixed input parameters is executed regularly to identify performance issues proactively. Some sites hesitate to use such a strategy because it takes CPU cycles away from actual users. The goal of this work is to identify node availability, both size- and time-wise, on busy HPC systems. Such availability spots can be used to schedule test jobs so that user impact is minimized. Two systems were analyzed: a small one (118 nodes, Center for Computational Research, University at Buffalo) and a large one (1,160 nodes, Texas Advanced Computing Center). It was found that even for days with 90% utilization and above, there are plenty of opportunities for test jobs. For example, on the small cluster, 8 nodes were available for 30 minutes an average of 2.3 hours throughout the day; that is, for 9.6% of the day the scheduler had an opportunity to schedule such a job. On the large system, 32 nodes were available for 30 minutes an average of 9.2 hours a day (or 38% of the day). Thus, there is space for test jobs, but it is not evident that the scheduler can benefit from it, and a proper strategy must be used, for example, lowering test-job priorities.
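The availability analysis in the abstract above amounts to scanning a node-idleness time series for windows that are both wide enough (nodes) and long enough (minutes) to host a test job. The sketch below is a hypothetical illustration of that scan; the input format, sampling interval, and thresholds are assumptions, not the study's actual tooling.

```python
# Hypothetical sketch: find slots where at least `nodes_needed` nodes stay
# idle for at least `minutes_needed` consecutive minutes, i.e. windows in
# which a test job could run without displacing user jobs.
def find_test_job_slots(idle_nodes, nodes_needed=8, minutes_needed=30):
    """idle_nodes: idle-node counts sampled once per minute."""
    slots, start = [], None
    for minute, idle in enumerate(idle_nodes):
        if idle >= nodes_needed:
            if start is None:
                start = minute
        else:
            if start is not None and minute - start >= minutes_needed:
                slots.append((start, minute))
            start = None
    if start is not None and len(idle_nodes) - start >= minutes_needed:
        slots.append((start, len(idle_nodes)))
    return slots

# Example: a day (1,440 one-minute samples) with one mid-day idle window.
day = [2] * 300 + [10] * 180 + [3] * 960
print(find_test_job_slots(day))   # -> [(300, 480)], i.e. a 3-hour slot
```

Summing the lengths of the returned slots over a day gives the kind of availability figure quoted in the abstract (for example, 2.3 hours of 8-node, 30-minute opportunities per day).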
  3.
    This poster presents an HPC application workflow system whose goal is to provide verifiably reproducible HPC application performance. The system combines existing container, experiment, and data management techniques with HPC performance models, allowing it both to maximize performance reproducibility and to inform users when application performance deviates from what should be expected, even when running at scales or for durations at which the application has never previously run.
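The deviation reporting mentioned in the poster abstract can be reduced to a simple check: compare an observed runtime against a performance-model prediction and flag runs that fall outside a tolerance band. The sketch below is an assumption-laden illustration; the poster does not describe its actual models or thresholds.

```python
# Hypothetical sketch of flagging a run whose performance deviates from a
# model prediction by more than a relative tolerance.
def check_performance(observed_seconds, predicted_seconds, tolerance=0.10):
    """Return (within_tolerance, relative_deviation)."""
    deviation = (observed_seconds - predicted_seconds) / predicted_seconds
    return abs(deviation) <= tolerance, deviation

ok, dev = check_performance(observed_seconds=140.0, predicted_seconds=120.0)
print(ok, f"{dev:+.1%}")   # False +16.7%, i.e. the run should be flagged
```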