

Title: An HPC Practitioner’s Workbench for Formal Refinement Checking
HPC practitioners make use of techniques, such as parallelism and sparse data structures, that are difficult to reason about and debug. Here we explore the role of data refinement, a correct-by-construction approach, in verifying HPC applications via bounded model checking. We show how single program, multiple data (SPMD) parallelism can be modeled in Alloy, a declarative specification language, and describe common issues that arise when performing scope-complete refinement checks in this context.
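The paper itself works in Alloy; as a language-agnostic illustration of the idea, the following is a minimal Python sketch (all names hypothetical, not from the paper) of a scope-complete refinement check: an abstract specification (a sequential reduction) is compared against a concrete SPMD-style model (per-process partial reductions combined at the end) over every input within a small bound, in the spirit of bounded model checking.

```python
from itertools import product

def spec_sum(xs):
    # Abstract specification: a plain sequential reduction.
    return sum(xs)

def spmd_sum(xs, nprocs=2):
    # Concrete SPMD-style model: each "process" reduces its own
    # strided block, then the partial results are combined
    # (analogous to an allreduce).
    blocks = [xs[r::nprocs] for r in range(nprocs)]
    partials = [sum(b) for b in blocks]
    return sum(partials)

def check_refinement(scope_len=4, scope_vals=range(-2, 3)):
    # Scope-complete check: enumerate every input up to the bound
    # and require the concrete model to agree with the spec.
    for n in range(scope_len + 1):
        for xs in product(scope_vals, repeat=n):
            assert spmd_sum(list(xs)) == spec_sum(list(xs)), xs
    return True
```

Like an Alloy `check` command, this gives no guarantee beyond the stated scope; the value of the approach is that counterexamples, when they exist, tend to appear at small scopes.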
Award ID(s):
2124205
PAR ID:
10434359
Editor(s):
Mendis, Charith; Rauchwerger, Lawrence
Journal Name:
Lecture notes in computer science
Volume:
13829
ISSN:
0302-9743
Page Range / eLocation ID:
5
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The study of clouds, i.e., where they occur and what characteristics they have, plays a key role in the understanding of climate change. Clustering is a common machine learning technique used in atmospheric science to classify cloud types. Many parallelism techniques, e.g., MPI, OpenMP, and Spark, can achieve efficient and scalable clustering of large-scale satellite observation data. To understand their differences, this paper studies and compares three approaches to parallel clustering of satellite observation data. Benchmarking experiments with k-means clustering are conducted with three parallelism techniques, namely OpenMP, OpenMP+MPI, and Spark, on an HPC cluster using up to 16 nodes. 
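The assignment step is what makes k-means attractive for the parallel frameworks compared above: each point's nearest centroid can be computed independently. The following is a minimal Python sketch (all names hypothetical, not from the paper) of that data-parallel structure; OpenMP, MPI, and Spark each exploit the same decomposition.

```python
from concurrent.futures import ThreadPoolExecutor

def assign(points, centroids):
    # Map each point to the index of its nearest centroid.
    def nearest(p):
        return min(range(len(centroids)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(p, centroids[j])))
    # The assignment of points is embarrassingly parallel; here a
    # thread pool stands in for OpenMP threads / MPI ranks / Spark tasks.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(nearest, points))

def update(points, labels, k):
    # Recompute each centroid as the mean of its assigned points
    # (clusters that lose all members are dropped, for simplicity).
    cents = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            cents.append(tuple(sum(c) / len(members)
                               for c in zip(*members)))
    return cents

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        labels = assign(points, centroids)
        centroids = update(points, labels, len(centroids))
    return centroids, labels
```

The update step requires a reduction over each cluster, which is where the frameworks differ most: OpenMP reduces in shared memory, MPI uses an allreduce across nodes, and Spark shuffles by cluster key.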
  2.
    Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash consistency bugs have not been extensively studied, and the corresponding bug-finding or testing tools are lacking. In this paper, we first conduct a thorough bug study on popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers the HPC I/O library, the PFS, and interactions with local file systems. The study results drive our design of a scalable testing framework, named PFSCHECK. PFSCHECK is easy to use with low performance overhead, as it can automatically generate test cases for triggering potential crash-consistency bugs and trace essential file operations efficiently. PFSCHECK is scalable for supporting large-scale HPC clusters, as it can exploit parallelism to facilitate the verification of persistent storage states. 
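The core idea behind this style of crash-consistency testing can be shown in miniature. The following Python sketch (all names hypothetical; not PFSCHECK's actual design, which also handles reordering and distributed state) enumerates the states a crash could leave behind as prefixes of an operation log, replays each one, and checks an application-level invariant for an atomic update done via write-to-temp plus rename.

```python
def crash_states(ops):
    # A crash may persist any prefix of the operation log
    # (reordering by the storage stack is ignored for simplicity).
    for i in range(len(ops) + 1):
        yield ops[:i]

def replay(persisted):
    # Rebuild the visible file-system state from a persisted prefix.
    fs = {"f": "old"}  # initial on-disk state before the update
    for op, path, arg in persisted:
        if op == "write":
            fs[path] = arg
        elif op == "rename":
            fs[arg] = fs.pop(path)  # arg holds the new name
    return fs

def consistent(fs):
    # Invariant for atomic update: the visible file never holds
    # a torn or partial value, only the old or the new content.
    return fs.get("f") in ("old", "new")

# Update "f" atomically: write the new content to a temp file,
# then rename it over the original.
ops = [("write", "f.tmp", "new"), ("rename", "f.tmp", "f")]
assert all(consistent(replay(s)) for s in crash_states(ops))
```

A non-atomic version, e.g. `[("write", "f", "partial"), ("write", "f", "new")]`, fails the same check at the intermediate crash state, which is exactly the kind of bug such a tool is built to surface.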
  3. Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many HPC workflows. This pattern introduces high I/O overheads and results in increased storage space utilization, especially for workflows that need to capture the evolution of data structures with high frequency as checkpoints. In this context, many applications, such as graph pattern matching, perform sparse updates to large data structures between checkpoints. For these applications, incremental checkpointing techniques that save only the differences from one checkpoint to another can dramatically reduce the checkpoint sizes, I/O bottlenecks, and storage space utilization. However, such techniques are not without challenges: it is non-trivial to transparently determine what data has changed since a previous checkpoint and assemble the differences in a compact fashion that does not result in excessive metadata. State-of-the-art data reduction techniques (e.g., compression and de-duplication) have significant limitations when applied to modern HPC applications that leverage GPUs: they are slow at detecting the differences, generate a large amount of metadata to keep track of them, and ignore crucial spatiotemporal redundancy in checkpoint data. This paper addresses these challenges by proposing a Merkle tree-based incremental checkpointing method to exploit GPUs’ high memory bandwidth and massive parallelism. Experimental results at scale show a significant reduction of the I/O overhead and space utilization of checkpointing compared with state-of-the-art incremental checkpointing and compression techniques. 
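The Merkle-tree idea can be illustrated compactly. The following is a hedged Python sketch (all names hypothetical; the paper's method runs the hashing on GPUs, while this runs sequentially on the CPU): checkpoint data is split into chunks, each chunk is hashed into a leaf, leaves are pairwise-hashed up to a root, and an incremental checkpoint is formed by comparing leaf hashes against the previous checkpoint and keeping only the chunks that changed.

```python
import hashlib

def leaf_hashes(data, chunk=4):
    # Hash each fixed-size chunk of the checkpoint data.
    return [hashlib.sha256(data[i:i + chunk]).digest()
            for i in range(0, len(data), chunk)]

def merkle_root(hashes):
    # Pairwise-hash each level up to a single root; comparing roots
    # gives a cheap "anything changed?" test before diffing leaves.
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

def changed_chunks(old, new, chunk=4):
    # Incremental checkpoint: keep only the chunk indices whose leaf
    # hash differs from the previous checkpoint (plus any new tail).
    oh, nh = leaf_hashes(old, chunk), leaf_hashes(new, chunk)
    return ([i for i, (a, b) in enumerate(zip(nh, oh)) if a != b]
            + list(range(len(oh), len(nh))))
```

For sparse updates, only a few leaf hashes change, so the saved diff and its metadata stay small; because every chunk is hashed independently, the leaf level maps naturally onto the GPU parallelism the paper exploits.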
  4. Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons™ Association. We present the results from the first submission round, including a diverse set of some of the world’s largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization and communication scheduling, enabling overall >10× (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system’s memory hierarchy and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O and network behaviour to parameterize extended roofline performance models in future rounds. 
  5. Developing a workforce with applied skills in the high-performance computing (HPC) field has highlighted a gap between the experience offered by current academic programs, especially at the entry level, and the capabilities needed for scientific work. To increase student participation, HackHPC, a collaboration between the Science Gateways Community Institute, Omnibond Systems, the Texas Advanced Computing Center, and the University of Tartu, was formed to address that gap through events known as hackathons. Hackathons are time-bounded events during which participants form teams to work on a project that is of interest to them. Hosting a hackathon with the desired long-term outcomes involves a number of crucial decisions related to preparing, running, and following up on an event. In this paper, we report on the development and refinement of the "HackHPC Model," which includes the methodologies, participants, procedures, and refined implementation practices used to plan and host hackathon events that foster workforce development in HPC. 