NSF PAR Search | NSF Public Access Repository


MANA-2.0: A Future-Proof Design for Transparent Checkpointing of MPI at Scale

https://doi.org/10.1109/SCWS55283.2021.00019

Xu, Yao; Zhao, Zhengji; Garg, Rohan; Khetawat, Harsh; Hartman-Baker, Rebecca; Cooperman, Gene (November 2021, 2021 SC Workshops Supplementary Proceedings (SCWS))

MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency (“network-agnostic”) feature ensures that MANA-2.0 will provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes on MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work over any standard MPI running over an arbitrary network. Two widely-used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is a series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition.
Full Text Available
Improving scalability and reliability of MPI-agnostic transparent checkpointing for production workloads at NERSC

Chouhan, Prashant Singh; Khetawat, Harsh; Resnik, Neil; Jain, Twinkle; Garg, Rohan; Cooperman, Gene; Hartman-Baker, Rebecca; Zhao, Zhengji (February 2021, First International Symposium on Checkpointing for Supercomputing (SuperCheck21))

Checkpoint/restart (C/R) provides fault-tolerant computing capability, enables long-running applications, and provides scheduling flexibility for computing centers to support diverse workloads with different priorities. It is therefore vital to get transparent C/R capability working at NERSC. MANA, by Garg et al., is a transparent checkpointing tool that has been selected due to its MPI-agnostic and network-agnostic approach. However, originally written as a proof-of-concept code, MANA was not ready to use with NERSC's diverse production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. In this talk, we present ongoing work at NERSC to enable MANA for NERSC's production workloads, including fixing bugs that were exposed by the top applications at NERSC, adding new features to address system changes, and evaluating C/R overhead at scale. The lessons learned from making MANA production-ready for HPC applications will be useful for C/R tool developers, supercomputing centers and HPC end-users alike.
Full Text Available
Checkpointing SPAdes for Metagenome Assembly: Transparency versus Performance in Production

Jain, Twinkle; Wang, Jie (February 2021, First International Symposium on Checkpointing for Supercomputing (SuperCheck21))

The SPAdes assembler for metagenome assembly is a long-running application commonly used at the NERSC supercomputing site. However, NERSC, like many other sites, has a 48-hour limit on resource allocations. The solution is to chain together multiple resource allocations in a single run, using checkpoint-restart. This case study provides insights into the "pain points" in applying a well-known checkpointing package (DMTCP: Distributed MultiThreaded CheckPointing) to long-running production workloads of SPAdes. This work has exposed several bugs and limitations of DMTCP, which were fixed to support the large memory and fragmented intermediate files of SPAdes. But perhaps more interesting for other applications, this work reveals a tension between the transparency goals of DMTCP and performance concerns due to an I/O bottleneck during the checkpointing process when supporting large memory and many files. Suggestions are made for overcoming this I/O bottleneck, providing important "lessons learned" for similar applications.
Full Text Available
Transparent Checkpointing for OpenGL Applications on GPUs

Hou, David; Gan, Jun; Li, Yue; El Idrissi Yazami, Younes; Jain, Twinkle (February 2021, First International Symposium on Checkpointing for Supercomputing (SuperCheck21))

This work presents transparent checkpointing of OpenGL applications, refining the split-process technique[1] for application in GPU-based 3D graphics. The split-process technique was earlier applied to checkpointing MPI and CUDA programs, enabling reinitialization of driver libraries. The presented design targets practical, checkpoint-package-agnostic checkpointing of OpenGL applications. An early prototype is demonstrated on Autodesk Maya. Maya is a complex proprietary media-creation software suite used with large-scale rendering hardware for CGI (Computer-Generated Imagery). Transparent checkpointing of Maya provides critically needed fault tolerance, since Maya is prone to crash when artists use some of its bleeding-edge components. Artists then lose hours of work in re-creating their complex environment.
Full Text Available
Deploying Checkpoint/Restart for Production Workloads at NERSC

Zhao, Zhengji; Hartman-Baker, Rebecca (November 2020, International Conference for High Performance Computing Networking Storage and Analysis)
Full Text Available
CRAC: checkpoint-restart architecture for CUDA with streams and UVM

Jain, Twinkle; Cooperman, Gene (November 2020, International Conference for High Performance Computing Networking Storage and Analysis)
Full Text Available
Docker Container Deployment in Distributed Fog Infrastructures with Checkpoint/Restart

https://doi.org/10.1109/MobileCloud48802.2020.00016

Ahmed, Arif; Mohan, Apoorve; Cooperman, Gene; Pierre, Guillaume (August 2020, IEEE International Conference on Mobile Cloud Computing, Services, and Engineering, MobileCloud (MobileCloud'20))
Full Text Available
Towards Non-Intrusive Software Introspection and Beyond

https://doi.org/10.1109/IC2E48712.2020.00025

Mohan, Apoorve; Nadgowda, Shripad; Pipaliya, Bhautik; Varma, Sona; Suneja, Sahil; Isci, Canturk; Cooperman, Gene; Desnoyers, Peter; Krieger, Orran; Turk, Ata (April 2020, IEEE International Conference on Cloud Engineering (IC2E))
Full Text Available
Sthread: In-Vivo Model Checking of Multithreaded Programs

https://doi.org/10.22152/programming-journal.org/2020/4/13

Cooperman, Gene; Quinson, Martin (February 2020, The Art, Science, and Engineering of Programming)
Full Text Available
Job migration in HPC clusters by means of checkpoint/restart

https://doi.org/10.1007/s11227-019-02857-y

Rodríguez-Pascual, Manuel; Cao, Jiajun; Moríñigo, José A.; Cooperman, Gene; Mayo-García, Rafael (October 2019, The Journal of Supercomputing)

Full Text Available
