skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Search for: All records

Award ID contains: 1740218

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency (“network-agnostic”) feature ensures that MANA-2.0 will provide a viable, efficient mechanism for trans-parently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using Cray MPICH library over the Cray GNI network, but it is designed to work over any standard MPI running over an arbitrary network. Two widely-used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is a series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition. 
    more » « less
  2. Checkpoint/restart (C/R) provides fault-tolerant computing capability, enables long running applications, and provides scheduling flexibility for computing centers to support diverse workloads with different priority. It is therefore vital to get transparent C/R capability working at NERSC. MANA, by Garg et. al., is a transparent checkpointing tool that has been selected due to its MPI-agnostic and network-agnostic approach. However, originally written as a proof-of-concept code, MANA was not ready to use with NERSC's diverse production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. In this talk, we present ongoing work at NERSC to enable MANA for NERSC's production workloads, including fixing bugs that were exposed by the top applications at NERSC, adding new features to address system changes, evaluating C/R overhead at scale, etc. The lessons learned from making MANA production-ready for HPC applications will be useful for C/R tool developers, supercomputing centers and HPC end-users alike. 
    more » « less
  3. The SPAdes assembler for metagenome assembly is a long-running application commonly used at the NERSC supercomputing site. However, NERSC, like many other sites, has a 48-hour limit on resource allocations. The solution is to chain together multiple resource allocations in a single run, using checkpoint-restart. This case study provides insights into the "pain points" in applying a well-known checkpointing package (DMTCP: Distributed MultiThreaded CheckPointing) to long-running production workloads of SPAdes. This work has exposed several bugs and limitations of DMTCP, which were fixed to support the large memory and fragmented intermediate files of SPAdes. But perhaps more interesting for other applications, this work reveals a tension between the transparency goals of DMTCP and performance concerns due to an I/O bottleneck during the checkpointing process when supporting large memory and many files. Suggestions are made for overcoming this I/O bottleneck, which provides important "lessons learned" for similar applications. 
    more » « less
  4. This work presents transparent checkpointing of OpenGL applications, refining the split-process technique[1] for application in GPU-based 3D graphics. The split-process technique was earlier applied to checkpointing MPI and CUDA programs, enabling reinitialization of driver libraries. The presented design targets practical, checkpoint-package agnostic checkpointing of OpenGL applications. An early prototype is demonstrated on Autodesk Maya. Maya is a complex proprietary media-creation software suite used with large-scale rendering hardware for CGI (Computer-Generated Animation). Transparent checkpointing of Maya provides critically-needed fault tolerance, since Maya is prone to crash when artists use some of its bleeding-edge components. Artists then lose hours of work in re-creating their complex environment. 
    more » « less
  5. null (Ed.)
  6. null (Ed.)
  7. null (Ed.)
  8. null (Ed.)
  9. null (Ed.)