Title: MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization
Award ID(s):
1642385 1642410
PAR ID:
10078534
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE Transactions on Knowledge and Data Engineering
Volume:
30
Issue:
3
ISSN:
1041-4347
Page Range / eLocation ID:
544 to 558
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster. 
  3. Communication collectives are at the heart of distributed-memory parallel algorithms and the Message Passing Interface. In parallel computing courses, students can learn about collectives not only to utilize them as building blocks to implement other algorithms, but also as exemplars for designing and analyzing efficient algorithms. We develop a visualization tool to help students understand different algorithms for collective operations as well as evaluate and analyze the algorithms' efficiencies. Our implementation is written in C++ with OpenMP and uses the Thread Safe Graphics Library. We simulate distributed-memory message passing to implement the algorithms, and the threads concurrently illustrate their local memories and message passing using a shared canvas. Our tool includes visualizations of different algorithms for Scatter, Gather, ReduceScatter, AllGather, Broadcast, Reduce, AllReduce, and AlltoAll. 