- NSF-PAR ID:
- 10192062
- Date Published:
- Journal Name:
- EuroMPI'18: Proceedings of the 25th European MPI Users' Group Meeting
- Volume:
- 25
- Issue:
- 11
- Page Range / eLocation ID:
- 1 to 11
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster.more » « less
-
MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency (“network-agnostic”) feature ensures that MANA-2.0 will provide a viable, efficient mechanism for trans-parently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using Cray MPICH library over the Cray GNI network, but it is designed to work over any standard MPI running over an arbitrary network. Two widely-used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is a series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition.more » « less
-
Abstract The electromagnetic waves in the Schumann resonance (SR) frequency range (<100 Hz) radiated by natural “lightning antennas” excite the Earth‐ionosphere cavity confined between the Earth's surface and the lower ionosphere. The peak frequencies of SR are known to vary with source‐observer distance (SOD), while the daily frequency range (DFR:
f max −f min) is also indicative of the average size of thunderstorm regions. This paper provides observational evidence for these relationships based on SR frequency observations of the vertical electric (EZ) field component at Nagycenk (NCK), Hungary in Central Europe from the period 1994–2015. Variations of the peak frequencies are considered on the annual, seasonal and diurnal time scales as well as during a specific event when squall‐line formation of lightning activity in South America moves toward NCK. DFR is studied in relation to the El Niño Southern Oscillation (ENSO). Increasing area of lightning activity in mid‐high Northern hemisphere latitudes has been identified by DFR variations during the transition from warm to cold episodes of the ENSO in 1998 and 2010. The extension of the lightning area is considered as a consequence of energy released in the tropics and exported to higher latitudes with some months of delay from the end of the El Niño episodes. The frequency variations are interpreted via model calculations and supported with satellite‐based optical lightning observations (Optical Transient Detector, Geostationary Lightning Mapper). The described variations of SR peak frequencies and DFR yield information on the global/regional lightning dynamics and on this basis they have important application to climate issues as well. -
Summary Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC systems, providing efficient fault‐tolerance mechanisms for this class of applications is paramount. The global‐restart model has been proposed to decrease the time of failure recovery in Bulk‐Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability have not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present ER
einit , an implementation of the global‐restart model that addresses these problems. Our key idea and optimization is the co‐design of basic fault‐tolerance mechanisms such as failure detection, notification, and recovery between MPI and the resource manager in contrast to current approaches on which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes. -
In this paper, we introduce a deep spiking delayed feedback reservoir (DFR) model to combine DFR with spiking neuros: DFRs are a new type of recurrent neural networks (RNNs) that are able to capture the temporal correlations in time series while spiking neurons are energy-efficient and biologically plausible neurons models. The introduced deep spiking DFR model is energy-efficient and has the capability of analyzing time series signals. The corresponding field programmable gate arrays (FPGA)-based hardware implementation of such deep spiking DFR model is introduced and the underlying energy-efficiency and recourse utilization are evaluated. Various spike encoding schemes are explored and the optimal spike encoding scheme to analyze the time series has been identified. To be specific, we evaluate the performance of the introduced model using the spectrum occupancy time series data in MIMO-OFDM based cognitive radio (CR) in dynamic spectrum sharing (DSS) networks. In a MIMO-OFDM DSS system, available spectrum is very scarce and efficient utilization of spectrum is very essential. To improve the spectrum efficiency, the first step is to identify the frequency bands that are not utilized by the existing users so that a secondary user (SU) can use them for transmission. Due to the channel correlation as well as users' activities, there is a significant temporal correlation in the spectrum occupancy behavior of the frequency bands in different time slots. The introduced deep spiking DFR model is used to capture the temporal correlation of the spectrum occupancy time series and predict the idle/busy subcarriers in future time slots for potential spectrum access. Evaluation results suggest that our introduced model achieves higher area under curve (AUC) in the receiver operating characteristic (ROC) curve compared with the traditional energy detection-based strategies and the learning-based support vector machines (SVMs).more » « less