Title: Energy-efficient localised rollback via data flow analysis and frequency scaling
Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpointing and a global rollback to recover. In recent years, techniques reducing the number of rolling-back processes have been implemented via message logging. However, log-based approaches have weaknesses, such as depending on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data flow rollback (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes and on the well-known concept of data flow graphs. We demonstrate the benefits of DFR for an MPI stencil code by localising rollback, and then reduce energy consumption by 10-12% on idling nodes via frequency scaling. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as n² for a process count n.
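The 10-12% reduction in energy consumption on idling nodes is obtained via frequency scaling (DVFS): ranks not involved in the localised rollback have little work to do and can run their cores at a lower frequency until the affected ranks catch up. As a rough illustration only, and not the mechanism used in the paper, the C sketch below lowers and later restores a core's frequency through the Linux cpufreq sysfs interface, assuming the "userspace" governor is available and the process may write these files.

    /*
     * Illustrative sketch (not the paper's mechanism): lower the CPU
     * frequency of the core an idling rank is pinned to during a localised
     * rollback, then restore it once the rolled-back ranks have caught up.
     */
    #include <stdio.h>

    /* Write a value into a cpufreq sysfs file for the given core. */
    static int write_cpufreq(int core, const char *file, const char *value)
    {
        char path[256];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/%s", core, file);
        FILE *f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%s\n", value);
        fclose(f);
        return 0;
    }

    int main(void)
    {
        int core = 0;  /* core this MPI rank is pinned to */

        /* Drop to a low frequency (in kHz) while other ranks redo work. */
        write_cpufreq(core, "scaling_governor", "userspace");
        write_cpufreq(core, "scaling_setspeed", "1200000");

        /* ... wait until the rolled-back neighbours have caught up ... */

        /* Restore a high frequency before resuming computation. */
        write_cpufreq(core, "scaling_setspeed", "2400000");
        return 0;
    }

In practice the same effect would typically be achieved through a power-management API or the resource manager rather than by writing sysfs files directly.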
Award ID(s): 1838271 1939076
NSF-PAR ID: 10192062
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: EuroMPI'18: Proceedings of the 25th European MPI Users' Group Meeting
Volume: 25
Issue: 11
Page Range / eLocation ID: 1 to 11
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Summary

    The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many-core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm-based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault-tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high-order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a "limited essentially nonoscillatory" (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using fault-tolerant MPI with user-level failure mitigation to recover from runtime failures and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10× faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.
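    The fault-tolerant MPI referred to above is user-level failure mitigation (ULFM). As an illustration of the generic ULFM recovery pattern only, and not of the Uintah implementation itself, the C sketch below revokes and shrinks a communicator after a rank failure so that the surviving ranks can continue; in the approach described above, they would then rebuild the lost patches by LENO interpolation from the coarser AMR level.

        /*
         * Illustrative ULFM recovery pattern (not the Uintah implementation):
         * detect a failed rank, revoke the communicator so every survivor sees
         * the failure, shrink to a survivors-only communicator, and continue.
         */
        #include <mpi.h>
        #include <mpi-ext.h>   /* ULFM extensions (MPIX_*) */
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            MPI_Init(&argc, &argv);
            MPI_Comm comm = MPI_COMM_WORLD;
            MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

            /* ... time-stepping loop; a collective fails if a rank has died ... */
            int rc = MPI_Barrier(comm);
            if (rc != MPI_SUCCESS) {
                MPIX_Comm_revoke(comm);          /* propagate the failure */
                MPI_Comm survivors;
                MPIX_Comm_shrink(comm, &survivors);

                int rank, size;
                MPI_Comm_rank(survivors, &rank);
                MPI_Comm_size(survivors, &size);
                if (rank == 0)
                    printf("continuing on %d surviving ranks\n", size);

                /* Lost patches would be reconstructed here (e.g. by LENO
                 * interpolation from the coarse AMR level) before resuming. */
                comm = survivors;
            }

            MPI_Finalize();
            return 0;
        }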

     
  2. MANA-2.0 is a scalable, future-proof design for transparent checkpointing of MPI-based computations. Its network transparency ("network-agnostic") feature ensures that MANA-2.0 will provide a viable, efficient mechanism for transparently checkpointing MPI applications on current and future supercomputers. MANA-2.0 is an enhancement of previous work, the original MANA, which interposes MPI calls, and is a work in progress intended for production deployment. MANA-2.0 implements a series of new algorithms and features that improve MANA's scalability and reliability, enabling transparent checkpoint-restart over thousands of MPI processes. MANA-2.0 is being tested on today's Cori supercomputer at NERSC using the Cray MPICH library over the Cray GNI network, but it is designed to work over any standard MPI running over an arbitrary network. Two widely used HPC applications were selected to demonstrate the enhanced features of MANA-2.0: GROMACS, a molecular dynamics simulation code with frequent point-to-point communication, and VASP, a materials science code with frequent MPI collective communication. Perhaps the most important lesson to be learned from MANA-2.0 is a series of algorithms and data structures for library-based transformations that enable MPI-based computations over MANA-2.0 to reliably survive the checkpoint-restart transition.
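    MANA-2.0 relies on interposing MPI calls. As a minimal illustration of MPI call interposition in general (MANA-2.0's real wrappers live in its split-process upper half and also virtualise communicators, requests, and other MPI objects), the C sketch below uses the standard PMPI profiling interface to intercept MPI_Send, record the call, and forward it to the underlying implementation.

        /*
         * Illustrative MPI call interposition via the standard PMPI profiling
         * interface: this wrapper is linked ahead of the MPI library, records
         * the call, then forwards it to the real implementation. MANA-2.0's
         * actual wrappers do considerably more.
         */
        #include <mpi.h>
        #include <stdio.h>

        int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                     int dest, int tag, MPI_Comm comm)
        {
            /* A checkpointing layer could log or translate the call here. */
            fprintf(stderr, "[wrapper] MPI_Send of %d elements to rank %d\n",
                    count, dest);
            return PMPI_Send(buf, count, datatype, dest, tag, comm);
        }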
  3.
    The advent of Persistent Memory (PM) devices enables systems to actively persist information at low costs, including program state traditionally in volatile memory. However, this trend poses a reliability challenge in which multiple classes of soft faults that go away after restart in traditional systems turn into hard (recurring) faults in PM systems. In this paper, we first characterize this rising problem with an empirical study of 28 real-world bugs. We analyze how they cause hard faults in PM systems. We then propose Arthas, a tool to effectively recover PM systems from hard faults. Arthas checkpoints PM states via fine-grained versioning and uses program slicing of fault instructions to revert problematic PM states to good versions. We evaluate Arthas on 12 real-world hard faults from five large PM systems. Arthas successfully recovers the systems for all cases while discarding 10× less data on average compared to state-of-the-art checkpoint-rollback solutions. 
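    Arthas checkpoints PM state through fine-grained versioning. The C sketch below illustrates only that general idea: each update to a persistent object first appends the object's previous value to a version log, so a recovery step can later revert the object to a known-good version. The names (persist, version_log, versioned_update, revert) are hypothetical and this is not Arthas's code; real PM code would map actual persistent memory (for example with libpmem) and add failure-atomicity checks, and Arthas additionally uses program slicing to decide which states to revert.

        /*
         * Hypothetical sketch of fine-grained versioning for persistent memory
         * (not Arthas's code): each update first appends the old value to a
         * version log so the object can later be reverted to a good version.
         */
        #include <immintrin.h>   /* _mm_clflush, _mm_sfence */
        #include <stddef.h>
        #include <stdint.h>

        /* Flush the cache lines covering [addr, addr+len) and fence. */
        static void persist(const void *addr, size_t len)
        {
            uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
            for (; p < (uintptr_t)addr + len; p += 64)
                _mm_clflush((const void *)p);
            _mm_sfence();
        }

        #define LOG_CAP 1024
        struct version_log {
            uint64_t old_value[LOG_CAP];
            size_t   len;
        };

        /* Update a persistent counter, recording its previous version first. */
        static void versioned_update(uint64_t *pm_obj, uint64_t new_value,
                                     struct version_log *log)
        {
            log->old_value[log->len++] = *pm_obj;   /* remember old version */
            persist(log, sizeof(*log));
            *pm_obj = new_value;                    /* apply the update */
            persist(pm_obj, sizeof(*pm_obj));
        }

        /* Recovery: revert the object to the version recorded n entries back. */
        static void revert(uint64_t *pm_obj, struct version_log *log, size_t n)
        {
            if (n > 0 && n <= log->len) {
                *pm_obj = log->old_value[log->len - n];
                persist(pm_obj, sizeof(*pm_obj));
            }
        }

        int main(void)
        {
            static uint64_t pm_counter;      /* stand-in for a PM-resident object */
            static struct version_log log;

            versioned_update(&pm_counter, 42, &log);
            versioned_update(&pm_counter, 99, &log);
            revert(&pm_counter, &log, 1);    /* back to 42 */
            return 0;
        }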
  4. Abstract. Global water models (GWMs) simulate the terrestrial water cycle on the global scale and are used to assess the impacts of climate change on freshwater systems. GWMs are developed within different modelling frameworks and consider different underlying hydrological processes, leading to varied model structures. Furthermore, the equations used to describe various processes take different forms and are generally accessible only from within the individual model codes. These factors have hindered a holistic and detailed understanding of how different models operate, yet such an understanding is crucial for explaining the results of model evaluation studies, understanding inter-model differences in their simulations, and identifying areas for future model development. This study provides a comprehensive overview of how 16 state-of-the-art GWMs are designed. We analyse water storage compartments, water flows, and human water use sectors included in models that provide simulations for the Inter-Sectoral Impact Model Intercomparison Project phase 2b (ISIMIP2b). We develop a standard writing style for the model equations to enhance model intercomparison, improvement, and communication. In this study, WaterGAP2 used the highest number of water storage compartments, 11, and CWatM used 10 compartments. Six models used six compartments, while four models (DBH, JULES-W1, Mac-PDM.20, and VIC) used the lowest number, three compartments. WaterGAP2 simulates five human water use sectors, while four models (CLM4.5, CLM5.0, LPJmL, and MPI-HM) simulate only water for the irrigation sector. We conclude that, even though hydrological processes are often based on similar equations for various processes, in the end these equations have been adjusted or models have used different values for specific parameters or specific variables. The similarities and differences found among the models analysed in this study are expected to enable us to reduce the uncertainty in multi-model ensembles, improve existing hydrological processes, and integrate new processes.
  5. Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel "split-process" approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster. 
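    The split-process approach places the MPI application (the "upper half") and the MPI library plus network stack (the "lower half") in one address space, so the lower half can be discarded at checkpoint time and re-created, possibly with a different MPI implementation, at restart. The C sketch below is only a loose illustration of that indirection, not MANA's loader-based mechanism: the application reaches the MPI library exclusively through function pointers resolved at run time with dlopen/dlsym.

        /*
         * Loose illustration of split-process indirection (not MANA's actual
         * loader): the application never links against MPI directly; it reaches
         * the library through function pointers resolved at run time, so that
         * "lower half" could be torn down at checkpoint time and re-created,
         * possibly over a different MPI, at restart.
         */
        #include <dlfcn.h>
        #include <stdio.h>

        typedef int (*mpi_init_fn)(int *, char ***);
        typedef int (*mpi_finalize_fn)(void);

        int main(int argc, char **argv)
        {
            /* "Lower half": whichever MPI library this system provides. */
            void *lower = dlopen("libmpi.so", RTLD_NOW | RTLD_GLOBAL);
            if (!lower) {
                fprintf(stderr, "dlopen failed: %s\n", dlerror());
                return 1;
            }

            mpi_init_fn     init     = (mpi_init_fn)dlsym(lower, "MPI_Init");
            mpi_finalize_fn finalize = (mpi_finalize_fn)dlsym(lower, "MPI_Finalize");
            if (!init || !finalize) {
                fprintf(stderr, "missing MPI symbols\n");
                return 1;
            }

            init(&argc, &argv);
            /* ... application work; at restart the lower half would be
             * reopened and recorded MPI state replayed into it ... */
            finalize();

            dlclose(lower);
            return 0;
        }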