ARC: An Automated Approach to Resiliency for Lossy Compressed Data via Error Correcting Codes

Fulp, Dakota; Poulos, Alexandra; Underwood, Robert; Calhoun, Jon C.

doi:10.1145/3431379.3460638

Progress in high-performance computing (HPC) systems has led to complex applications that stress the I/O subsystem by creating vast amounts of data. Lossy compression reduces data size considerably, but a single error renders lossy compressed data unusable. This sensitivity stems from the high information content per bit in compressed data and is a critical issue because soft errors that cause bit-flips have become increasingly commonplace in HPC systems. While many works have improved lossy compressor performance, few have sought to address this critical weakness. This paper presents ARC: Automated Resiliency for Compression. Given user-defined constraints on storage, throughput, and resiliency, ARC automatically determines the optimal error-correcting code (ECC) configuration before encoding data. We conduct an extensive fault injection study to fully understand the effects of soft errors on lossy compressed data and how best to protect it. We evaluate ARC's scalability, performance, resiliency, and ease of use. On a 40-core node, encoding and decoding reach throughputs of up to 3730 MB/s and 3602 MB/s, respectively. ARC also detects and corrects multi-bit errors with a tunable overhead in terms of storage and throughput. Finally, we demonstrate the ease of using ARC and show how to consider a system's failure rate when determining the constraints.
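
To make the idea of trading storage for resiliency concrete, the following is a minimal sketch of one textbook error-correcting code, Hamming(7,4), which stores 7 bits for every 4 data bits and can correct any single bit-flip within a codeword. It is chosen purely for illustration: the function names are hypothetical, and nothing here is taken from ARC's implementation or claims to match the ECC configurations ARC selects.

```python
# Illustrative sketch only: textbook Hamming(7,4), not ARC's code or API.
# All names below are hypothetical.

DATA_BITS = 4
CODE_BITS = 7
# Extra storage per data bit: (7 - 4) / 4 = 0.75, i.e. 75% overhead.
STORAGE_OVERHEAD = (CODE_BITS - DATA_BITS) / DATA_BITS


def hamming74_encode(nibble):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if clean."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the 1-based position
    # of the flipped bit (assuming at most one flip occurred).
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    corrupted = hamming74_encode(data)
    corrupted[4] ^= 1  # simulate a soft error: flip one stored bit
    recovered, position = hamming74_decode(corrupted)
    print("recovered:", recovered, "flipped position:", position)
    print("storage overhead:", STORAGE_OVERHEAD)
```

Codes with more parity per data bit can tolerate more errors but cost more storage and encoding time, which is the kind of trade-off the abstract describes ARC resolving automatically from the user's constraints.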

Award ID(s): 1910197; 1943114; 1633608

PAR ID: 10294532

Author(s) / Creator(s): Fulp, Dakota; Poulos, Alexandra; Underwood, Robert; Calhoun, Jon C.

Date Published: 2020-06-21

Journal Name: HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing

Page Range / eLocation ID: 57 to 68

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1145/3431379.3460638