Black-box statistical prediction of lossy compression ratios for scientific data

Underwood, Robert (ORCID: 0000-0002-1464-729X); Bessac, Julie (ORCID: 0000-0001-6407-2423); Krasowska, David (ORCID: 0000-0001-5411-4722); Calhoun, Jon C. (ORCID: 0000-0001-7191-4422); Di, Sheng (ORCID: 0000-0002-7339-5256); Cappello, Franck (ORCID: 0000-0002-7890-3934)

doi:10.1177/10943420231179417

Citation Details

Lossy compressors are increasingly adopted in scientific research to tackle the volumes of data produced by experiments or parallel numerical simulations and to facilitate data storage and movement. In contrast with the notion of entropy in lossless compression, no theoretical or data-based quantification of lossy compressibility exists for scientific data, so users rely on trial and error to assess lossy compression performance. As a strong data-driven effort toward quantifying the lossy compressibility of scientific datasets, we provide a statistical framework to predict the compression ratios of lossy compressors. Our method is a two-step framework in which (i) compressor-agnostic predictors are computed and (ii) statistical prediction models relying on these predictors are trained on observed compression ratios. The proposed predictors exploit spatial correlations and notions of entropy and lossyness via the quantized entropy. We study 8+ compressors on 6 scientific datasets and achieve a median percentage prediction error of less than 12%, which is substantially smaller than that of other methods, while achieving at least an 8.8× speedup when searching for a specific compression ratio and a 7.8× speedup when determining the best compressor out of a collection.
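
The abstract names quantized entropy as one of the compressor-agnostic predictors but does not define it here. The following is a minimal sketch of one plausible reading, not the authors' implementation: values are mapped to uniform bins of width twice an absolute error bound, and the Shannon entropy of the resulting bin indices is reported. The function name `quantized_entropy`, the `error_bound` parameter, and the choice of bin width are all assumptions made for illustration.

```python
import math
from collections import Counter


def quantized_entropy(values, error_bound):
    """Shannon entropy (bits per symbol) of values after uniform quantization.

    Assumption: bins have width 2 * error_bound, so every value lies within
    error_bound of its bin centre. The paper's exact definition may differ.
    """
    if error_bound <= 0:
        raise ValueError("error_bound must be positive")
    if not values:
        return 0.0
    bin_width = 2.0 * error_bound
    # Each value is replaced by the integer index of the bin it falls in.
    symbols = [math.floor(v / bin_width) for v in values]
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical data: a smooth ramp. A looser error bound merges more values
# into the same bin, so the entropy of the quantized symbols drops.
ramp = [0.01 * i for i in range(1000)]
tight = quantized_entropy(ramp, 1e-3)
loose = quantized_entropy(ramp, 1.0)
print(f"tight bound: {tight:.2f} bits, loose bound: {loose:.2f} bits")
```

In a two-step setup like the one described, a value such as this would be one input feature, alongside spatial-correlation features, to a regression model fitted on observed compression ratios. How the paper combines them is not stated in this record.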

Award ID(s): 2018069 2104023 2003709 1943114 1910197

PAR ID: 10422451

Author(s) / Creator(s): Underwood, Robert ; Bessac, Julie ; Krasowska, David ; Calhoun, Jon C. ; Di, Sheng ; Cappello, Franck

Publisher / Repository: SAGE Publications

Date Published: 2023-06-14

Journal Name: The International Journal of High Performance Computing Applications

Volume: 37

Issue: 3-4

ISSN: 1094-3420

Format(s): Medium: X Size: p. 412-433

Size(s): p. 412-433

Sponsoring Org: National Science Foundation

Journal Article:
https://doi.org/10.1177/10943420231179417
