

Title: Exploring Lossy Compression of Gene Expression Matrices
Gene Expression Matrices (GEMs) are a fundamental data type in the genomics domain. As the size and scope of genomics experiments increase, researchers are struggling to process large GEMs through downstream workflows with currently accepted practices. In this paper, we propose a methodology to reduce the size of GEMs using multiple approaches. Our method partitions data into discrete fields based on data type and employs state-of-the-art lossless and lossy compression algorithms to reduce the input data size. This work explores a variety of lossless and lossy compression methods to determine which methods work best for each component of a GEM. We evaluate the accuracy of the compressed GEMs by running them through the Knowledge Independent Network Construction (KINC) workflow and comparing the quality of the resulting gene co-expression network with a lossless control to verify result fidelity. Results show that a combination of lossy and lossless compression achieves compression ratios up to 9.77× on a yeast GEM while still preserving the biological integrity of the data. Applying the methodology to the Cancer Cell Line Encyclopedia (CCLE) GEM resulted in compression ratios up to 9.26×. By using this methodology, researchers in the genomics domain may be able to process previously inaccessible GEMs while realizing a significant reduction in computational costs.
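As a rough illustration of the field-partitioning idea described in the abstract, the Python sketch below separates a GEM's gene identifiers (kept lossless) from its expression values (quantized lossily) and compresses each part. The helper names, the use of gzip for the lossless stage, and rounding to a fixed number of decimals for the lossy stage are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: split a GEM into typed fields and compress each one differently.
# gzip for the text field and decimal rounding for the numeric field are
# stand-ins for the lossless/lossy compressors used in the paper.
import gzip
import numpy as np

def compress_gem(gene_names, expression, decimals=2):
    """gene_names: list[str] row labels; expression: 2-D float array."""
    # Lossless: gene identifiers must survive exactly, so gzip the raw text.
    labels_blob = gzip.compress("\n".join(gene_names).encode("utf-8"))
    # Lossy: round to a fixed number of decimals, narrow to float32, then gzip.
    quantized = np.round(expression, decimals).astype(np.float32)
    values_blob = gzip.compress(quantized.tobytes())
    return labels_blob, values_blob, quantized.shape

def decompress_gem(labels_blob, values_blob, shape):
    gene_names = gzip.decompress(labels_blob).decode("utf-8").split("\n")
    values = np.frombuffer(gzip.decompress(values_blob), dtype=np.float32).reshape(shape)
    return gene_names, values

if __name__ == "__main__":
    genes = [f"gene_{i:05d}" for i in range(1000)]                 # placeholder IDs
    gem = np.random.lognormal(mean=2.0, sigma=1.0, size=(1000, 50))
    lb, vb, shape = compress_gem(genes, gem)
    raw_bytes = gem.nbytes + sum(len(g) + 1 for g in genes)
    print(f"approximate compression ratio: {raw_bytes / (len(lb) + len(vb)):.2f}x")
```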
Award ID(s):
1659300 1910197
NSF-PAR ID:
10132572
Journal Name:
2019 IEEE/ACM 5th International Workshop on Data Analysis and Reduction for Big Scientific Data (DRBSD-5)
Page Range / eLocation ID:
28 to 34
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Gene co-expression networks (GCNs) are constructed from Gene Expression Matrices (GEMs) in a bottom-up approach where all gene pairs are tested for correlation within the context of the input sample set. This approach is computationally intensive for many current GEMs and may not be scalable to millions of samples. Further, traditional GCNs do not detect non-linear relationships missed by correlation tests and do not place genetic relationships in a gene expression intensity context. In this report, we propose EdgeScaping, which constructs and analyzes the pairwise gene intensity network in a holistic, top-down approach where no edges are filtered. EdgeScaping uses a novel technique to convert traditional pairwise gene expression data to an image-based format. This conversion not only performs feature compression, making our algorithm highly scalable, but also allows for exploring non-linear relationships between genes by leveraging deep learning image analysis algorithms. Using the learned embedded feature space, we implement a fast, efficient algorithm to cluster the entire space of gene expression relationships while retaining gene expression intensity. Since EdgeScaping does not eliminate conventionally noisy edges, it extends the identification of co-expression relationships beyond classically correlated edges to facilitate the discovery of novel or unusual expression patterns within the network. We applied EdgeScaping to a human tumor GEM to identify sets of genes that exhibit conventional and non-conventional interdependent non-linear behavior associated with brain-specific tumor sub-types that would be eliminated in conventional bottom-up construction of GCNs. EdgeScaping source code is available at https://github.com/bhusain/EdgeScaping under the MIT license.
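A minimal sketch of the pairwise-to-image conversion idea, assuming a plain 2-D histogram over normalized expression values as the image encoding; the 32x32 resolution and function names are illustrative, and EdgeScaping's actual conversion may differ.

```python
# Sketch: render the joint expression of two genes across samples as a small
# image suitable for an image-analysis (CNN-style) feature extractor.
import numpy as np

def pair_to_image(gene_a, gene_b, bins=32):
    """Return a bins x bins image of the pairwise expression distribution."""
    def norm(x):
        # Normalize each gene's expression to [0, 1] so the image spans the grid.
        rng = x.max() - x.min()
        return (x - x.min()) / rng if rng > 0 else np.zeros_like(x)

    hist, _, _ = np.histogram2d(norm(gene_a), norm(gene_b),
                                bins=bins, range=[[0, 1], [0, 1]])
    # Scale counts to [0, 1] so every pair image is directly comparable.
    return hist / hist.max() if hist.max() > 0 else hist

if __name__ == "__main__":
    samples = 1000
    a = np.random.lognormal(size=samples)
    b = np.sqrt(a) + np.random.normal(scale=0.1, size=samples)  # non-linear relationship
    img = pair_to_image(a, b)
    print(img.shape)  # (32, 32)
```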
  2. High-performance computing (HPC) systems that run scientific simulations of significance produce a large amount of data during runtime. Transferring or storing such big datasets causes a severe I/O bottleneck and a considerable storage burden. Applying compression techniques, particularly lossy compressors, can reduce the size of the data and mitigate such overheads. Unlike lossless compression algorithms, error-controlled lossy compressors can significantly reduce the data size while respecting a user-defined error bound. DCTZ is a transform-based lossy compressor with a highly efficient encoding and a purpose-built error control mechanism that achieves high compression ratios with high data fidelity. However, since DCTZ quantizes the DCT coefficients in the frequency domain, it may only partially control the relative error bound defined by the user. In this paper, we aim to improve the compression quality of DCTZ. Specifically, we propose a preconditioning method based on level offsetting and scaling to control the magnitude of the input to the DCTZ framework, thereby enforcing stricter error bounds. We evaluate the performance of our method in terms of compression ratio and rate distortion with real-world HPC datasets. Our experimental results show that our method can achieve a higher compression ratio than other state-of-the-art lossy compressors with a tighter error bound while precisely guaranteeing the user-defined error bound.
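A minimal sketch, assuming a simple offset-and-scale preconditioner placed in front of a DCT-plus-uniform-quantization stage; it only illustrates how level offsetting and scaling normalize the input magnitude before the transform, not the actual DCTZ preconditioning scheme.

```python
# Sketch: precondition a block by level offsetting and scaling, then apply a
# DCT and a uniform quantizer; decompression inverts every step.
import numpy as np
from scipy.fft import dct, idct

def compress_block(x, quant_step=1e-3):
    """Precondition, transform, and quantize one 1-D block of doubles."""
    offset, scale = x.min(), max(x.max() - x.min(), np.finfo(float).tiny)
    y = (x - offset) / scale                    # preconditioned values in [0, 1]
    coeffs = dct(y, norm="ortho")               # frequency-domain representation
    q = np.round(coeffs / quant_step).astype(np.int32)   # lossy step
    return q, offset, scale, quant_step

def decompress_block(q, offset, scale, quant_step):
    coeffs = q.astype(np.float64) * quant_step
    y = idct(coeffs, norm="ortho")
    return y * scale + offset                   # undo the preconditioning

if __name__ == "__main__":
    block = np.random.normal(loc=1e5, scale=10.0, size=64)   # large-magnitude data
    rec = decompress_block(*compress_block(block))
    print("max abs error:", np.max(np.abs(block - rec)))
```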
  3. As the amount of data produced by HPC applications reaches the exabyte range, compression techniques are often adopted to reduce the checkpoint time and volume. Since lossless techniques are limited in their ability to achieve appreciable data reduction, lossy compression becomes a preferable option. In this work, a lossy compression technique with highly efficient encoding, purpose-built error control, and high compression ratios is proposed. Specifically, we apply a discrete cosine transform with a novel block decomposition strategy directly to double-precision floating-point datasets instead of relying on prevailing prediction-based techniques. Further, we design an adaptive quantization with two task-oriented quantizers: one that guarantees error bounds and one that targets higher compression ratios. Using real-world HPC datasets, our approach achieves 3x-38x compression ratios while guaranteeing specified error bounds, showing performance comparable to the state-of-the-art lossy compression methods SZ and ZFP. Moreover, our method provides viable reconstructed data for various checkpoint/restart scenarios in the FLASH application and is thus a promising approach for lossy data compression in HPC I/O software stacks.
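A minimal sketch of block-decomposed DCT compression with a pointwise error guarantee. The block size, the uniform quantizer, and the "patch out-of-bound values with exact corrections" step are illustrative choices and are not claimed to match the proposed compressor's internal mechanism.

```python
# Sketch: block the data, DCT each block, quantize the coefficients, then
# verify reconstruction and store exact values wherever the bound is violated.
import numpy as np
from scipy.fft import dct, idct

BLOCK = 64

def compress(data, error_bound=1e-3):
    """Compress a 1-D float64 array block by block, enforcing |x - x'| <= error_bound."""
    pad = (-len(data)) % BLOCK
    blocks = np.concatenate([data, np.zeros(pad)]).reshape(-1, BLOCK)
    step = error_bound                                   # assumed quantization step
    quant = np.round(dct(blocks, axis=1, norm="ortho") / step).astype(np.int32)

    # Guarantee the bound: reconstruct, find violations, store exact corrections.
    recon = idct(quant * step, axis=1, norm="ortho")
    bad = np.abs(recon - blocks) > error_bound
    corrections = {tuple(idx): blocks[tuple(idx)] for idx in np.argwhere(bad)}
    return quant, corrections, len(data), step

def decompress(quant, corrections, n, step):
    recon = idct(quant * step, axis=1, norm="ortho")
    for (i, j), exact in corrections.items():
        recon[i, j] = exact                              # patch violating points losslessly
    return recon.reshape(-1)[:n]

if __name__ == "__main__":
    x = np.cumsum(np.random.normal(size=10_000))         # smooth-ish HPC-like field
    q, corr, n, step = compress(x)
    xr = decompress(q, corr, n, step)
    print("max abs error:", np.max(np.abs(x - xr)), "| corrections stored:", len(corr))
```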
  4. As the scale and complexity of high-performance computing (HPC) systems keep growing, data compression techniques are often adopted to reduce the data volume and processing time. While lossy compression is preferable to lossless compression because of its potential for high compression ratios, it is only worth the effort if an optimal balance is found between volume reduction and information loss. Among the many lossy compression techniques, transform-based algorithms make better use of spatial redundancy. However, transform-based lossy compressors have received relatively little attention because their compression performance on scientific datasets is not well understood. The insight of this paper is that, in transform-based lossy compressors, quantifying dominant coefficients at the block level reveals the right balance, potentially impacting overall compression ratios. Motivated by this, we characterize three transform-based lossy compression mechanisms with different information compaction methods using statistical features that capture data characteristics. We then build several prediction models using these statistical features and the characteristics of dominant coefficients, and evaluate the effectiveness of each model using six HPC datasets from three production-level simulations at scale. Our results demonstrate that the random forest classifier captures the behavior of dominant coefficients precisely, achieving nearly 99% prediction accuracy.
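A minimal sketch of the prediction idea, assuming a handful of cheap block statistics as features and a label derived from DCT energy concentration; the feature set, label definition, and synthetic data are assumptions, with scikit-learn's RandomForestClassifier standing in for the paper's prediction models.

```python
# Sketch: predict whether a block has strongly dominant transform coefficients
# from simple statistical features, using a random forest classifier.
import numpy as np
from scipy.fft import dct
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def block_features(block):
    """Cheap statistics intended to capture how 'compressible' a block looks."""
    return [block.mean(), block.std(), skew(block), kurtosis(block),
            np.abs(np.diff(block)).mean()]

def dominant_label(block, keep=0.9):
    """1 if 90% of DCT energy sits in the top quarter of coefficients, else 0."""
    c = np.abs(dct(block, norm="ortho")) ** 2
    top = np.sort(c)[::-1][: len(c) // 4].sum()
    return int(top >= keep * c.sum())

# Synthetic blocks: smooth random walks (compressible) and white noise (not).
rng = np.random.default_rng(0)
blocks = [np.cumsum(rng.normal(size=64)) for _ in range(500)] + \
         [rng.normal(size=64) for _ in range(500)]
X = np.array([block_features(b) for b in blocks])
y = np.array([dominant_label(b) for b in blocks])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("prediction accuracy:", clf.score(X_te, y_te))
```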
  5. Valencia, Alfonso (Ed.)
    Motivation: Nanopore sequencing provides a real-time and portable solution to genomic sequencing, enabling better assembly, structural variant discovery and modified base detection than second generation technologies. The sequencing process generates a huge amount of data in the form of raw signal contained in fast5 files, which must be compressed to enable efficient storage and transfer. Since the raw data is inherently noisy, lossy compression has potential to significantly reduce space requirements without adversely impacting performance of downstream applications.
    Results: We explore the use of lossy compression for nanopore raw data using two state-of-the-art lossy time-series compressors, and evaluate the tradeoff between compressed size and basecalling/consensus accuracy. We test several basecallers and consensus tools on a variety of datasets at varying depths of coverage, and conclude that lossy compression can provide 35–50% further reduction in compressed size of raw data over the state-of-the-art lossless compressor with negligible impact on basecalling accuracy (≲0.2% reduction) and consensus accuracy (≲0.002% reduction). In addition, we evaluate the impact of lossy compression on methylation calling accuracy and observe that this impact is minimal for similar reductions in compressed size, although further evaluation with improved benchmark datasets is required for reaching a definite conclusion. The results suggest the possibility of using lossy compression, potentially on the nanopore sequencing device itself, to achieve significant reductions in storage and transmission costs while preserving the accuracy of downstream applications.
    Availability and implementation: The code is available at https://github.com/shubhamchandak94/lossy_compression_evaluation.
    Supplementary information: Supplementary data are available at Bioinformatics online.
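A minimal sketch of the size-versus-fidelity tradeoff explored above, with uniform quantization of a synthetic raw-signal trace standing in for the lossy time-series compressors the authors evaluated; a real evaluation would additionally basecall the reconstructed signal, which is beyond this sketch.

```python
# Sketch: quantize a synthetic nanopore-like raw signal at increasing error
# tolerances and compare compressed sizes against compressing the original
# integer signal losslessly.
import gzip
import numpy as np

def lossy_compressed_size(signal, max_error):
    """Bytes after uniform quantization with |signal - reconstruction| <= max_error."""
    step = 2 * max_error
    quantized = np.round(signal / step).astype(np.int16)
    return len(gzip.compress(quantized.tobytes())), quantized.astype(np.float64) * step

rng = np.random.default_rng(1)
# Random-walk stand-in for a raw current trace (real data lives in fast5 files).
signal = np.round(np.cumsum(rng.normal(scale=3.0, size=200_000)) + 500.0).astype(np.int16)

lossless_size = len(gzip.compress(signal.tobytes()))
for max_err in (1, 2, 4, 8):
    size, recon = lossy_compressed_size(signal, max_err)
    saving = 100.0 * (1.0 - size / lossless_size)
    print(f"error bound {max_err}: {saving:.1f}% smaller than lossless, "
          f"observed max error {np.max(np.abs(signal - recon)):.2f}")
```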