NSF PAR Search | NSF Public Access Repository

Zperf: A Statistical Gray-Box Approach to Performance Modeling and Extrapolation for Scientific Lossy Compression

https://doi.org/10.1109/TC.2023.3257517

Wang, Jinzhen; Chen, Qi; Liu, Tong; Liu, Qing; He, Xubin (January 2023, IEEE Transactions on Computers)

Full Text Available
Locality-based transfer learning on compression autoencoder for efficient scientific data lossy compression

https://doi.org/10.1016/j.jnca.2022.103452

Wang, Nan; Liu, Tong; Wang, Jinzhen; Liu, Qing; Alibhai, Shakeel; He, Xubin (September 2022, Journal of Network and Computer Applications)

Full Text Available
Reducing the Training Overhead of the HPC Compression Autoencoder via Dataset Proportioning

Tong Liu, Shakeel Alibhai (October 2021, Proc. of the 15th IEEE International Conference on Networking, Architecture, and Storage (NAS))
As the storage overhead of high-performance computing (HPC) data reaches into the petabyte or even exabyte scale, it could be useful to find new methods of compressing such data. The compression autoencoder (CAE) has recently been proposed to compress HPC data with a very high compression ratio; however, this machine learning-based method suffers from the major drawback of lengthy training times. In this paper, we attempt to mitigate this problem by proposing a proportioning scheme that reduces the amount of data that is used for training relative to the amount of data to be compressed. We show that this method drastically reduces the training time without, in most cases, significantly increasing the error. We further explain how this scheme can even improve the accuracy of the CAE on certain datasets. Finally, we provide some guidance on how to determine a suitable proportion of the training dataset to use in order to train the CAE for a given dataset.
Full Text Available
High-Ratio Lossy Compression: Exploring the Autoencoder to Compress Scientific Data

https://doi.org/10.1109/TBDATA.2021.3066151

Liu, Tong; Wang, Jinzhen; Liu, Qing; Alibhai, Shakeel; Lu, Tao; He, Xubin (January 2021, IEEE Transactions on Big Data)
Full Text Available
Compression Ratio Modeling and Estimation across Error Bounds for Lossy Compression

https://doi.org/10.1109/TPDS.2019.2938503

Wang, Jinzhen; Liu, Tong; Liu, Qing; He, Xubin; Luo, Huizhang; He, Weiming (July 2020, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
EC-Fusion: An Efficient Hybrid Erasure Coding Framework to Improve Both Application and Recovery Performance in Cloud Storage Systems

https://doi.org/10.1109/IPDPS47924.2020.00029

Qiu, Han; Wu, Chentao; Li, Jie; Guo, Minyi; Liu, Tong; He, Xubin; Dong, Yuanyuan; Zhao, Yafei (May 2020, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

Nowadays erasure coding is one of the most significant techniques in cloud storage systems, which provides both quick parallel I/O processing and high capabilities of fault tolerance on massive data accesses. In these systems, triple disk failure tolerant arrays (3DFTs) are a typical configuration, which is supported by several classic erasure codes like Reed-Solomon (RS) codes, Local Reconstruction Codes (LRC), Minimum Storage Regeneration (MSR) codes, etc. For an online recovery process, the foreground application workloads and the background recovery workloads are handled simultaneously, which requires a comprehensive understanding of both types of workload characteristics. Although several techniques have been proposed to accelerate the I/O requests of online recovery processes, they are typically unilateral due to the fact that the above two workloads are not combined together to achieve high cost-effective performance. To address this problem, we propose Erasure Codes Fusion (EC-Fusion), an efficient hybrid erasure coding framework in cloud storage systems. EC-Fusion is a combination of RS and MSR codes, which dynamically selects the appropriate code based on its properties. On one hand, for write-intensive application workloads or low risk on data loss in recovery workloads, EC-Fusion uses RS code to decrease the computational overhead and storage cost concurrently. On the other hand, for read-intensive or frequent reconstruction in workloads, MSR code is a proper choice. Therefore, a better overall application and recovery performance can be achieved in a cost-effective fashion. To demonstrate the effectiveness of EC-Fusion, several experiments are conducted in Hadoop systems. The results show that, compared with the traditional hybrid erasure coding techniques, EC-Fusion accelerates the response time for application by up to 1.77×, and reduces the reconstruction time by up to 69.10%.
Full Text Available
Exploring Transfer Learning to Reduce Training Overhead of HPC Data in Machine Learning

https://doi.org/10.1109/NAS.2019.8834723

Liu, Tong; Alibhai, Shakeel; Wang, Jinzhen; Liu, Qing; He, Xubin; Wu, Chentao (September 2019, 2019 IEEE International Conference on Networking, Architecture and Storage (NAS))

Nowadays, scientific simulations on high-performance computing (HPC) systems can generate large amounts of data (in the scale of terabytes or petabytes) per run. When this huge amount of HPC data is processed by machine learning applications, the training overhead will be significant. Typically, the training process for a neural network can take several hours to complete, if not longer. When machine learning is applied to HPC scientific data, the training time can take several days or even weeks. Transfer learning, an optimization usually used to save training time or achieve better performance, has potential for reducing this large training overhead. In this paper, we apply transfer learning to a machine learning HPC application. We find that transfer learning can reduce training time without, in most cases, significantly increasing the error. This indicates transfer learning can be very useful for working with HPC datasets in machine learning applications.
Full Text Available
Can I/O Variability Be Reduced on QoS-Less HPC Storage Systems?

https://doi.org/10.1109/TC.2018.2881709

Huang, Dan; Liu, Qing; Choi, Jong; Podhorszki, Norbert; Klasky, Scott; Logan, Jeremy; Ostrouchov, George; He, Xubin; Wolf, Matthew (May 2019, IEEE Transactions on Computers)

Full Text Available
Reference-Counter Aware Deduplication in Erasure-Coded Distributed Storage System

https://doi.org/10.1109/NAS.2018.8515697

Liu, Tong; He, Xubin; Alibhai, Shakeel; Wu, Chentao (October 2018, 2018 IEEE International Conference on Networking, Architecture and Storage (NAS))

In modern distributed storage systems, space efficiency and system reliability are two major concerns. As a result, contemporary storage systems often employ data deduplication and erasure coding to reduce the storage overhead and provide fault tolerance, respectively. However, little work has been done to explore the relationship between these two techniques. In this paper, we propose Reference-counter Aware Deduplication (RAD), which employs the features of deduplication into erasure coding to improve garbage collection performance when deletion occurs. RAD wisely encodes the data according to the reference counter, which is provided by the deduplication level and thus reduces the encoding overhead when garbage collection is conducted. Further, since the reference counter also represents the reliability levels of the data chunks, we additionally made some effort to explore the trade-offs between storage overhead and reliability level among different erasure codes. The experiment results show that RAD can effectively improve the GC performance by up to 24.8% and the reliability analysis shows that, with certain data features, RAD can provide both better reliability and better storage efficiency compared to the traditional Round-Robin placement.
Full Text Available