NSF PAR Search | NSF Public Access Repository


HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs

https://doi.org/10.1145/3577193.3593717

Zhang, Chengming; Smith, Shaden; Sun, Baixi; Tian, Jiannan; Soifer, Jonathan; Yu, Xiaodong; Song, Shuaiwen Leon; He, Yuxiong; Tao, Dingwen (June 2023, ACM)
HBMax: Optimizing Memory Efficiency for Parallel Influence Maximization on Multicore Architectures

Chen, Xinyu; Minutoli, Marco; Tian, Jiannan; Halappanavar, Mahantesh; Kalyanaraman, Ananth; Tao, Dingwen (October 2022, The 31st International Conference on Parallel Architectures and Compilation Techniques (PACT 2022))

Influence maximization aims to select the k most influential vertices, or seeds, in a network, where influence is defined by a given diffusion process. Although computing the optimal seed set is NP-hard, efficient approximation algorithms exist. However, even state-of-the-art parallel implementations are limited by a sampling step that incurs large memory footprints. This in turn limits the reachable problem size and the approximation quality. In this work, we study the memory footprint of the sampling process that collects reverse reachability information in the IMM (Influence Maximization via Martingales) algorithm over large real-world social networks. We present a memory-efficient optimization approach (called HBMax) based on Ripples, a state-of-the-art multi-threaded parallel influence maximization solution. HBMax uses a portion of the reverse reachable (RR) sets collected by the algorithm to learn the characteristics of the graph. Then, it compresses the intermediate reverse reachability information with Huffman coding or bitmap coding, and queries either the partially decoded data or the compressed data directly, to preserve the memory savings obtained through compression. Considering a NUMA architecture, we scale up our solution on 64 CPU cores and reduce the memory footprint by up to 82.1% with an average 6.3% speedup (encoding overhead is offset by the performance gain from memory reduction) without loss of accuracy. For the largest tested graph, Twitter7 (with 1.4 billion edges), HBMax achieves a 5.9× compression ratio and a 2.2× speedup.
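As a rough illustration of querying compressed reverse reachability data without decoding it, the sketch below packs each RR set into a bitmap and counts how many sets contain a vertex by testing bits. It is a minimal sketch under stated assumptions, not the HBMax or Ripples implementation: the function names (encode_bitmap, contains, coverage, best_seed) and the toy RR sets are hypothetical, and a Python integer stands in for a real bitmap structure.

```python
from typing import Iterable, List


def encode_bitmap(rr_set: Iterable[int]) -> int:
    """Pack vertex ids into one integer; bit v is set if vertex v is in the set."""
    bits = 0
    for v in rr_set:
        bits |= 1 << v
    return bits


def contains(bitmap: int, v: int) -> bool:
    """Membership test performed directly on the encoded form."""
    return (bitmap >> v) & 1 == 1


def coverage(bitmaps: List[int], v: int) -> int:
    """Number of encoded RR sets that contain vertex v."""
    return sum(1 for b in bitmaps if contains(b, v))


def best_seed(bitmaps: List[int], num_vertices: int) -> int:
    """Vertex covering the most RR sets (one greedy seed-selection step)."""
    return max(range(num_vertices), key=lambda v: coverage(bitmaps, v))


# Hypothetical toy data: four RR sets over vertices 0..5.
rr_sets = [{0, 2, 3}, {2, 5}, {1, 2}, {3, 5}]
bitmaps = [encode_bitmap(s) for s in rr_sets]
print(best_seed(bitmaps, num_vertices=6))
```

A bitmap of this kind pays off when RR sets are dense relative to the vertex range; for sparse sets, an entropy code such as Huffman coding is the other option the abstract names.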
Full Text Available
H-GCN: A Graph Convolutional Network Accelerator on Versal ACAP Architecture

https://doi.org/10.1109/FPL57034.2022.00040

Zhang, Chengming; Geng, Tong; Guo, Anqi; Tian, Jiannan; Herbordt, Martin; Li, Ang; Tao, Dingwen (August 2022, 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL 2022))

Graph Neural Networks (GNNs) have drawn tremendous attention due to their unique capability to extend Machine Learning (ML) approaches to applications broadly defined as having unstructured data, especially graphs. Compared with other ML modalities, the acceleration of GNNs is more challenging due to the irregularity and heterogeneity derived from graph topologies. Existing efforts, however, have focused mainly on handling graphs' irregularity and have not studied their heterogeneity. To this end, we propose H-GCN, a PL (Programmable Logic) and AIE (AI Engine) based hybrid accelerator that leverages the emerging heterogeneity of Xilinx Versal Adaptive Compute Acceleration Platforms (ACAPs) to achieve high-performance GNN inference. In particular, H-GCN partitions each graph into three subgraphs based on its inherent heterogeneity, and processes them using PL and AIE, respectively. To further improve performance, we explore the sparsity support of AIE and develop an efficient density-aware method to automatically map tiles of sparse matrix-matrix multiplication (SpMM) onto the systolic tensor array. Compared with state-of-the-art GCN accelerators, H-GCN achieves, on average, speedups of 1.1∼2.3×.
Full Text Available
Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

https://doi.org/10.1109/IPDPS53621.2022.00075

Rivera, Cody; Di, Sheng; Tian, Jiannan; Yu, Xiaodong; Tao, Dingwen; Cappello, Franck (May 2022, The 36th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2022))

More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ designed to take advantage of the GPU's power. At present, cuSZ's compression performance has been optimized significantly, while its decompression still suffers from considerably lower performance because of its sophisticated lossless compression step, a customized Huffman decoding. In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving the overall decompression performance in turn. To this end, we first investigate two state-of-the-art GPU Huffman decoders in depth. Then, we propose a deep architectural optimization for both algorithms. Specifically, we take full advantage of CUDA GPU architectures by using shared memory in the decoding/writing phases, tuning the amount of shared memory online, improving memory access patterns, and reducing warp divergence. Finally, we evaluate our optimized decoders on an Nvidia V100 GPU using eight representative scientific datasets. Our new decoding solution obtains an average speedup of 3.64X over cuSZ's Huffman decoder and improves its overall decompression performance by 2.43X on average.
Full Text Available
COMET: a novel memory-efficient deep learning training framework by using error-bounded lossy compression

https://doi.org/10.14778/3503585.3503597

Jin, Sian; Zhang, Chengming; Jiang, Xintong; Feng, Yunhe; Guan, Hui; Li, Guanpeng; Song, Shuaiwen Leon; Tao, Dingwen (December 2021, Proceedings of the VLDB Endowment)

Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. Training wide and deep neural networks requires large amounts of storage resources such as memory because the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped with very limited memory capacities due to hardware design constraints, which significantly limits the maximum batch size and hence performance speedup when training large-scale DNNs. Traditional memory saving techniques either suffer from performance overhead or are constrained by limited interconnect bandwidth or specific interconnect technology. In this paper, we propose a novel memory-efficient CNN training framework (called COMET) that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger models or to accelerate training. Our framework purposely adopts error-bounded lossy compression with a strict error-controlling mechanism. Specifically, we perform a theoretical analysis on the compression error propagation from the altered activation data to the gradients, and empirically investigate the impact of altered gradients over the training process. Based on these analyses, we optimize the error-bounded lossy compression and propose an adaptive error-bound control scheme for activation data compression. Experiments demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5X over the baseline training and 1.8X over another state-of-the-art compression-based framework, respectively, with little or no accuracy loss.
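To make "error-bounded" concrete, the sketch below shows the simplest scheme with that property: uniform scalar quantization with an absolute error bound eb, where each value is mapped to the nearest multiple of 2*eb. It is a minimal sketch under stated assumptions and not the compressor used by COMET or SZ, which add prediction and entropy coding on top; the names compress and decompress and the sample values are hypothetical.

```python
from typing import List


def compress(values: List[float], eb: float) -> List[int]:
    """Quantize each value to an integer code using bins of width 2*eb."""
    if eb <= 0:
        raise ValueError("error bound must be positive")
    step = 2.0 * eb
    return [round(x / step) for x in values]


def decompress(codes: List[int], eb: float) -> List[float]:
    """Reconstruct each value as the center of its quantization bin."""
    step = 2.0 * eb
    return [q * step for q in codes]


# Hypothetical activation values and error bound.
activations = [0.034, -1.273, 3.14159, 0.0, 12.5]
eb = 0.01
codes = compress(activations, eb)
restored = decompress(codes, eb)

# Every reconstructed value lies within eb of the original
# (a tiny slack absorbs floating-point rounding).
assert all(abs(a - r) <= eb + 1e-12 for a, r in zip(activations, restored))
```

The small integer codes are what a later lossless stage would compress; a larger eb yields fewer distinct codes and a higher compression ratio at the cost of larger reconstruction error, which is the trade-off an adaptive error-bound scheme tunes.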
Full Text Available
Optimizing Error-Bounded Lossy Compression for Scientific Data on GPUs

https://doi.org/10.1109/Cluster48925.2021.00047

Tian, Jiannan; Di, Sheng; Yu, Xiaodong; Rivera, Cody; Zhao, Kai; Jin, Sian; Feng, Yunhe; Liang, Xin; Tao, Dingwen; Cappello, Franck (September 2021, 2021 IEEE International Conference on Cluster Computing (CLUSTER 2021))

Full Text Available
ClickTrain: Efficient and Accurate End-to-End Deep Learning Training via Fine-Grained Architecture-Preserving Pruning

https://doi.org/10.1145/3447818.3459988

Zhang, Chengming; Yuan, Geng; Niu, Wei; Tian, Jiannan; Jin, Sian; Zhuang, Donglin; Jiang, Zhe; Wang, Yanzhi; Ren, Bin; Song, Shuaiwen Leon; et al (June 2021, The 35th ACM International Conference on Supercomputing (ICS 2021))

Convolutional neural networks (CNNs) are becoming increasingly deeper, wider, and non-linear because of the growing demand on prediction accuracy and analysis quality. The wide and deep CNNs, however, require a large amount of computing resources and processing time. Many previous works have studied model pruning to improve inference performance, but little work has been done on effectively reducing training cost. In this paper, we propose ClickTrain: an efficient and accurate end-to-end training and pruning framework for CNNs. Different from the existing pruning-during-training work, ClickTrain provides higher model accuracy and compression ratio via fine-grained architecture-preserving pruning. By leveraging pattern-based pruning with our proposed novel accurate weight importance estimation, dynamic pattern generation and selection, and compiler-assisted computation optimizations, ClickTrain generates highly accurate and fast pruned CNN models for direct deployment without any extra time overhead, compared with the baseline training. ClickTrain also reduces the end-to-end time cost of the pruning-after-training method by up to 2.3X with comparable accuracy and compression ratio. Moreover, compared with the state-of-the-art pruning-during-training approach, ClickTrain provides significant improvements in both accuracy and compression ratio on the tested CNN models and datasets, under similar limited training time.
Full Text Available
Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

https://doi.org/10.1109/IPDPS49936.2021.00097

Tian, Jiannan; Rivera, Cody; Di, Sheng; Chen, Jieyang; Liang, Xin; Tao, Dingwen; Cappello, Franck (May 2021, The 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2021))

Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during execution, so data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory, and it is a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today's HPC applications increasingly rely on accelerators such as GPUs on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the entire data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory-bandwidth feature of modern GPU architectures. Our contributions are fourfold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction-based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare it with our implemented multi-threaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0x and 6.8x on NVIDIA RTX 5000 and V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3x over the multi-threaded encoder on two 28-core Xeon Platinum 8280 CPUs.
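For readers unfamiliar with the two stages the abstract refers to, codebook construction and encoding, the sketch below shows the classical sequential Huffman algorithm on a CPU. It is only a baseline illustration under stated assumptions, not the parallel GPU scheme proposed in the paper; the function names (build_codebook, encode, decode) and the sample input are hypothetical, and codes are kept as strings of '0' and '1' for readability rather than packed bits.

```python
import heapq
from collections import Counter
from typing import Dict, List


def build_codebook(symbols: List[int]) -> Dict[int, str]:
    """Build a prefix-free code by repeatedly merging the two lightest subtrees."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code suffix so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(symbols: List[int], codebook: Dict[int, str]) -> str:
    """Concatenate the codeword of every input symbol."""
    return "".join(codebook[s] for s in symbols)


def decode(bits: str, codebook: Dict[int, str]) -> List[int]:
    """Scan the bit string, emitting a symbol whenever a full codeword is matched."""
    inverse = {code: sym for sym, code in codebook.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out


data = [0, 0, 0, 0, 1, 1, 2, 3]
book = build_codebook(data)
assert decode(encode(data, book), book) == data
```

Both loops are inherently serial here: each merge depends on the previous one, and each codeword's output position depends on the lengths of all earlier codewords. Those dependencies are what make Huffman coding hard to map onto GPUs.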
Full Text Available
A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression

https://doi.org/10.1145/3437801.3441597

Jin, Sian; Li, Guanpeng; Song, Shuaiwen Leon; Tao, Dingwen (February 2021, The 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2021))

DNNs are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. Traditional memory saving techniques such as data recomputation and migration either suffer from a high performance overhead or are constrained by specific interconnect technology and limited bandwidth. In this paper, we propose a novel memory-driven high performance CNN training framework that leverages error-bounded lossy compression to significantly reduce the memory requirement for training in order to allow training larger neural networks. We evaluate our design against state-of-the-art solutions with four widely-adopted CNNs and the ImageNet dataset. Results demonstrate that our proposed framework can significantly reduce the training memory consumption by up to 13.5x and 1.8x over the baseline training and state-of-the-art framework with compression, respectively, with little or no accuracy loss. The full paper is available at https://arxiv.org/abs/2011.09017.
Full Text Available
TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs

https://doi.org/10.1016/j.jpdc.2021.02.013

Rivera, Cody; Chen, Jieyang; Xiong, Nan; Zhang, Jing; Song, Shuaiwen Leon; Tao, Dingwen (February 2021, Journal of Parallel and Distributed Computing)
Full Text Available
