NSF PAR Search | NSF Public Access Repository


Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices

https://doi.org/10.1145/3747842

Niu, Wei; Sun, Mengshu; Li, Zhengang; Chen, Jou-An; Guan, Jiexiong; Shen, Xipeng; Liu, Jun; Zhang, Mei; Wang, Yanzhi; Lin, Xue; et al (July 2025, ACM Transactions on Architecture and Code Optimization)

It is challenging to deploy 3D Convolutional Neural Networks (3D CNNs) on mobile devices, especially when both real-time execution and high inference accuracy are required, because the increasingly large model size and complex model structure of 3D CNNs usually demand tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning methods are either incompatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation. This paper proposes Mobile-3DCNN, an end-to-end 3D CNN acceleration framework based on pruning/compilation co-design that consists of two parts: a novel fine-grained structured pruning scheme enhanced by prune/Winograd adaptive selection (which is mobile-hardware-friendly and can achieve high pruning accuracy), and a set of compiler optimization and code generation techniques enabled by our pruning (to fully translate the pruning benefit into real performance gains). The evaluation demonstrates that Mobile-3DCNN outperforms state-of-the-art end-to-end DNN acceleration frameworks that support 3D CNN execution on mobile devices, namely Alibaba Mobile Neural Network (MNN) and PyTorch Mobile, with speedups of up to 34× and minor accuracy degradation, showing that it is possible to execute high-accuracy large 3D CNNs on mobile devices in real time (or even ultra-real-time).
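The abstract refers to Winograd convolution without describing it. As background only, here is a minimal sketch of the standard one-dimensional Winograd F(2,3) algorithm, which produces two outputs of a 3-tap filter with four multiplications instead of six. It is not the paper's prune/Winograd selection scheme, and the function names are hypothetical.

```python
def direct_f23(d, g):
    """Reference: two outputs of a 3-tap filter over 4 inputs (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs using 4 multiplications.

    d: 4 input values, g: 3 filter taps. The filter-side sums
    (g0 + g1 + g2) / 2 and (g0 - g1 + g2) / 2 depend only on the filter,
    so in practice they can be precomputed once per filter.
    """
    g_plus = (g[0] + g[1] + g[2]) / 2
    g_minus = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_plus
    m3 = (d[2] - d[1]) * g_minus
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    taps = [1.0, 0.0, -1.0]
    print(direct_f23(data, taps))
    print(winograd_f23(data, taps))
```

The saving in multiplications comes at the cost of extra additions and a dense transformed filter, which is one reason a framework might choose between pruning and Winograd on a per-layer basis.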
Free, publicly-accessible full text available July 22, 2026
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers

https://doi.org/10.1145/3650200.3656622

Li, Zhengang; Lu, Alec; Xie, Yanyue; Kong, Zhenglun; Sun, Mengshu; Tang, Hao; Xue, Zhong Jia; Dong, Peiyan; Ding, Caiwen; Wang, Yanzhi; et al (May 2024, ACM)

Full Text Available
Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

https://doi.org/10.1609/aaai.v37i7.26008

Kong, Zhenglun; Ma, Haoyu; Yuan, Geng; Sun, Mengshu; Xie, Yanyue; Dong, Peiyan; Meng, Xin; Shen, Xuan; Tang, Hao; Qin, Minghai; et al (June 2023, Proceedings of the AAAI Conference on Artificial Intelligence)

Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from pre-trained dense models and focus only on efficient inference, so time-consuming training remains unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper introduces sparsity into the data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme that explores sparsity at three levels: the number of training examples in the dataset, the number of patches (tokens) in each example, and the number of connections between tokens that lie in the attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve the ViT accuracy rather than compromise it. For example, we can achieve a 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on DeiT-T, and a 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on DeiT-S. This proves the existence of data redundancy in ViT. Our code is released at https://github.com/ZLKong/Tri-Level-ViT
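The three sparsity levels named in the abstract can be pictured with a toy sketch. It assumes each training example, token, and attention connection already has an importance score. How the scores are obtained is not stated in the abstract, so the scores, the top-k rule, and the function names below are all hypothetical and are not taken from the released code.

```python
def top_k_indices(scores, keep_ratio):
    """Return the (sorted) indices of the highest-scoring fraction of items."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparsify_attention_row(attn_row, keep_ratio):
    """Zero the weakest attention connections and renormalize the rest."""
    keep = set(top_k_indices(attn_row, keep_ratio))
    kept = [a if j in keep else 0.0 for j, a in enumerate(attn_row)]
    total = sum(kept)
    return [a / total for a in kept] if total > 0 else kept


if __name__ == "__main__":
    # Level 1: keep a subset of training examples (hypothetical scores).
    example_scores = [0.9, 0.1, 0.5, 0.7]
    print(top_k_indices(example_scores, 0.5))

    # Level 2: keep a subset of patch tokens within one example.
    token_scores = [0.2, 0.8, 0.4, 0.6]
    print(top_k_indices(token_scores, 0.5))

    # Level 3: keep a subset of token-to-token attention connections.
    attn_row = [0.1, 0.4, 0.2, 0.3]
    print(sparsify_attention_row(attn_row, 0.5))
```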
Full Text Available
ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for Low-Bit DNN Training

https://doi.org/10.23919/DATE56975.2023.10137222

Chang, Sung-En; Yuan, Geng; Lu, Alec; Sun, Mengshu; Li, Yanyu; Ma, Xiaolong; Li, Zhengang; Xie, Yanyue; Qin, Minghai; Lin, Xue; et al (April 2023, 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE))
HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

https://doi.org/10.1109/HPCA56546.2023.10071047

Dong, Peiyan; Sun, Mengshu; Lu, Alec; Xie, Yanyue; Liu, Kenneth; Kong, Zhenglun; Meng, Xin; Li, Zhengang; Lin, Xue; Fang, Zhenman; et al (February 2023, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
TAAS: a timing-aware analytical strategy for AQFP-capable placement automation

https://doi.org/10.1145/3489517.3530487

Dong, Peiyan; Xie, Yanyue; Li, Hongjia; Sun, Mengshu; Chen, Olivia; Yoshikawa, Nobuyuki; Wang, Yanzhi (July 2022, Design Automation Conference (DAC))

Full Text Available
ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for Low-Bit DNN Training

Chang, Sung-En; Yuan, Geng; Lu, Alec; Sun, Mengshu; Li, Yanyu; Ma, Xiaolong; Li, Zhengang; Xie, Yanyue; Qin, Minghai; Lin, Xue; et al (January 2023, Design, Automation & Test in Europe Conference & Exhibition (DATE))

Full Text Available
Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

https://doi.org/10.1109/FPL57034.2022.00027

Li, Zhengang; Sun, Mengshu; Lu, Alec; Ma, Haoyu; Yuan, Geng; Xie, Yanyue; Tang, Hao; Li, Yanyu; Leeser, Miriam; Wang, Zhangyang; et al (August 2022, 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL))

Full Text Available
FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization

https://doi.org/10.1145/3490422.3502364

Sun, Mengshu; Li, Zhengang; Lu, Alec; Li, Yanyu; Chang, Sung-En; Ma, Xiaolong; Lin, Xue (January 2022, Proceedings of the 30th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA))

Full Text Available
ILMPQ: An Intra-Layer Multi-Precision Deep Neural Network Quantization framework for FPGA

Chang, Sung-En; Li, Yanyu; Sun, Mengshu; Wang, Yanzhi; Lin, Xue (February 2021, The Fifth Workshop on Cognitive Architectures (CogArch 2021))

This work targets the commonly used FPGA (field-programmable gate array) devices as the hardware platform for DNN edge computing. We focus on DNN quantization as the main model compression technique. The novelty of this work is that we use a quantization method that supports multiple precisions along the intra-layer dimension, whereas existing quantization methods apply multi-precision quantization along the inter-layer dimension. The intra-layer multi-precision method can unify the hardware configurations across different layers to reduce computation overhead while preserving model accuracy as the inter-layer approach does. Our proposed ILMPQ DNN quantization framework achieves 70.73% Top-1 accuracy with ResNet-18 on the ImageNet dataset. We also validate the proposed MSP framework on two FPGA devices, namely Xilinx XC7Z020 and XC7Z045. We achieve a 3.65× speedup in end-to-end inference time on ImageNet compared with the fixed-point quantization method.
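To make "multiple precisions along the intra-layer dimension" concrete, the following is a minimal sketch in which a fixed fraction of the rows of one layer's weight matrix is quantized at a higher bit width and the rest at a lower one. The bit widths, the ratio, the symmetric uniform quantizer, and the rule of giving high precision to rows with the largest magnitude are all illustrative assumptions, not details from the abstract.

```python
def quantize_row(row, bits):
    """Symmetric uniform quantization of one weight row to `bits` bits."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in row)
    if max_abs == 0.0:
        return list(row)
    scale = max_abs / levels
    return [round(w / scale) * scale for w in row]


def intra_layer_quantize(weights, high_bits=8, low_bits=4, high_ratio=0.25):
    """Quantize the rows of one layer at two precisions.

    Assumption: the rows with the largest absolute weight get `high_bits`.
    Using the same `high_ratio` in every layer keeps the share of
    high-precision rows identical from layer to layer.
    """
    n_high = int(len(weights) * high_ratio)
    ranked = sorted(
        range(len(weights)),
        key=lambda i: max(abs(w) for w in weights[i]),
        reverse=True,
    )
    high_rows = set(ranked[:n_high])
    bits = [high_bits if i in high_rows else low_bits for i in range(len(weights))]
    quantized = [quantize_row(row, b) for row, b in zip(weights, bits)]
    return quantized, bits


if __name__ == "__main__":
    layer = [
        [0.5, -1.0, 0.25],
        [0.1, 0.2, -0.3],
        [2.0, -0.1, 0.7],
        [0.05, 0.0, -0.02],
    ]
    q, bits = intra_layer_quantize(layer)
    print(bits)
    print(q)
```

Because every layer uses the same split between the two precisions, a single hardware configuration could in principle serve all layers, which is the benefit the abstract attributes to the intra-layer approach.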
Full Text Available
