NSF PAR Search | NSF Public Access Repository

Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices

https://doi.org/10.1145/3747842

Niu, Wei; Sun, Mengshu; Li, Zhengang; Chen, Jou-An; Guan, Jiexiong; Shen, Xipeng; Liu, Jun; Zhang, Mei; Wang, Yanzhi; Lin, Xue; et al (July 2025, ACM Transactions on Architecture and Code Optimization)

It is challenging to deploy 3D Convolutional Neural Networks (3D CNNs) on mobile devices, especially when both real-time execution and high inference accuracy are required, because the increasingly large model size and complex model structure of 3D CNNs usually demand tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning schemes are either incompatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation. This paper proposes Mobile-3DCNN, an end-to-end 3D CNN acceleration framework based on pruning/compilation co-design that consists of two parts: a novel, fine-grained structured pruning scheme enhanced by a prune/Winograd adaptive selection (which is mobile-hardware-friendly and can achieve high pruning accuracy), and a set of compiler optimization and code generation techniques enabled by this pruning (to fully translate the pruning benefit into real performance gains). The evaluation demonstrates that Mobile-3DCNN outperforms state-of-the-art end-to-end DNN acceleration frameworks that support 3D CNN execution on mobile devices, namely Alibaba Mobile Neural Network (MNN) and PyTorch Mobile, with speedups of up to 34× and minor accuracy degradation, showing that it is possible to execute high-accuracy large 3D CNNs on mobile devices in real time (or even ultra-real time).
Free, publicly-accessible full text available July 22, 2026
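The abstract above names a prune/Winograd adaptive selection but does not describe how it works. As background only, here is a minimal sketch of the standard 1D Winograd F(2,3) transform. It shows why Winograd is an alternative to pruning for cutting multiplications: two outputs of a 3-tap filter need 4 data multiplications instead of 6. The function names are hypothetical, and this is not the paper's 3D implementation or its selection logic.

```python
def direct_conv_1d(d, g):
    """Reference: two outputs of a 3-tap filter g over four inputs d (6 multiplications)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs using 4 data multiplications."""
    # Filter-side terms depend only on g, so they can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # One multiplication per intermediate product.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    return [m1 + m2 + m3, m2 - m3 - m4]
```

Because the savings come from a dense transform, zeroing individual weights does not by itself remove any of the four multiplications, which suggests why a framework might choose between pruning and Winograd rather than always combining them.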
SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

https://doi.org/10.1109/CVPR52733.2024.00827

Li, Zhengang; Kang, Yan; Liu, Yuchen; Liu, Difan; Hinz, Tobias; Liu, Feng; Wang, Yanzhi (June 2024, IEEE)

Full Text Available
Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers

https://doi.org/10.1145/3650200.3656622

Li, Zhengang; Lu, Alec; Xie, Yanyue; Kong, Zhenglun; Sun, Mengshu; Tang, Hao; Xue, Zhong Jia; Dong, Peiyan; Ding, Caiwen; Wang, Yanzhi; et al (May 2024, ACM)

Full Text Available
SuperFlow: A Fully-Customized RTL-to-GDS Design Automation Flow for Adiabatic Quantum-Flux-Parametron Superconducting Circuits

Xie, Yanyue; Dong, Peiyan; Yuan, Geng; Li, Zhengang; Zabihi, Masoud; Wu, Chao; Chang, Sung-En; Zhang, Xufeng; Lin, Xue; Ding, Caiwen; et al (March 2024, 2024 Design, Automation & Test in Europe Conference)

Full Text Available
SupeRBNN: Randomized Binary Neural Network Using Adiabatic Superconductor Josephson Devices

https://doi.org/10.1145/3613424.3623771

Li, Zhengang; Yuan, Geng; Yamauchi, Tomoharu; Zabihi, Masoud; Xie, Yanyue; Dong, Peiyan; Tang, Xulong; Yoshikawa, Nobuyuki; Tiwari, Devesh; Wang, Yanzhi; et al (October 2023, ACM)

Full Text Available
ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for Low-Bit DNN Training

https://doi.org/10.23919/DATE56975.2023.10137222

Chang, Sung-En; Yuan, Geng; Lu, Alec; Sun, Mengshu; Li, Yanyu; Ma, Xiaolong; Li, Zhengang; Xie, Yanyue; Qin, Minghai; Lin, Xue; et al (April 2023, 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE))
HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

https://doi.org/10.1109/HPCA56546.2023.10071047

Dong, Peiyan; Sun, Mengshu; Lu, Alec; Xie, Yanyue; Liu, Kenneth; Kong, Zhenglun; Meng, Xin; Li, Zhengang; Lin, Xue; Fang, Zhenman; et al (February 2023, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices based on Fine-Grained Structured Weight Sparsity

https://doi.org/10.1109/TPAMI.2021.3089687

Niu, Wei; Li, Zhengang; Ma, Xiaolong; Dong, Peiyan; Zhou, Gang; Qian, Xuehai; Lin, Xue; Wang, Yanzhi; Ren, Bin (October 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence)

It is appealing but challenging to achieve real-time deep neural network (DNN) inference on mobile devices, because even powerful modern mobile devices are considered “resource-constrained” when executing large-scale DNNs. This necessitates sparse model inference via weight pruning, i.e., DNN weight sparsity, and it is desirable to design a new DNN weight sparsity scheme that facilitates real-time inference on mobile devices while preserving high sparse-model accuracy. This paper designs a novel mobile inference acceleration framework, GRIM, that is General to both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) and that achieves Real-time execution and high accuracy, leveraging fine-grained structured sparse model Inference and compiler optimizations for Mobiles. We start by proposing a new fine-grained structured sparsity scheme based on Block-based Column-Row (BCR) pruning. Building on this scheme, the GRIM framework consists of two parts: (a) compiler optimization and code generation for real-time mobile inference; and (b) BCR pruning optimizations for determining pruning hyperparameters and performing weight pruning. We compare GRIM with Alibaba MNN, TVM, TensorFlow-Lite, a sparse implementation based on CSR, PatDNN, and ESE (a representative FPGA inference acceleration framework for RNNs), and achieve up to 14.08× speedup.
Full Text Available
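To make the idea of Block-based Column-Row (BCR) pruning in the GRIM abstract more concrete, here is a minimal sketch. It assumes that a 2-D weight matrix is split into fixed-size blocks and that whole columns and rows with the smallest L2 norm are zeroed inside each block. The abstract does not give the selection criterion or the hyperparameters, so the magnitude-based rule, the single `keep_ratio` parameter, and the function name are all assumptions made for illustration.

```python
import math


def bcr_prune(weights, block_rows, block_cols, keep_ratio):
    """Zero the weakest columns, then the weakest rows, inside each block.

    Assumption: "weakest" means smallest L2 norm within the block.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    pruned = [list(row) for row in weights]  # work on a copy

    for r0 in range(0, n_rows, block_rows):
        for c0 in range(0, n_cols, block_cols):
            rows = list(range(r0, min(r0 + block_rows, n_rows)))
            cols = list(range(c0, min(c0 + block_cols, n_cols)))

            # Column pruning: drop the lowest-norm columns of this block.
            col_norm = {c: math.sqrt(sum(pruned[r][c] ** 2 for r in rows)) for c in cols}
            n_drop = len(cols) - max(1, round(keep_ratio * len(cols)))
            for c in sorted(cols, key=lambda c: col_norm[c])[:n_drop]:
                for r in rows:
                    pruned[r][c] = 0.0

            # Row pruning: drop the lowest-norm rows of what remains.
            row_norm = {r: math.sqrt(sum(pruned[r][c] ** 2 for c in cols)) for r in rows}
            n_drop = len(rows) - max(1, round(keep_ratio * len(rows)))
            for r in sorted(rows, key=lambda r: row_norm[r])[:n_drop]:
                for c in cols:
                    pruned[r][c] = 0.0

    return pruned
```

For example, pruning a 4×4 matrix with 2×2 blocks and `keep_ratio=0.5` leaves one surviving weight per block. The surviving weights in each block always form a small dense sub-block, which is the kind of regular structure a compiler can turn into efficient code.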
ESRU: Extremely Low-Bit and Hardware-Efficient Stochastic Rounding Unit Design for Low-Bit DNN Training

Chang, Sung-En; Yuan, Geng; Lu, Alec; Sun, Mengshu; Li, Yanyu; Ma, Xiaolong; Li, Zhengang; Xie, Yanyue; Qin, Minghai; Lin, Xue; et al (January 2023, Design, Automation & Test in Europe Conference & Exhibition (DATE))

Full Text Available
Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration

https://doi.org/10.1145/3495532

Gong, Yifan; Yuan, Geng; Zhan, Zheng; Niu, Wei; Li, Zhengang; Zhao, Pu; Cai, Yuxuan; Liu, Sijia; Ren, Bin; Lin, Xue; et al (September 2022, ACM Transactions on Design Automation of Electronic Systems)

Weight pruning is an effective model compression technique to tackle the challenges of achieving real-time deep neural network (DNN) inference on mobile devices. However, prior pruning schemes have limited application scenarios due to accuracy degradation, difficulty in leveraging hardware acceleration, and/or restriction on certain types of DNN layers. In this article, we propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations that are applicable to any type of DNN layer while achieving high accuracy and hardware inference performance. With the flexibility of applying different pruning schemes to different layers enabled by our compiler optimizations, we further probe into the new problem of determining the best-suited pruning scheme considering the different acceleration and accuracy performance of various pruning schemes. Two pruning scheme mapping methods (one is search based and the other is rule based) are proposed to automatically derive the best-suited pruning regularity and block size for each layer of any given DNN. Experimental results demonstrate that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework with up to 2.48× and 1.73× DNN inference acceleration on CIFAR-10 and ImageNet datasets without accuracy loss.
Full Text Available
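The abstract above describes deriving a pruning regularity and block size for each layer, but it does not spell out the search-based or rule-based procedures. The sketch below shows only the general shape of a per-layer selection and is not the article's method. It assumes a caller-supplied `measure(layer, scheme)` function that returns a hypothetical `(latency, accuracy_drop)` pair, and it picks the fastest candidate whose accuracy drop stays within a budget. All names and parameters are illustrative.

```python
def map_pruning_schemes(layers, candidates, measure, max_accuracy_drop):
    """Pick, for each layer, the lowest-latency scheme within the accuracy budget.

    `measure` is a hypothetical callback returning (latency, accuracy_drop).
    """
    mapping = {}
    for layer in layers:
        best = None  # (latency, scheme) of the best admissible candidate so far
        for scheme in candidates:
            latency, drop = measure(layer, scheme)
            if drop > max_accuracy_drop:
                continue  # violates the accuracy budget
            if best is None or latency < best[0]:
                best = (latency, scheme)
        # None signals that no candidate met the budget for this layer.
        mapping[layer] = best[1] if best is not None else None
    return mapping
```

Treating layers independently keeps the sketch simple. A real mapping method would also have to account for how pruning one layer changes the accuracy cost of pruning the others.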
