NSF PAR Search | NSF Public Access Repository

Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices

https://doi.org/10.1145/3747842

Niu, Wei; Sun, Mengshu; Li, Zhengang; Chen, Jou-An; Guan, Jiexiong; Shen, Xipeng; Liu, Jun; Zhang, Mei; Wang, Yanzhi; Lin, Xue; et al (July 2025, ACM Transactions on Architecture and Code Optimization)

It is challenging to deploy 3D Convolutional Neural Networks (3D CNNs) on mobile devices, especially when both real-time execution and high inference accuracy are required, because the increasingly large model size and complex model structure of 3D CNNs usually require tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning is either incompatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation. This paper proposes an end-to-end 3D CNN acceleration framework based on pruning/compilation co-design called Mobile-3DCNN that consists of two parts: a novel, fine-grained structured pruning enhanced by a prune/Winograd adaptive selection (that is mobile-hardware-friendly and can achieve high pruning accuracy), and a set of compiler optimization and code generation techniques enabled by our pruning (to fully transform the pruning benefit to real performance gains). The evaluation demonstrates that Mobile-3DCNN outperforms state-of-the-art end-to-end DNN acceleration frameworks that support 3D CNN execution on mobile devices, Alibaba Mobile Neural Networks and PyTorch Mobile, with speedups of up to 34× and minor accuracy degradation, showing that it is possible to execute high-accuracy large 3D CNNs on mobile devices in real time (or even ultra-real-time).
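For readers unfamiliar with the Winograd side of the "prune/Winograd adaptive selection" mentioned above, the following is a minimal sketch of the standard 1D Winograd minimal filtering algorithm F(2,3). It is only an illustration of the general technique: the function names are hypothetical, and it does not reproduce the paper's 3D kernels, its pruning scheme, or its selection policy.

```python
def direct_conv_1d(d, g):
    """Reference: two outputs of a 3-tap filter over four inputs (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs using 4 element-wise multiplies."""
    # Filter transform; in practice this is precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform followed by the four multiplies.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    inputs = [1.0, 2.0, 3.0, 4.0]
    taps = [0.5, -1.0, 2.0]
    print(direct_conv_1d(inputs, taps))
    print(winograd_f23(inputs, taps))
```

Because the filter transform mixes all three taps, zeros introduced by pruning do not generally survive it, which is one plausible reason a framework would choose between pruning and Winograd per layer rather than applying both.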
Free, publicly-accessible full text available July 22, 2026
TMModel: Modeling Texture Memory and Mobile GPU Performance to Accelerate DNN Computations

https://doi.org/10.1145/3721145.3725774

Guan, Jiexiong; Hu, Zhenqing; Antonopoulos, Christos D; Bellas, Nikolaos; Lalis, Spyros; Smirni, Evgenia; Zhou, Gang; Agrawal, Gagan; Ren, Bin (June 2025, ACM)

The demand for Deep Neural Network (DNN) execution (including both inference and training) on mobile systems-on-a-chip (SoCs) has surged, driven by factors like the need for real-time latency, privacy, and reducing vendors’ costs. Mainstream mobile GPUs (e.g., Qualcomm Adreno GPUs) usually have a 2.5D L1 texture cache that offers throughput superior to that of on-chip memory. However, to date, there is limited understanding of the performance features of such a 2.5D cache, which limits the optimization potential. This paper introduces TMModel, a framework with three components: 1) a set of micro-benchmarks and a novel performance assessment methodology to characterize a poorly documented architecture with 2D memory, 2) a complete analytical performance model configurable for different data access pattern(s), tiling size(s), and other GPU execution parameters for a given operator (and associated size and shape), and 3) a compilation framework incorporating this model and generating optimized code with low overhead. TMModel is validated both on a set of DNN kernels and for training complete models on a mobile GPU, and compared against both popular mobile DNN frameworks and another GPU performance model. Evaluation results demonstrate that TMModel outperforms all baselines, achieving 1.48× to 3.61× speedup on individual kernels and 1.83× to 66.1× speedup for end-to-end on-device training with only 0.25% to 18.5% of the tuning cost of the baselines.
Free, publicly-accessible full text available June 8, 2026
Towards Recognizing Food Types for Unseen Subjects

https://doi.org/10.1145/3696424

Guan, Jiexiong; Wang, Junjie; Niu, Wei; Peng, Zhen; Wang, Shuangquan; Liu, Zhenming; Zhou, Gang; Ren, Bin (September 2024, ACM Transactions on Computing for Healthcare)

Recognizing food types through sensor signals for unseen users remains remarkably challenging, despite extensive recent studies. The efficacy of prior machine learning techniques is dwarfed by giant variations of data collected from multiple participants, partly because users have varied chewing habits and wear sensor devices in various manners. This work treats the problem as an instance of the domain adaptation problem, where each user represents a domain. We develop the first multi-source domain adaptation (MSDA) method for food type recognition, which consists of three major components: stratified normalization, a multi-source domain adaptor, and adaptive ensemble learning. New techniques are developed for each component. Using a real-world dataset comprising 15 participants, we demonstrate that our method achieves a 1.33× to 2.13× improvement in accuracy compared with nine state-of-the-art MSDA baselines. Additionally, we perform an in-depth ablation study to examine the behavior of each component and confirm their efficacy.
Full Text Available
SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

https://doi.org/10.1145/3620666.3651384

Niu, Wei; Sanim, Md Musfiqur Rahman; Shu, Zhihao; Guan, Jiexiong; Shen, Xipeng; Yin, Miao; Agrawal, Gagan; Ren, Bin (April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Full Text Available
Pruning Parameterization with Bi-level Optimization for Efficient Semantic Segmentation on the Edge

Yang, Changdi; Zhao, Pu; Li, Yanyu; Niu, Wei; Guan, Jiexiong; Tang, Hao; Qin, Minghai; Ren, Bin; Lin, Xue; Wang, Yanzhi (June 2023, The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR))

With the ever-increasing popularity of edge devices, it is necessary to implement real-time segmentation on the edge for autonomous driving and many other applications. Vision Transformers (ViTs) have shown considerably stronger results for many vision tasks. However, ViTs with the full attention mechanism usually consume a large number of computational resources, leading to difficulties for real-time inference on edge devices. In this paper, we aim to derive ViTs with fewer computations and fast inference speed to facilitate the dense prediction of semantic segmentation on edge devices. To achieve this, we propose a pruning parameterization method to formulate the pruning problem of semantic segmentation. Then we adopt a bi-level optimization method to solve this problem with the help of implicit gradients. Our experimental results demonstrate that we can achieve 38.9 mIoU on ADE20K val with a speed of 56.5 FPS on Samsung S21, which is the highest mIoU under the same computation constraint with real-time inference.
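The mIoU figure quoted above is mean intersection-over-union, the usual semantic segmentation metric. A minimal sketch of how it is typically computed follows, assuming flat lists of per-pixel class labels; the function name and the toy data are hypothetical, and benchmark toolkits usually accumulate intersections and unions over a whole dataset and report the result as a percentage.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both label maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    predicted = [0, 0, 1, 1]
    ground_truth = [0, 1, 1, 1]
    print(mean_iou(predicted, ground_truth, num_classes=2))
```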
Full Text Available
Decentralized Application-Level Adaptive Scheduling for Multi-Instance DNNs on Open Mobile Devices

Sung, Hsin-Hsuan; Chen, Jou-An; Niu, Wei; Guan, Jiexiong; Ren, Bin; Shen, Xipeng (January 2023, Proceedings of the 2023 USENIX Annual Technical Conference)

As more apps embrace AI, it is becoming increasingly common that multiple Deep Neural Networks (DNN)-powered apps may run at the same time on a mobile device. This paper explores scheduling in such multi-instance DNN scenarios, on general open mobile systems (e.g., common smartphones and tablets). Unlike closed systems (e.g., autonomous driving systems) where the set of co-run apps is known beforehand, the user of an open mobile system may install or uninstall arbitrary apps at any time, and a centralized solution is subject to adoption barriers. This work proposes the first-known decentralized application-level scheduling mechanism to address the problem. By leveraging the adaptivity of Deep Reinforcement Learning, the solution is shown to make the scheduling of co-run apps converge to a Nash equilibrium point, yielding a good balance of gains among the apps. The solution moreover automatically adapts to the running environment and the underlying OS and hardware. Experiments show that the solution consistently produces significant speedups and energy savings across DNN workloads, hardware configurations, and running scenarios.
Full Text Available
GCD²: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs

https://doi.org/10.1109/MICRO56248.2022.00044

Niu, Wei; Guan, Jiexiong; Shen, Xipeng; Wang, Yanzhi; Agrawal, Gagan; Ren, Bin (October 2022, 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO))

More specialized chips are exploiting available high transistor density to expose parallelism at a large scale with more intricate instruction sets. This paper reports on a compilation system, GCD², developed to support complex Deep Neural Network (DNN) workloads on mobile DSP chips. We observe several challenges in fully exploiting this architecture, related to SIMD width, more complex SIMD/vector instructions, and VLIW pipeline with the notion of soft dependencies. GCD² comprises the following contributions: 1) development of matrix layout formats that support the use of different novel SIMD instructions, 2) formulation and solution of a global optimization problem related to choosing the best instruction (and associated layout) for implementation of each operator in a complete DNN, and 3) SDA, an algorithm for packing instructions with consideration for soft dependencies. These solutions are incorporated in a complete compilation system that is extensively evaluated against other systems using 10 large DNN models. Evaluation results show that GCD² outperforms two product-level state-of-the-art end-to-end DNN execution frameworks (TFLite and Qualcomm SNPE) that support mobile DSPs with speedups of up to 6.0×, and outperforms three established compilers (Halide, TVM, and RAKE) with speedups of up to 4.5×, 3.4×, and 4.0×, respectively. GCD² is also unique in supporting real-time execution of certain DNNs, while its implementation enables two major DNNs to execute on a mobile DSP for the first time.
Full Text Available
Towards Socially Acceptable Food Type Recognition

https://doi.org/10.1109/MSN57253.2022.00110

Wang, Junjie; Guan, Jiexiong; Hong, Y. Alicia; Xue, Hong; Wang, Shuangquan; Liu, Zhenming; Ren, Bin; Zhou, Gang (December 2022, IEEE)

Automatic food type recognition is an essential task of dietary monitoring. It helps medical professionals recognize a user’s food contents, estimate the amount of energy intake, and design a personalized intervention model to prevent many chronic diseases, such as obesity and heart disease. Various wearable and mobile devices are utilized as platforms for food type recognition. However, none of them has been widely used in our daily lives and, at the same time, socially acceptable enough for continuous wear. In this paper, we propose a food type recognition method that takes advantage of AirPods Pro, a pair of widely used wireless in-ear headphones designed by Apple, to recognize 20 different types of food. As far as we know, we are the first to use this socially acceptable commercial product to recognize food types. Audio and motion sensor data are collected from AirPods Pro. Then 135 representative features are extracted and selected to construct the recognition model using the LightGBM algorithm. A real-world data collection is conducted to comprehensively evaluate the performance of the proposed method for seven human subjects. The results show that the average F1-score reaches 94.4% for the ten-fold cross-validation test and 96.0% for the self-evaluation test.
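The abstract reports an average F1-score without stating how the average is taken. As a hedged illustration only, the sketch below computes a macro-averaged F1 (per-class F1, then an unweighted mean), which is one common reading; the function name and the toy food labels are hypothetical and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); the denominator is nonzero because
        # every label appears at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["rice", "rice", "apple", "apple"]
    guess = ["rice", "apple", "apple", "apple"]
    print(macro_f1(truth, guess))
```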
Full Text Available
Towards Real-Time Segmentation on the Edge

Li, Yanyu; Yang, Changdi; Zhao, Pu; Yuan, Geng; Niu, Wei; Guan, Jiexiong; Tang, Hao; Qin, Minghai; Ren, Bin; Lin, Xue; et al (February 2023, AAAI'23: The Thirty-Seventh AAAI Conference on Artificial Intelligence)

There have been many recent attempts to extend the successes of convolutional neural networks (CNNs) from 2-dimensional (2D) image classification to 3-dimensional (3D) video recognition by exploring 3D CNNs. Considering the growth of the mobile and Internet of Things (IoT) markets, it is essential to investigate the deployment of 3D CNNs on edge devices. Previous works have implemented standard 3D CNNs (C3D) on hardware platforms; however, they have not exploited model compression for acceleration of inference. This work proposes a hardware-aware pruning approach that can fully adapt to the loop tiling technique of FPGA design and is applied to a novel 3D network called R(2+1)D. Leveraging the powerful ADMM, the proposed pruning method achieves simultaneous high accuracy and significant acceleration of computation on FPGA. With layer-wise pruning rates up to 10× and negligible accuracy loss, the pruned model is implemented on a Xilinx ZCU102 FPGA board, where the pruned model achieves 2.6× speedup compared with the unpruned version, and 2.3× speedup and 2.3× power efficiency improvement compared with state-of-the-art FPGA implementation of C3D.
Full Text Available
Towards Real-Time Segmentation on the Edge

https://doi.org/10.1609/aaai.v37i2.25232

Li, Yanyu; Yang, Changdi; Zhao, Pu; Yuan, Geng; Niu, Wei; Guan, Jiexiong; Tang, Hao; Qin, Minghai; Jin, Qing; Ren, Bin; et al (February 2023, Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-23))

Full Text Available
