NSF PAR Search | NSF Public Access Repository

Generalizing Reuse Patterns for Efficient DNN on Microcontrollers

https://doi.org/10.1145/3676641.3716257

Liu, Jiesong; Ren, Bin; Shen, Xipeng (March 2025, ACM)

Full Text Available
SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

https://doi.org/10.1145/3620666.3651384

Niu, Wei; Sanim, Md_Musfiqur Rahman; Shu, Zhihao; Guan, Jiexiong; Shen, Xipeng; Yin, Miao; Agrawal, Gagan; Ren, Bin (April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Full Text Available
SoD²: Statically Optimizing Dynamic Deep Neural Network Execution

https://doi.org/10.1145/3617232.3624869

Niu, Wei; Agrawal, Gagan; Ren, Bin (April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Full Text Available
Survey: Exploiting Data Redundancy for Optimization of Deep Learning

https://doi.org/10.1145/3564663

Chen, Jou-An; Niu, Wei; Ren, Bin; Wang, Yanzhi; Shen, Xipeng (October 2023, ACM Computing Surveys)

Data redundancy is ubiquitous in the inputs and intermediate results of Deep Neural Networks (DNNs). It offers many significant opportunities for improving DNN performance and efficiency and has been explored in a large body of work. These studies are scattered across many venues and several years. The targets they focus on range from images to videos and texts, and the techniques they use to detect and exploit data redundancy also vary in many aspects. There is not yet a systematic examination and summary of these efforts, making it difficult for researchers to get a comprehensive view of the prior work, the state of the art, differences and shared principles, and the areas and directions yet to be explored. This article tries to fill the void. It surveys hundreds of recent papers on the topic, introduces a novel taxonomy to put the various techniques into a single categorization framework, offers a comprehensive description of the main methods used for exploiting data redundancy in improving multiple kinds of DNNs on data, and points out a set of research opportunities for future exploration.
Full Text Available
Pruning Parameterization with Bi-level Optimization for Efficient Semantic Segmentation on the Edge

Yang, Changdi; Zhao, Pu; Li, Yanyu; Niu, Wei; Guan, Jiexiong; Tang, Hao; Qin, Minghai; Ren, Bin; Lin, Xue; Wang, Yanzhi (June 2023, The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR))

With the ever-increasing popularity of edge devices, it is necessary to implement real-time segmentation on the edge for autonomous driving and many other applications. Vision Transformers (ViTs) have shown considerably stronger results for many vision tasks. However, ViTs with the full-attention mechanism usually consume a large amount of computational resources, making real-time inference on edge devices difficult. In this paper, we aim to derive ViTs with fewer computations and fast inference speed to facilitate the dense prediction of semantic segmentation on edge devices. To achieve this, we propose a pruning parameterization method to formulate the pruning problem of semantic segmentation. Then we adopt a bi-level optimization method to solve this problem with the help of implicit gradients. Our experimental results demonstrate that we can achieve 38.9 mIoU on ADE20K val with a speed of 56.5 FPS on Samsung S21, which is the highest mIoU under the same computation constraint with real-time inference.
Full Text Available
iQAN: Fast and Accurate Vector Search with Efficient Intra-Query Parallelism on Multi-Core Architectures

https://doi.org/10.1145/3572848.3577527

Peng, Zhen; Zhang, Minjia; Li, Kai; Jin, Ruoming; Ren, Bin (February 2023, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming)

Vector search has drawn rapidly increasing interest in the research community due to its use in novel AI applications. Maximizing its performance is essential for many tasks but remains poorly understood. In this work, we investigate the root causes of the scalability bottleneck of using intra-query parallelism to speed up state-of-the-art graph-based vector search systems on multi-core architectures. Our in-depth analysis reveals several scalability challenges from both system and algorithm perspectives. Based on these insights, we propose iQAN, a parallel search algorithm with a set of optimizations that boost convergence, avoid redundant computations, and mitigate synchronization overhead. Our evaluation results on a wide range of real-world datasets show that iQAN achieves up to 37.7× and 76.6× lower latency than state-of-the-art sequential baselines on datasets ranging in size from a million to a hundred million vectors. We also show that iQAN achieves outstanding scalability as the graph size or the accuracy target increases, allowing it to outperform the state-of-the-art baseline on two billion-scale datasets by up to 16.0× with up to 64 cores.
End-to-End LU Factorization of Large Matrices on GPUs

https://doi.org/10.1145/3572848.3577486

Xia, Yang; Jiang, Peng; Agrawal, Gagan; Ramnath, Rajiv (February 2023, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming)

LU factorization for sparse matrices is an important computing step for many engineering and scientific problems such as circuit simulation. There have been many efforts toward parallelizing and scaling this algorithm, including recent efforts targeting GPUs. However, it is still challenging to deploy a complete sparse LU factorization workflow on a GPU due to high memory requirements and data dependencies. In this paper, we propose the first complete GPU solution for sparse LU factorization. To achieve this goal, we propose an out-of-core implementation of the symbolic execution phase, thus removing the bottleneck due to large intermediate data structures. Next, we propose a dynamic parallelism implementation of Kahn's algorithm for topological sort on GPUs. Finally, for the numeric factorization phase, we increase the degree of parallelism by removing the memory limits for large matrices as compared to existing implementation approaches. Experimental results show that, compared with an implementation modified from GLU 3.0, our out-of-core version achieves speedups of 1.13–32.65×. Further, our out-of-core implementation achieves a speedup of 1.2–2.2× over an optimized unified memory implementation on the GPU. Finally, we show that the optimizations we introduce for numeric factorization turn out to be effective.
Full Text Available
Towards Real-Time Segmentation on the Edge

Li, Yanyu; Yang, Changdi; Zhao, Pu; Yuan, Geng; Niu, Wei; Guan, Jiexiong; Tang, Hao; Qin, Minghai; Ren, Bin; Lin, Xue; et al (February 2023, AAAI'23: The Thirty-Seventh AAAI Conference on Artificial Intelligence)

There have been many recent attempts to extend the successes of convolutional neural networks (CNNs) from 2-dimensional (2D) image classification to 3-dimensional (3D) video recognition by exploring 3D CNNs. Considering the growth of the mobile and Internet of Things (IoT) markets, it is essential to investigate the deployment of 3D CNNs on edge devices. Previous works have implemented standard 3D CNNs (C3D) on hardware platforms; however, they have not exploited model compression to accelerate inference. This work proposes a hardware-aware pruning approach that can fully adapt to the loop tiling technique of FPGA design and is applied to a novel 3D network called R(2+1)D. Leveraging the powerful ADMM, the proposed pruning method achieves simultaneous high accuracy and significant acceleration of computation on FPGA. With layer-wise pruning rates up to 10× and negligible accuracy loss, the pruned model is implemented on a Xilinx ZCU102 FPGA board, where it achieves 2.6× speedup compared with the unpruned version, and 2.3× speedup and 2.3× power efficiency improvement compared with a state-of-the-art FPGA implementation of C3D.
Full Text Available
Towards Socially Acceptable Food Type Recognition

https://doi.org/10.1109/MSN57253.2022.00110

Wang, Junjie; Guan, Jiexiong; Hong, Y. Alicia; Xue, Hong; Wang, Shuangquan; Liu, Zhenming; Ren, Bin; Zhou, Gang (December 2022, IEEE)

Automatic food type recognition is an essential task of dietary monitoring. It helps medical professionals recognize a user’s food contents, estimate the amount of energy intake, and design a personalized intervention model to prevent many chronic diseases, such as obesity and heart disease. Various wearable and mobile devices are utilized as platforms for food type recognition. However, none of them is both widely used in daily life and socially acceptable enough for continuous wear. In this paper, we propose a food type recognition method that takes advantage of AirPods Pro, a pair of widely used wireless in-ear headphones designed by Apple, to recognize 20 different types of food. As far as we know, we are the first to use this socially acceptable commercial product to recognize food types. Audio and motion sensor data are collected from AirPods Pro. Then 135 representative features are extracted and selected to construct the recognition model using the LightGBM algorithm. A real-world data collection is conducted to comprehensively evaluate the performance of the proposed method for seven human subjects. The results show that the average F1-score reaches 94.4% for the ten-fold cross-validation test and 96.0% for the self-evaluation test.
Full Text Available
SparCL: Sparse Continual Learning on the Edge

Wang, Zifeng; Zhan, Zheng; Gong, Yifan; Yuan, Geng; Niu, Wei; Jian, Tong; Ren, Bin; Ioannidis, Stratis; Wang, Yanzhi; Dy, Jennifer (December 2022, 2022 Conference on Neural Information Processing Systems)

Existing work in continual learning (CL) focuses on mitigating catastrophic forgetting, i.e., model performance deterioration on past tasks when learning a new task. However, the training efficiency of a CL system is under-investigated, which limits the real-world application of CL systems under resource-limited scenarios. In this work, we propose a novel framework called Sparse Continual Learning (SparCL), which is the first study that leverages sparsity to enable cost-effective continual learning on edge devices. SparCL achieves both training acceleration and accuracy preservation through the synergy of three aspects: weight sparsity, data efficiency, and gradient sparsity. Specifically, we propose task-aware dynamic masking (TDM) to learn a sparse network throughout the entire CL process, dynamic data removal (DDR) to remove less informative training data, and dynamic gradient masking (DGM) to sparsify the gradient updates. Each of them not only improves efficiency, but also further mitigates catastrophic forgetting. SparCL consistently improves the training efficiency of existing state-of-the-art (SOTA) CL methods, requiring up to 23× fewer training FLOPs, and, surprisingly, further improves the SOTA accuracy by up to 1.7%. SparCL also outperforms competitive baselines obtained from adapting SOTA sparse training methods to the CL setting in both efficiency and accuracy. We also evaluate the effectiveness of SparCL on a real mobile phone, further indicating the practical potential of our method.
Full Text Available
