Low-latency and low-power edge AI is crucial for Virtual Reality and Augmented Reality applications. Recent advances demonstrate that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve a superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can present system challenges for latency and energy efficiency due to their diverse nature in dataflow and memory access patterns. In this work, we leverage architecture heterogeneity from Neural Processing Units (NPU) and Compute-In-Memory (CIM) and explore diverse execution schemas to efficiently execute these hybrid models. We introduce H4H-NAS, a two-stage Neural Architecture Search (NAS) framework to automate the design of efficient hybrid CNN/ViT models for heterogeneous edge systems featuring both NPU and CIM. We propose a two-phase incremental supernet training in our NAS framework to resolve gradient conflicts between sampled subnets caused by different types of blocks in a hybrid model search space. Our H4H-NAS approach is also powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN-ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet. Moreover, results from our algorithm/hardware co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing heterogeneous computing over baseline solutions. Overall, our framework guides the design of hybrid network architectures and system architectures for NPU+CIM heterogeneous systems.
more »
« less
Template Matching Based Early Exit CNN for Energy-efficient Myocardial Infarction Detection on Low-power Wearable Devices
Myocardial Infarction (MI), also known as heart attack, is a life-threatening form of heart disease that is a leading cause of death worldwide. Its recurrent and silent nature emphasizes the need for continuous monitoring through wearable devices. The wearable device solutions should provide adequate performance while being resource-constrained in terms of power and memory. This paper proposes an MI detection methodology using a Convolutional Neural Network (CNN) that outperforms the state-of-the-art works on wearable devices for two datasets - PTB and PTB-XL, while being energy and memory-efficient. Moreover, we also propose a novel Template Matching based Early Exit (TMEX) CNN architecture that further increases the energy efficiency compared to baseline architecture while maintaining similar performance. Our baseline and TMEX architecture achieve 99.33% and 99.24% accuracy on PTB dataset, whereas on PTB-XL dataset they achieve 84.36% and 84.24% accuracy, respectively. Both architectures are suitable for wearable devices requiring only 20 KB of RAM. Evaluation of real hardware shows that our baseline architecture is 0.6x to 53x more energy-efficient than the state-of-the-art works on wearable devices. Moreover, our TMEX architecture further improves the energy efficiency by 8.12% (PTB) and 6.36% (PTB-XL) while maintaining similar performance as the baseline architecture.
more »
« less
- PAR ID:
- 10394395
- Date Published:
- Journal Name:
- Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
- Volume:
- 6
- Issue:
- 2
- ISSN:
- 2474-9567
- Page Range / eLocation ID:
- 1 to 22
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization FrameworkEfficient deployment of Deep Neural Networks (DNNs) on edge devices (i.e., FPGAs and mobile platforms) is very challenging, especially under a recent witness of the increasing DNN model size and complexity. Model compression strategies, including weight quantization and pruning, are widely recognized as effective approaches to significantly reduce computation and memory intensities, and have been implemented in many DNNs on edge devices. However, most state-of-the-art works focus on ad-hoc optimizations, and there lacks a thorough study to comprehensively reveal the potentials and constraints of different edge devices when considering different compression strategies. In this paper, we qualitatively and quantitatively compare the energy efficiency of FPGA-based and mobile-based DNN executions using mobile GPU and provide a detailed analysis. Based on the observations obtained from the analysis, we propose a unified optimization framework using block-based pruning to reduce the weight storage and accelerate the inference speed on mobile devices and FPGAs, achieving high hardware performance and energy-efficiency gain while maintaining accuracy.more » « less
-
Abstract—Human activity recognition (HAR) is a challenging area of research with many applications in human-computer interaction. With advances in artificial neural networks (ANNs), methods of HAR feature extraction from wearable sensor data have greatly improved and have increased interest in their classification using ANNs. Most prior work has only investigated the software implementations of ANN-based HAR. Here, we investigate, for the first time, two novel hardware implementations for use in resource-constrained edge devices. Through architecture exploration, we identify first a hybrid ANN we call DCLSTM incorporating the convolutional and long-short-term memory techniques. The second is a much more compact implementation WCLSTM that uses wavelet transforms (WTs) to enhance feature extraction; it can achieve even better accuracy while being smaller and simpler; it is therefore the better choice for resource-constrained applications. We present hardware implementations of these ANNs and evaluate their performance and resource utilization on the UCI HAR and WISDM datasets. Synthesis results on an FPGA platform show the superiority of the WT-assisted version in accuracy and size. Moreover, our networks achieve a better accuracy than earlier published works.more » « less
-
Although state-of-the-art (SOTA) CNNs achieve outstanding performance on various tasks, their high computation demand and massive number of parameters make it difficult to deploy these SOTA CNNs onto resource-constrained devices. Previous works on CNN acceleration utilize low-rank approximation of the original convolution layers to reduce computation cost. However, these methods are very difficult to conduct upon sparse models, which limits execution speedup since redundancies within the CNN model are not fully exploited. We argue that kernel granularity decomposition can be conducted with low-rank assumption while exploiting the redundancy within the remaining compact coefficients. Based on this observation, we propose PENNI, a CNN model compression framework that is able to achieve model compactness and hardware efficiency simultaneously by (1) implementing kernel sharing in convolution layers via a small number of basis kernels and (2) alternately adjusting bases and coefficients with sparse constraints. Experiments show that we can prune 97% parameters and 92% FLOPs on ResNet18 CIFAR10 with no accuracy loss, and achieve 44% reduction in run-time memory consumption and a 53% reduction in inference latency.more » « less
-
Algorithm-hardware Co-optimization for Energy-efficient Drone Detection on Resource-constrained FPGAConvolutional neural network (CNN)-based object detection has achieved very high accuracy; e.g., single-shot multi-box detectors (SSDs) can efficiently detect and localize various objects in an input image. However, they require a high amount of computation and memory storage, which makes it difficult to perform efficient inference on resource-constrained hardware devices such as drones or unmanned aerial vehicles (UAVs). Drone/UAV detection is an important task for applications including surveillance, defense, and multi-drone self-localization and formation control. In this article, we designed and co-optimized an algorithm and hardware for energy-efficient drone detection on resource-constrained FPGA devices. We trained an SSD object detection algorithm with a custom drone dataset. For inference, we employed low-precision quantization and adapted the width of the SSD CNN model. To improve throughput, we use dual-data rate operations for DSPs to effectively double the throughput with limited DSP counts. For different SSD algorithm models, we analyze accuracy or mean average precision (mAP) and evaluate the corresponding FPGA hardware utilization, DRAM communication, and throughput optimization. We evaluated the FPGA hardware for a custom drone dataset, Pascal VOC, and COCO2017. Our proposed design achieves a high mAP of 88.42% on the multi-drone dataset, with a high energy efficiency of 79 GOPS/W and throughput of 158 GOPS using the Xilinx Zynq ZU3EG FPGA device on the Open Vision Computer version 3 (OVC3) platform. Our design achieves 1.1 to 8.7× higher energy efficiency than prior works that used the same Pascal VOC dataset, using the same FPGA device, but at a low-power consumption of 2.54 W. For the COCO dataset, our MobileNet-V1 implementation achieved an mAP of 16.8, and 4.9 FPS/W for energy-efficiency, which is ∼ 1.9× higher than prior FPGA works or other commercial hardware platforms.more » « less
An official website of the United States government

