Title: Protean: An Energy-Efficient and Heterogeneous Platform for Adaptive and Hardware-Accelerated Battery-Free Computing
Battery-free and intermittently powered devices offer long lifetimes and enable deployment in new applications and environments. Unfortunately, developing sophisticated inference-capable applications is still challenging due to the lack of platform support for more advanced (32-bit) microprocessors and specialized accelerators, which can execute data-intensive machine learning tasks but add complexity across the stack when dealing with intermittent power. We present Protean to bridge the platform gap for inference-capable battery-free sensors. Protean is designed for runtime scalability, matching the dynamic range of energy harvesters with heterogeneous processing elements such as neural network accelerators. We develop a modular "plug-and-play" hardware platform, SuperSensor, with a reconfigurable energy storage circuit that powers a 32-bit ARM-based microcontroller with a convolutional neural network accelerator. An adaptive task-based runtime system, Chameleon, provides intermittency-proof execution of machine learning tasks across heterogeneous processing elements. The runtime automatically scales and dispatches these tasks based on incoming energy, current state, and programmer annotations. A code generator, Metamorph, automates conversion of ML models to intermittency-safe execution across heterogeneous compute elements. We evaluate Protean with audio and image workloads and demonstrate up to 666x improvement in inference energy efficiency by enabling the use of modern computational elements within intermittent computing. Further, Protean provides up to 166% higher throughput compared to non-adaptive baselines.
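The following is a minimal sketch of the kind of energy-aware task dispatch described above: the runtime picks the highest-quality task variant that fits the currently buffered energy. The `TaskVariant` structure, the `dispatch` helper, and all energy numbers are hypothetical illustrations, not Protean's actual API or measurements.

```python
# Minimal sketch of energy-aware task dispatch in the spirit of Chameleon.
# All names and thresholds are hypothetical: the runtime picks a processing
# element and a task "scale" (e.g., model size) that fits the buffered energy.

from dataclasses import dataclass

@dataclass
class TaskVariant:
    name: str              # e.g., "cnn_small on accelerator"
    element: str           # "mcu" or "cnn_accel"
    energy_cost_uj: float  # estimated energy per invocation (microjoules)
    quality: float         # relative output quality, higher is better

def dispatch(variants, buffered_energy_uj, reserve_uj=50.0):
    """Pick the highest-quality variant that fits the energy budget,
    keeping a reserve so a power failure mid-task is unlikely."""
    budget = buffered_energy_uj - reserve_uj
    feasible = [v for v in variants if v.energy_cost_uj <= budget]
    if not feasible:
        return None  # sleep and wait for more harvested energy
    return max(feasible, key=lambda v: v.quality)

variants = [
    TaskVariant("cnn_full on accelerator", "cnn_accel", 400.0, 1.0),
    TaskVariant("cnn_small on accelerator", "cnn_accel", 150.0, 0.8),
    TaskVariant("dnn_tiny on mcu", "mcu", 90.0, 0.5),
]

print(dispatch(variants, buffered_energy_uj=300.0))  # -> cnn_small variant
print(dispatch(variants, buffered_energy_uj=100.0))  # -> None (wait)
```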
Award ID(s):
2145584 2038853
NSF-PAR ID:
10398838
Date Published:
Journal Name:
Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems (SenSys'22)
Page Range / eLocation ID:
207 to 221
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ko, Steve (Ed.)
    Today's smart devices have short battery lifetimes, high installation and maintenance costs, and rapid obsolescence, all of which have contributed to the explosion of electronic waste over the past two decades. These problems will worsen as the number of connected devices grows to one trillion by 2035. Energy-harvesting, battery-free devices offer an alternative. Getting rid of the battery reduces e-waste, promises long lifetimes, and enables deployment in new applications and environments. Unfortunately, developing sophisticated inference-capable applications is still challenging. The lack of platform support for advanced (32-bit) microprocessors and specialized accelerators, which can execute data-intensive machine learning tasks, has held back batteryless devices.
  2. Efficient and adaptive computer vision systems have been proposed to make computer vision tasks, such as image classification and object detection, practical on embedded or mobile devices. These relatively recent solutions focus on optimizing the model (a deep neural network, DNN) or the system by designing an adaptive system with approximation knobs. Despite several recent efforts, we show that existing solutions suffer from two major drawbacks. First, while mobile devices or systems-on-chip (SoCs) usually come with limited resources, including battery power, most systems do not consider the energy consumption of the models during inference. Second, they do not consider the interplay between the three metrics of interest in their configurations, namely latency, accuracy, and energy. In this work, we propose an efficient and adaptive video object detection system, Virtuoso, which is jointly optimized for accuracy, energy efficiency, and latency. Underlying Virtuoso is a multi-branch execution kernel that can run at different operating points along the accuracy-energy-latency axes, and a lightweight runtime scheduler that selects the best-fit execution branch to satisfy the user requirement (a minimal sketch of this selection appears after this list). We position this work as a first step in understanding the suitability of various object detection kernels on embedded boards along the accuracy-latency-energy axes, opening the door for further development of solutions customized to embedded systems and for benchmarking such solutions. Virtuoso achieves up to 286 FPS on the NVIDIA Jetson AGX Xavier board, which is up to 45 times faster than the baseline EfficientDet D3 and 15 times faster than the baseline EfficientDet D0. In addition, we observe up to 97.2% energy reduction using Virtuoso compared to the baseline YOLO (v3), a widely used object detector designed for mobiles. To compare fairly with Virtuoso, we benchmark 15 state-of-the-art or widely used protocols, including Faster R-CNN (FRCNN) [NeurIPS'15], YOLO v3 [CVPR'16], SSD [ECCV'16], EfficientDet [CVPR'20], SELSA [ICCV'19], MEGA [CVPR'20], REPP [IROS'20], FastAdapt [EMDL'21], and our in-house adaptive variants of FRCNN+, YOLO+, SSD+, and EfficientDet+ (our variants have enhanced efficiency for mobiles). In this comprehensive benchmark, Virtuoso outperforms all of the above protocols, leading the accuracy frontier at every efficiency level on NVIDIA Jetson mobile GPUs. Specifically, Virtuoso achieves an accuracy of 63.9%, more than 10% higher than popular object detection models such as FRCNN at 51.1% and YOLO at 49.5%.
  3. Battery-free sensing devices harvest energy from their surrounding environment to perform sensing, computation, and communication, enabling previously impossible applications in the Internet of Things. A core challenge for these devices is maintaining usefulness despite erratic, random, or irregular energy availability, which causes inconsistent execution, loss of service, and power failures. Adapting execution (degrading or upgrading) seems promising as a way to stave off power failures, meet deadlines, or increase throughput. However, because of constrained resources and limited local information, it is a challenge to decide when would be the best time to adapt, and how exactly to adapt execution. In this paper, we systematically explore the fundamental mechanisms of energy-aware adaptation and propose heuristic adaptation as a method for modulating the performance of tasks to enable higher sensor coverage, completion rates, or throughput, depending on the application (a minimal sketch of this heuristic appears after this list). We build a task-based adaptive runtime system for intermittently powered sensors embodying this concept. We complement this runtime with a user-facing simulator that enables programmers to conceptualize the tradeoffs they make when choosing which tasks to adapt, and how, relative to real-world energy harvesting environment traces. While we target battery-free, intermittently powered sensors, we see general application to all energy harvesting devices. We explore heuristic adaptation with varied energy harvesting modalities and diverse applications: machine learning, activity recognition, and greenhouse monitoring. We find that the adaptive version of our ML app performs up to 46% more classifications with only a 5% drop in accuracy; the activity recognition app captures 76% more classifications with only nominal down-sampling; and heuristic adaptation leads to higher throughput than a non-adaptive baseline in all cases.
  4. Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic (PL) with AI Engine processors (AIE) optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can theoretically provide up to 6.4 TFLOPs performance for 32-bit floating-point (fp32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. In our investigation, we observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify that the biggest system throughput bottleneck results from the mismatch between the massive computation resources of one monolithic accelerator and the many small MM layers in the application (a toy model of this mismatch appears after this list). To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models that guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system design, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework for four different deep learning applications, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPs, 1.61 TFLOPs, 1.74 TFLOPs, and 2.94 TFLOPs inference throughput for BERT, ViT, NCF, and MLP, respectively, obtaining 5.40x, 32.51x, 1.00x, and 1.00x throughput gains compared to one monolithic accelerator.
  5. Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud platforms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multiple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substantially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen, a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests (a minimal sketch of this mechanism appears after this list), and evaluate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen's flexibility by implementing state-of-the-art adaptation policies using its API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.
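Below is a hedged sketch of the branch selection referenced in item 2: among execution branches profiled offline for accuracy, latency, and energy, the scheduler picks the most accurate branch that satisfies the user's bounds. The branch table and the `select_branch` helper are invented for illustration and are not Virtuoso's published implementation; the numbers loosely echo the abstract but are not measurements.

```python
# Hypothetical sketch of Virtuoso-style branch selection. Each branch was
# profiled offline for (accuracy, latency, energy); pick the most accurate
# branch that satisfies the user's latency and energy bounds.

branches = [
    # (name, accuracy %, latency ms, energy mJ per frame) -- illustrative
    ("effdet_d3_full",  63.9, 45.0, 900.0),
    ("effdet_d0_small", 55.0, 12.0, 250.0),
    ("yolo_adaptive",   49.5,  8.0, 150.0),
]

def select_branch(branches, max_latency_ms, max_energy_mj):
    feasible = [b for b in branches
                if b[2] <= max_latency_ms and b[3] <= max_energy_mj]
    # Fall back to the cheapest branch if nothing meets both bounds.
    if not feasible:
        return min(branches, key=lambda b: b[3])
    return max(feasible, key=lambda b: b[1])

print(select_branch(branches, max_latency_ms=15.0, max_energy_mj=300.0))
# -> effdet_d0_small: the most accurate branch within both bounds
```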
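The heuristic adaptation referenced in item 3 can be illustrated with a small sketch: upgrade a task when harvested energy is trending up, degrade it when energy is trending down. The `adapt` function, the energy trend signal, and the sampling rates are all hypothetical, not the paper's runtime.

```python
# Illustrative sketch of heuristic energy-aware adaptation: modulate a
# sensing task's sampling rate based on a simple harvested-energy trend.

def adapt(sample_rate_hz, energy_now, energy_prev,
          min_rate=1.0, max_rate=64.0):
    """Upgrade when energy is rising, degrade when it is falling."""
    if energy_now > energy_prev and sample_rate_hz < max_rate:
        return sample_rate_hz * 2      # harvesting surplus: upgrade
    if energy_now < energy_prev and sample_rate_hz > min_rate:
        return sample_rate_hz / 2      # energy dropping: degrade
    return sample_rate_hz              # hold steady

rate = 8.0
for e_prev, e_now in [(100, 140), (140, 180), (180, 120), (120, 60)]:
    rate = adapt(rate, e_now, e_prev)
    print(rate)   # 16.0, 32.0, 16.0, 8.0
```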
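Item 4's core observation, that small MM layers waste a monolithic accelerator, can be shown with a toy analytical model of the sort CHARM's design space exploration builds on. All peak-FLOPS and efficiency numbers below are invented for illustration, not Versal ACAP measurements.

```python
# Toy analytical model: compare one monolithic MM accelerator against a
# concurrent big+small pair on a mix of large and small MM layers.

def time_on_accel(flops, peak_flops, efficiency):
    return flops / (peak_flops * efficiency)

layers = [
    # (flops, large layer?) -- small MMs under-utilize a big accelerator
    (2e9, True), (2e9, True), (5e7, False), (5e7, False), (5e7, False),
]

# Monolithic accelerator: efficient on big MMs, ~5%-style on small ones.
mono = sum(time_on_accel(f, 4e12, 0.8 if big else 0.04) for f, big in layers)

# CHARM-style split: a big accelerator takes the large layers while a small,
# well-matched one takes the small layers; running concurrently, the
# makespan is the max of the two.
t_big   = sum(time_on_accel(f, 3e12, 0.8) for f, big in layers if big)
t_small = sum(time_on_accel(f, 1e12, 0.6) for f, big in layers if not big)
charm = max(t_big, t_small)

print(f"monolithic: {mono*1e3:.2f} ms, partitioned: {charm*1e3:.2f} ms")
# -> the partitioned design finishes sooner despite lower peak FLOPS
```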
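Finally, a minimal sketch of the multi-exit conditional execution referenced in item 5: run a DNN stage by stage and return at the first exit whose confidence clears a threshold, so a tighter latency or energy bound maps to a lower threshold and earlier exits. The stage and exit functions here are pure-Python stand-ins, not Dělen's API.

```python
# Hedged sketch of multi-exit conditional execution: stop at the first
# exit head whose confidence clears the threshold.

def run_multi_exit(x, stages, exits, confidence_threshold):
    """stages[i] transforms features; exits[i] returns (label, confidence)."""
    for stage, exit_head in zip(stages, exits):
        x = stage(x)
        label, conf = exit_head(x)
        if conf >= confidence_threshold:
            return label, conf          # early exit: cheaper inference
    return label, conf                  # fell through to the final exit

# Tiny fake model: each stage refines the features, and each exit head
# grows more confident the deeper we go.
stages = [lambda x: x + 1] * 3
exits = [
    lambda x: ("cat", 0.55),
    lambda x: ("cat", 0.80),
    lambda x: ("cat", 0.95),
]

# A tight latency/energy bound maps to a lower threshold -> earlier exits.
print(run_multi_exit(0, stages, exits, confidence_threshold=0.75))  # exit 2
print(run_multi_exit(0, stages, exits, confidence_threshold=0.90))  # exit 3
```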