skip to main content


Title: EdgePC: Efficient Deep Learning Analytics for Point Clouds on Edge Devices
Recently, point cloud (PC) has gained popularity in modeling various 3D objects (including both synthetic and real-life) and has been extensively utilized in a wide range of applications such as AR/VR, 3D reconstruction, and autonomous driving. For such applications, it is critical to analyze/understand the surrounding scenes properly. To achieve this, deep learning based methods (e.g., convolutional neural networks (CNNs)) have been widely employed for higher accuracy. Unlike the deep learning on conventional 2D images/videos, where the feature computation (matrix multiplication) is the major bottleneck, in point cloud-based CNNs, the sample and neighbor search stages are the primary bottlenecks, and collectively contribute to 54% (up to 80%) of the overall execution latency on a typical edge device. While prior efforts have attempted to solve this issue by designing custom ASICs or pipelining the neighbor search with other stages, to our knowledge, none of them has tried to “structurize” the unstructured PC data for improving computational efficiency. In this paper, we first explore the opportunities of structurizing PC data using Morton code (which is originally designed to map data from a high dimensional space to one dimension, while preserving spatial locality) and observe that there is a huge scope to “skip” the sample and neighbor search computation by operating on the “structurized” PC data. Based on this, we propose two approximation techniques for the sampling and neighbor search stages. We implemented our proposals on an NVIDIA Jetson AGX Xavier edge GPU board. The evaluation results collected on six different workloads show that our design can accelerate the sample and neighbor search stages by 3.68× (up to 5.21×) with minimal impact on inference accuracy. This acceleration in turn results in 1.55× speedup in the end-to-end execution latency and saves 33% of energy expenditure.  more » « less
Award ID(s):
1763681
NSF-PAR ID:
10440142
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
International Symposium on Computer Architecture 2023
Page Range / eLocation ID:
1 to 14
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With the proliferation of low-cost sensors and the Internet of Things, the rate of producing data far exceeds the compute and storage capabilities of today’s infrastructure. Much of this data takes the form of time series, and in response, there has been increasing interest in the creation of time series archives in the last decade, along with the development and deployment of novel analysis methods to process the data. The general strategy has been to apply a plurality of similarity search mechanisms to various subsets and subsequences of time series data in order to identify repeated patterns and anomalies; however, the computational demands of these approaches renders them incompatible with today’s power-constrained embedded CPUs. To address this challenge, we present FA-LAMP, an FPGA-accelerated implementation of the Learned Approximate Matrix Profile (LAMP) algorithm, which predicts the correlation between streaming data sampled in real-time and a representative time series dataset used for training. FA-LAMP lends itself as a real-time solution for time series analysis problems such as classification. We present the implementation of FA-LAMP on both edge- and cloud-based prototypes. On the edge devices, FA-LAMP integrates accelerated computation as close as possible to IoT sensors, thereby eliminating the need to transmit and store data in the cloud for posterior analysis. On the cloud-based accelerators, FA-LAMP can execute multiple LAMP models on the same board, allowing simultaneous processing of incoming data from multiple data sources across a network. LAMP employs a Convolutional Neural Network (CNN) for prediction. This work investigates the challenges and limitations of deploying CNNs on FPGAs using the Xilinx Deep Learning Processor Unit (DPU) and the Vitis AI development environment. We expose several technical limitations of the DPU, while providing a mechanism to overcome them by attaching custom IP block accelerators to the architecture. We evaluate FA-LAMP using a low-cost Xilinx Ultra96-V2 FPGA as well as a cloud-based Xilinx Alveo U280 accelerator card and measure their performance against a prototypical LAMP deployment running on a Raspberry Pi 3, an Edge TPU, a GPU, a desktop CPU, and a server-class CPU. In the edge scenario, the Ultra96-V2 FPGA improved performance and energy consumption compared to the Raspberry Pi; in the cloud scenario, the server CPU and GPU outperformed the Alveo U280 accelerator card, while the desktop CPU achieved comparable performance; however, the Alveo card offered an order of magnitude lower energy consumption compared to the other four platforms. Our implementation is publicly available at https://github.com/aminiok1/lamp-alveo. 
    more » « less
  2. In recent years, machine learning research has largely shifted focus from the cloud to the edge. While the resulting algorithm- and hardware-level optimizations have enabled local execution for the majority of deep neural networks (DNNs) on edge devices, the sheer magnitude of DNNs associated with real-time video detection workloads has forced them to remain relegated to remote execution in the cloud. This problematic when combined with the strict latency requirements that are coupled with these workloads, and imposes a unique set of challenges not directly addressed in prior works. In this work, we design MobiEye, a cloud-based video detection system optimized for deployment in real-time mobile applications. MobiEye is able to achieve up to a 32% reduction in latency when compared to a conventional implementation of video detection system with only a marginal reduction in accuracy. 
    more » « less
  3. Graph Neural Networks (GNNs) are becoming increasingly popular for vision-based applications due to their intrinsic capacity in modeling structural and contextual relations between various parts of an image frame. On another front, the rising popularity of deep vision-based applications at the edge has been facilitated by the recent advancements in heterogeneous multi-processor Systems on Chips (MPSoCs) that enable inference under real-time, stringent execution requirements. By extension, GNNs employed for vision-based applications must adhere to the same execution requirements. Yet contrary to typical deep neural networks, the irregular flow of graph learning operations poses a challenge to running GNNs on such heterogeneous MPSoC platforms. In this paper, we propose a novel unifieddesign-mappingapproach for efficient processing of vision GNN workloads on heterogeneous MPSoC platforms. Particularly, we develop MaGNAS, a mapping-aware Graph Neural Architecture Search framework. MaGNAS proposes a GNN architectural design space coupled with prospective mapping options on a heterogeneous SoC to identify model architectures that maximize on-device resource efficiency. To achieve this, MaGNAS employs a two-tier evolutionary search to identify optimalGNNsandmappingpairings that yield the best performance trade-offs. Through designing a supernet derived from the recent Vision GNN (ViG) architecture, we conducted experiments on four (04) state-of-the-art vision datasets using both (i) a real hardware SoC platform (NVIDIA Xavier AGX) and (ii) a performance/cost model simulator for DNN accelerators. Our experimental results demonstrate that MaGNAS is able to provide1.57× latency speedup and is3.38× more energy-efficient for several vision datasets executed on the Xavier MPSoC vs. the GPU-only deployment while sustaining an average0.11%accuracy reduction from the baseline.

     
    more » « less
  4. Point cloud computation has become an increasingly more important workload thanks to its applications in autonomous driving. Unlike dense 2D computation, point cloud convolution has sparse and irregular computation patterns and thus requires dedicated inference system support with specialized high-performance kernels. While existing point cloud deep learning libraries have developed different dataflows for convolution on point clouds, they assume a single dataflow throughout the execution of the entire model. In this work, we systematically analyze and improve existing dataflows. Our resulting system, TorchSparse++, achieves 2.9x, 3.3x, 2.2x and 1.8x measured end-to-end speedup on an NVIDIA A100 GPU over the state-of-the-art MinkowskiEngine, SpConv 1.2, TorchSparse and SpConv v2 in inference respectively. Furthermore, TorchSparse++ is the only system to date that supports all necessary primitives for 3D segmentation, detection, and reconstruction workloads in autonomous driving. Code is publicly released at https://github.com/mit-han-lab/torchsparse. 
    more » « less
  5. The proliferation of innovative mobile services such as augmented reality, networked gaming, and autonomous driving has spurred a growing need for low-latency access to computing resources that cannot be met solely by existing centralized cloud systems. Mobile Edge Computing (MEC) is expected to be an effective solution to meet the demand for low-latency services by enabling the execution of computing tasks at the network-periphery, in proximity to end-users. While a number of recent studies have addressed the problem of determining the execution of service tasks and the routing of user requests to corresponding edge servers, the focus has primarily been on the efficient utilization of computing resources, neglecting the fact that non-trivial amounts of data need to be stored to enable service execution, and that many emerging services exhibit asymmetric bandwidth requirements. To fill this gap, we study the joint optimization of service placement and request routing in MEC-enabled multi-cell networks with multidimensional (storage-computation-communication) constraints. We show that this problem generalizes several problems in literature and propose an algorithm that achieves close-to-optimal performance using randomized rounding. Evaluation results demonstrate that our approach can effectively utilize the available resources to maximize the number of requests served by low-latency edge cloud servers. 
    more » « less