Many IoT applications have increasingly adopted machine learning (ML) techniques, such as classification and detection, to enhance automation and decision-making processes. With advances in hardware accelerators such as Nvidia’s Jetson embedded GPUs, the computational capabilities of end devices, particularly for ML inference workloads, have significantly improved in recent years. These advances have opened opportunities for distributing computation across the edge network, enabling optimal resource utilization and reducing request latency. Previous research has demonstrated promising results in collaborative inference, where processing units in the edge network, such as end devices and edge servers, collaboratively execute an inference request to minimize latency.This paper explores approaches for implementing collaborative inference on a single model in resource-constrained edge networks, including on-device, device-edge, and edge-edge collaboration. We present preliminary results from proof-of-concept experiments to support each case. We discuss dynamic factors that can impact the performance of these inference execution strategies, such as network variability, thermal constraints, and workload fluctuations. Finally, we outline potential directions for future research. 
                        more » 
                        « less   
                    
                            
                            Model-driven Cluster Resource Management for AI Workloads in Edge Clouds
                        
                    
    
            Since emerging edge applications such as Internet of Things (IoT) analytics and augmented reality have tight latency constraints, hardware AI accelerators have been recently proposed to speed up deep neural network (DNN) inference run by these applications. Resource-constrained edge servers and accelerators tend to be multiplexed across multiple IoT applications, introducing the potential for performance interference between latency-sensitive workloads. In this article, we design analytic models to capture the performance of DNN inference workloads on shared edge accelerators, such as GPU and edgeTPU, under different multiplexing and concurrency behaviors. After validating our models using extensive experiments, we use them to design various cluster resource management algorithms to intelligently manage multiple applications on edge accelerators while respecting their latency constraints. We implement a prototype of our system in Kubernetes and show that our system can host 2.3× more DNN applications in heterogeneous multi-tenant edge clusters with no latency violations when compared to traditional knapsack hosting algorithms. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10409959
- Date Published:
- Journal Name:
- ACM Transactions on Autonomous and Adaptive Systems
- Volume:
- 18
- Issue:
- 1
- ISSN:
- 1556-4665
- Page Range / eLocation ID:
- 1 to 26
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud plat- forms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multi- ple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substan- tially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen,1 a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evalu- ate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen flexibility by implementing state-of-the-art adapta- tion policies using Dělen’s API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications.more » « less
- 
            null (Ed.)Edge computing has emerged as a popular paradigm for supporting mobile and IoT applications with low latency or high bandwidth needs. The attractiveness of edge computing has been further enhanced due to the recent availability of special-purpose hardware to accelerate specific compute tasks, such as deep learning inference, on edge nodes. In this paper, we experimentally compare the benefits and limitations of using specialized edge systems, built using edge accelerators, to more traditional forms of edge and cloud computing. Our experimental study using edge-based AI workloads shows that today's edge accelerators can provide comparable, and in many cases better, performance, when normalized for power or cost, than traditional edge and cloud servers. They also provide latency and bandwidth benefits for split processing, across and within tiers, when using model compression or model splitting, but require dynamic methods to determine the optimal split across tiers. We find that edge accelerators can support varying degrees of concurrency for multi-tenant inference applications, but lack isolation mechanisms necessary for edge cloud multi-tenant hosting.more » « less
- 
            null (Ed.)Deep neural networks (DNNs) are increasingly used for real-time inference, requiring low latency, but require significant computational power as they continue to increase in complexity. Edge clouds promise to offer lower latency due to their proximity to end-users and having powerful accelerators like GPUs to provide the computation power needed for DNNs. But it is also important to ensure that the edge-cloud resources are utilized well. For this, multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments have significant interactions with the CPU, to transfer data to the GPU, for CPU-GPU synchronization on inference task completions, etc. These result in overheads. We present a DNN inference framework with a set of software primitives that reduce the overhead for DNN inference, increase GPU utilization and improve performance, with lower latency and higher throughput. Our first primitive uses the GPU DMA effectively, reducing the CPU cycles spent to transfer the data to the GPU. A second primitive uses asynchronous ‘events’ for faster task completion notification. GPU runtimes typically preclude fine-grained user control on GPU resources, causing long GPU downtimes when adjusting resources. Our third primitive supports overlapping of model-loading and execution, thus allowing GPU resource re-allocation with very little GPU idle time. Our other primitives increase inference throughput by improving scheduling and processing more requests. Overall, our primitives decrease inference latency by more than 35% and increase DNN throughput by 2-3×.more » « less
- 
            In pursuit of higher inference accuracy, deep neural network (DNN) models have significantly increased in complexity and size. To overcome the consequent computational challenges, scalable chiplet-based accelerators have been proposed. However, data communication using metallic-based interconnects in these chiplet-based DNN accelerators is becoming a primary obstacle to performance, energy efficiency, and scalability. The photonic interconnects can provide adequate data communication support due to some superior properties like low latency, high bandwidth and energy efficiency, and ease of broadcast communication. In this paper, we propose SPACX: a Silicon Photonics-based Chiplet ACcelerator for DNN inference applications. Specifically, SPACX includes a photonic network design that enables seamless single-chiplet and cross-chiplet broadcast communications, and a tailored dataflow that promotes data broadcast and maximizes parallelism. Furthermore, we explore the broadcast granularities of the photonic network and implications on system performance and energy efficiency. A flexible bandwidth allocation scheme is also proposed to dynamically adjust communication bandwidths for different types of data. Simulation results using several DNN models show that SPACX can achieve 78% and 75% reduction in execution time and energy, respectively, as compared to other state-of-the-art chiplet-based DNN accelerators.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    