skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: PERSEUS: Characterizing Performance and Cost of Multi-Tenant Serving for CNN Models
Deep learning models are increasingly used for end-user applications, supporting both novel features such as facial recognition, and traditional features, e.g. web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers. These resources could also be under-utilized facing dynamic workloads, which may result in inflated serving costs. One potential way to alleviate this problem is by allowing hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. One of the key challenges is maximizing the resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with unique response time Service Level Agreement (SLA), and dynamic inference workloads. In this paper, we present PERSEUS, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented PERSEUS in Python atop a popular cloud inference server called Nvidia TensorRT Inference Server. Leveraging PERSEUS, we evaluated the inference throughput and cost for serving various models and demonstrated that multi-tenant model serving led to up to 12% cost reduction.  more » « less
Award ID(s):
1815619 1755659
PAR ID:
10156767
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2020 IEEE International Conference on Cloud Engineering (IC2E)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Edge computing has emerged as a popular paradigm for supporting mobile and IoT applications with low latency or high bandwidth needs. The attractiveness of edge computing has been further enhanced due to the recent availability of special-purpose hardware to accelerate specific compute tasks, such as deep learning inference, on edge nodes. In this paper, we experimentally compare the benefits and limitations of using specialized edge systems, built using edge accelerators, to more traditional forms of edge and cloud computing. Our experimental study using edge-based AI workloads shows that today's edge accelerators can provide comparable, and in many cases better, performance, when normalized for power or cost, than traditional edge and cloud servers. They also provide latency and bandwidth benefits for split processing, across and within tiers, when using model compression or model splitting, but require dynamic methods to determine the optimal split across tiers. We find that edge accelerators can support varying degrees of concurrency for multi-tenant inference applications, but lack isolation mechanisms necessary for edge cloud multi-tenant hosting. 
    more » « less
  2. Model-serving systems expose machine learning (ML) models to applications programmatically via a high-level API. Cloud plat- forms use these systems to mask the complexities of optimally managing resources and servicing inference requests across multi- ple applications. Model serving at the edge is now also becoming increasingly important to support inference workloads with tight latency requirements. However, edge model serving differs substan- tially from cloud model serving in its latency, energy, and accuracy constraints: these systems must support multiple applications with widely different latency and accuracy requirements on embedded edge accelerators with limited computational and energy resources. To address the problem, this paper presents Dělen,1 a flexible and adaptive model-serving system for multi-tenant edge AI. Dělen exposes a high-level API that enables individual edge applications to specify a bound at runtime on the latency, accuracy, or energy of their inference requests. We efficiently implement Dělen using conditional execution in multi-exit deep neural networks (DNNs), which enables granular control over inference requests, and evalu- ate it on a resource-constrained Jetson Nano edge accelerator. We evaluate Dělen flexibility by implementing state-of-the-art adapta- tion policies using Dělen’s API, and evaluate its adaptability under different workload dynamics and goals when running single and multiple applications. 
    more » « less
  3. With the growing demand for enhanced performance and scalability in cloud applications and systems, data center architectures are evolving to incorporate heterogeneous computing fabrics that leverage CPUs, GPUs, and FPGAs. Unlike traditional processing platforms like CPUs and GPUs, FPGAs offer the unique ability for hardware reconfiguration at run-time, enabling improved and tailored performance, flexibility, and acceleration. FPGAs excel at executing large-scale search optimization, acceleration, and signal processing tasks while consuming low power and minimizing latency. Major public cloud providers, such as Amazon, Huawei, Microsoft, Alibaba, and others, have already begun integrating FPGA-based cloud acceleration services into their offerings. Although FPGAs in cloud applications facilitate customized hardware acceleration, they also introduce new security challenges that demand attention. Granting cloud users the capability to reconfigure hardware designs after deployment may create potential vulnerabilities for malicious users, thereby jeopardizing entire cloud platforms. In particular, multi-tenant FPGA services, where a single FPGA is divided spatially among multiple users, are highly vulnerable to such attacks. This paper examines the security concerns associated with multi-tenant cloud FPGAs, provides a comprehensive overview of the related security, privacy and trust issues, and discusses forthcoming challenges in this evolving field of study. 
    more » « less
  4. null (Ed.)
    Dynamically reallocating computing resources to handle bursty workloads is a common practice for web applications (e.g., e-commerce) in clouds. However, our empirical analysis on a standard n-tier benchmark application (RUBBoS) shows that simply scaling an n-tier application by reallocating hardware resources without fast adapting soft resources (e.g., server threads, connections) may lead to large response time fluctuations. This is because soft resources control the workload concurrency of component servers in the system: adding or removing hardware resources such as Virtual Machines (VMs) can implicitly change the workload concurrency of dependent servers, causing either under- or over-utilization of the critical hardware resource in the system. To quickly identify the optimal soft resource allocation of each server in the system and stabilize response time fluctuation, we propose a novel Scatter-Concurrency-Throughput (SCT) model based on the monitoring of each server's real-time concurrency and throughput. We then implement a Concurrency-aware system Scaling (ConScale) framework which integrates the SCT model to fast adapt the soft resource allocations of key servers during the system scaling process. Our experiments using six realistic bursty workload traces show that ConScale can effectively mitigate the response time fluctuations of the target web application compared to the state-of-the-art cloud scaling strategies such as EC2-AutoScaling. 
    more » « less
  5. With faster wireless networks and server GPUs, offloading high-accuracy but compute-intensive AR tasks implemented in Deep Neural Networks (DNNs) to edge servers offers a promising way to support high-QoE Augmented/Mixed Reality (AR/MR) applications. A cost-effective way for AR app vendors to deploy such edge-assisted AR apps to support a large user base is to use commercial Machine-Learning-as-a-Service (MLaaS) deployed at the edge cloud. To maximize cost-effectiveness, such an MLaaS provider faces a key design challenge, \ie how to maximize the number of clients concurrently served by each GPU server in its cluster while meeting per-client AR task accuracy SLAs. The above AR offloading inference serving problem differs from generic inference serving or video analytics serving in one fundamental way: due to the use of local tracking which reuses the last server-returned inference result to derive results for the current frame, the offloading frequency and end-to-end latency of each AR client directly affect its AR task accuracy (for all the frames). In this paper, we present ARISE, a framework that optimizes the edge server capacity in serving edge-assisted AR clients. Our design exploits the intricate interplay between per-client offloading schedule and batched inference on the server via proactively coordinating offloading request streams from different AR clients. Our evaluation using a large set of emulated AR clients and a 10-phone testbed shows that \name supports 1.7x--6.9x more clients compared to various baselines while keeping the per-client accuracy within the client-specified accuracy SLAs. 
    more » « less