skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Power-aware Deep Learning Model Serving with µ-Serve. In Proceedings of the 2024 USENIX Annual Technical Conference (ATC 2024).
With the increasing popularity of large deep learning model serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while maintaining satisfied throughput or model-serving latency requirements. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize the model-serving performance. However, they fall short of leveraging the GPU frequency scaling opportunity for power saving. In this paper, we demonstrate (1) the benefits of GPU frequency scaling in power saving for model serving; and (2) the necessity for co-design and optimization of fine grained model multiplexing and GPU frequency scaling. We explore the co-design space and present a novel power-aware model-serving system, μ-Serve. μ-Serve is a model-serving framework that optimizes the power consumption and model serving latency/throughput of serving multiple ML models efficiently in a homogeneous GPU cluster. Evaluation results on production workloads show that μ-Serve achieves 1.2–2.6× power saving by dynamic GPU frequency scaling (up to 61% reduction) without SLO attainment violations.  more » « less
Award ID(s):
2029049
PAR ID:
10546467
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Editor(s):
Begnum, Kyrre; Border, Charles
Publisher / Repository:
Usenix_Atc_24
Date Published:
Edition / Version:
1
Volume:
1
Issue:
1
ISBN:
978-1-939133-41-0
Page Range / eLocation ID:
75-93
Subject(s) / Keyword(s):
power-aware deep learning model
Format(s):
Medium: X Size: 1062 kb Other: pdf
Size(s):
1062 kb
Location:
Santa Clara, CA
Sponsoring Org:
National Science Foundation
More Like this
  1. Begnum, Kyrre; Border, Charles (Ed.)
    With the increasing popularity of large deep learning model-serving workloads, there is a pressing need to reduce the energy consumption of a model-serving cluster while maintaining satisfied throughput or model-serving latency requirements. Model multiplexing approaches such as model parallelism, model placement, replication, and batching aim to optimize the model-serving performance. However, they fall short of leveraging the GPU frequency scaling opportunity for power saving. In this paper, we demonstrate (1) the benefits of GPU frequency scaling in power saving for model serving; and (2) the necessity for co-design and optimization of fine-grained model multiplexing and GPU frequency scaling. We explore the co-design space and present a novel power-aware model-serving system, μ-Serve. μ-Serve is a model-serving framework that optimizes the power consumption and model-serving latency/throughput of serving multiple ML models efficiently in a homogeneous GPU cluster. Evaluation results on production workloads show that μ-Serve achieves 1.2–2.6× power saving by dynamic GPU frequency scaling (up to 61% reduction) without SLO attainment violations. 
    more » « less
  2. Cloud platforms’ rapid growth raises significant concerns about their electricity consumption and resulting carbon emissions. Power capping is a known technique for limiting the power consumption of data centers where workloads are hosted. Today’s data center computer clusters co-locate latency-sensitive web and throughput-oriented batch workloads. When power capping is necessary, throttling only the batch tasks without restricting latency-sensitive web workloads is ideal because guaranteeing low response time for latency-sensitive workloads is a must due to Service-Level Objectives (SLOs) requirements. This paper proposes PADS, a hardware-agnostic workload-aware power capping system. Due to not relying on any hardware mechanism such as RAPL and DVFS, it can keep the power consumption of clusters equipped with heterogeneous architectures such as x86 and ARM below the enforced power limit while minimizing the impact on latency-sensitive tasks. It uses an application-performance model of both latency-sensitive and batch workloads to ensure power safety with controllable performance. Our power capping technique uses diagonal scaling and relies on using the control group feature of the Linux kernel. Our results indicate that PADS is highly effective in reducing power while respecting the tail latency requirement of the latency-sensitive workload. Furthermore, compared to state-of-the-art solutions, PADS demonstrates lower P95 latency, accompanied by a 90% higher effectiveness in respecting power limits. 
    more » « less
  3. Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10× higher rates or 6× more burstiness while staying within latency constraints for more than 99% of requests. 
    more » « less
  4. null (Ed.)
    Low-latency online services have strict Service Level Objectives (SLOs) that require datacenter systems to support high throughput at microsecond-scale tail latency. Dataplane operating systems have been designed to scale up multi-core servers with minimal overhead for such SLOs. However, as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RackSched, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RackSched is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server. We use a combination of analytical results and simulations to show that it provides near-optimal performance as centralized scheduling policies, and is robust for both low-dispersion and high-dispersion workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k- choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement a RackSched prototype on a cluster of commodity servers connected by a Barefoot Tofino switch. End-to-end experiments on a twelve-server testbed show that RackSched improves the throughput by up to 1.44x, and scales out the throughput near linearly, while maintaining the same tail latency as one server until the system is saturated. 
    more » « less
  5. With the rapid innovation of GPUs, heterogeneous GPU clusters in both public clouds and on-premise data centers have become increasingly commonplace. In this paper, we demonstrate how pipeline parallelism, a technique wellstudied for throughput-oriented deep learning model training, can be used effectively for serving latency-bound model inference, e.g., in video analytics systems, on heterogeneous GPU clusters. Our work exploits the synergy between diversity in model layers and diversity in GPU architectures, which results in comparable inference latency for many layers when running on low-class and high-class GPUs. We explore how such overlooked capability of low-class GPUs can be exploited using pipeline parallelism and present a novel inference serving system, PPipe, that employs pool-based pipeline parallelism via an MILP-based control plane and a data plane that performs resource reservation-based adaptive batching. Evaluation results on diverse workloads (18 CNN models) show that PPipe achieves 41.1%–65.5% higher utilization of low-class GPUs while maintaining high utilization of high-class GPUs, leading to 32.2%–75.1% higher serving throughput compared to various baselines. 
    more » « less