This content will become publicly available on November 17, 2025

Title: Fast and Efficient Scaling for Microservices with SurgeGuard
The microservice architecture is increasingly popular for flexible, large-scale online applications. However, existing resource management mechanisms incur high latency in detecting Quality of Service (QoS) violations and hence fail to allocate resources effectively under commonly observed varying load conditions. This results in over-allocation coupled with a late response, which together increase both the total cost of ownership and the magnitude of each QoS violation event. We present SurgeGuard, a decentralized resource controller for microservice applications specifically designed to guard application QoS during surges in load and network latency. SurgeGuard uses the key insight that, for rapid detection and effective management of QoS violations, the controller must be aware of any available slack in latency and communication patterns between microservices within a task graph. Our experiments show that for the workloads in DeathStarBench, SurgeGuard on average reduces the combined violation magnitude and duration by 61.1% and 93.7%, respectively, compared to the well-known Parties and Caladan algorithms, and requires 8% fewer resources than Parties.
Award ID(s):
2212579
PAR ID:
10632610
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-5291-7
Page Range / eLocation ID:
1 to 15
Subject(s) / Keyword(s):
Cloud computing microservices serverless quality-of-service resource management datacenters
Format(s):
Medium: X
Location:
Atlanta, GA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. User-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16× while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11×.
  2. The slowdown of Moore’s Law, combined with advances in 3D stacking of logic and memory, has pushed architects to revisit the concept of processing-in-memory (PIM) to overcome the memory wall bottleneck. This PIM renaissance finds itself in a very different computing landscape from the one twenty years ago, as more and more computation shifts to the cloud. Most PIM architecture papers still focus on best-effort applications, while PIM’s impact on latency-critical cloud applications is not well understood. This paper explores how datacenters can exploit PIM architectures in the context of latency-critical applications. We adopt a general-purpose cloud server with HBM-based, 3D-stacked logic+memory modules, and study the impact of PIM on six diverse interactive cloud applications. We reveal the previously neglected opportunity that PIM presents to these services, and show the importance of properly managing PIM-related resources to meet the QoS targets of interactive services and maximize resource efficiency. Then, we present PIMCloud, a QoS-aware resource manager designed for cloud systems with PIM, allowing colocation of multiple latency-critical and best-effort applications. We show that PIMCloud efficiently manages PIM resources: it (1) improves effective machine utilization by up to 70% and 85% (average 24% and 33%) under 2-app and 3-app mixes, compared to the best state-of-the-art manager; (2) helps latency-critical applications meet QoS; and (3) adapts to varying load patterns.
  3. Advances in virtualization technologies and edge computing have inspired a new paradigm for Internet-of-Things (IoT) application development. By breaking a monolithic application into loosely coupled microservices, great gains can be achieved in performance, flexibility and robustness. In this paper, we study the important problem of load balancing across IoT microservice instances. A key difficulty in this problem is the interdependencies among microservices: the load on a successor microservice instance directly depends on the load distributed from its predecessor microservice instances. We propose a graph-based model for describing the load dependencies among microservices. Based on the model, we first propose a basic formulation for load balancing, which can be solved optimally in polynomial time. The basic model neglects the quality-of-service (QoS) of the IoT application. We then propose a QoS-aware load balancing model, based on a novel abstraction that captures a realization of the application’s internal logic. The QoS-aware load balancing problem is NP-hard. We propose a fully polynomial-time approximation scheme for the QoS-aware problem. We show through simulation experiments that our proposed algorithm achieves enhanced QoS compared to heuristic solutions.
  4. Compute heterogeneity is increasingly gaining prominence in modern datacenters due to the addition of accelerators like GPUs and FPGAs. We observe that datacenter schedulers are agnostic of these emerging accelerators, especially their resource utilization footprints, and thus not well equipped to dynamically provision them based on the application needs. We observe that the state-of-the-art datacenter schedulers fail to provide fine-grained resource guarantees for latency-sensitive tasks that are GPU-bound. Specifically for GPUs, this results in resource fragmentation and interference, leading to poor utilization of allocated GPU resources. Furthermore, GPUs exhibit highly linear energy efficiency with respect to utilization, and hence proactive management of these resources is essential to keep the operational costs low while ensuring the end-to-end Quality of Service (QoS) in case of user-facing queries. Towards addressing the GPU orchestration problem, we build Knots, a GPU-aware resource orchestration layer, and integrate it with the Kubernetes container orchestrator to build Kube-Knots. Kube-Knots can dynamically harvest spare compute cycles through dynamic container orchestration, enabling co-location of latency-critical and batch workloads while improving the overall resource utilization. We design and evaluate two GPU-based scheduling techniques to schedule datacenter-scale workloads through Kube-Knots on a ten-node GPU cluster. Our proposed Correlation Based Prediction (CBP) and Peak Prediction (PP) schemes together improve both average and 99th percentile cluster-wide GPU utilization by up to 80% in case of HPC workloads. In addition, CBP+PP improves the average job completion times (JCT) of deep learning workloads by up to 36% when compared to state-of-the-art schedulers. This leads to 33% cluster-wide energy savings on average for three different workloads compared to state-of-the-art GPU-agnostic schedulers. Further, the proposed PP scheduler guarantees the end-to-end QoS for latency-critical queries by reducing QoS violations by up to 53% when compared to state-of-the-art GPU schedulers.
  5. Datacenters use accelerators to provide the significant compute throughput required by emerging user-facing services. The diurnal user access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization, and prior work has focused on enabling co-location on multicore processors and traditional non-preemptive accelerators. However, current accelerators are evolving towards spatial multitasking and introduce a new set of challenges in eliminating QoS violations. To address this open problem, we explore the underlying causes of QoS violation on spatial multitasking accelerators. In response to these causes, we propose Laius, a runtime system that carefully allocates the computation resource to co-located applications for maximizing the throughput of batch applications while guaranteeing the required QoS of user-facing services. Our evaluation on an Nvidia RTX 2080Ti GPU shows that Laius improves the utilization of spatial multitasking accelerators by 20.8%, while achieving the 99%-ile latency target for user-facing services.