skip to main content


Title: Sage: Leveraging ML To Diagnose Unpredictable Performance in Cloud Microservices
Cloud applications are increasingly shifting from large monolithic services, to complex graphs of loosely-coupled microservices. Despite their advantages, microservices also introduce cascading QoS violations in cloud applications, which are difficult to diagnose and correct. We present Sage, a ML-driven root cause analysis system for interactive cloud microservices. Sage leverages unsupervised learning models to circumvent the overhead of trace labeling, determines the root cause of unpredictable performance online, and applies corrective actions to restore performance. On experiments on both dedicated local clusters and large GCE clusters we show that Sage achieves high root cause detection accuracy and predictable performance.  more » « less
Award ID(s):
1846046
NSF-PAR ID:
10165232
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
ML for Computer Architecture and Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Cloud applications are increasingly shifting from large monolithic services to complex graphs of loosely-coupled microservices. Despite their benefits, microservices are prone to cascading performance issues, and can lead to prolonged periods of degraded performance. We present Sage, a machine learning-driven root cause analysis system for interactive cloud microservices that is both accurate and practical. We show that Sage correctly identifies the root causes of performance issues across a diverse set of microservices and takes action to address them, leading to more predictable, performant, and efficient cloud systems. 
    more » « less
  2. Cloud applications are increasingly relying on hundreds of loosely-coupled microservices to complete user requests that meetan application’s end-to-end QoS requirements. Communication time between services accounts for a large fraction of the end-to-endlatency and can introduce performance unpredictability and QoS violations. This work presents our early work onDagger, a hardwareacceleration platform for networking, designed specifically with the unique qualities of microservices in mind. The Dagger architecturerelies on an FPGA-based NIC, closely coupled with the processor over a configurable memory interconnect, designed to offload andaccelerate RPC stacks. Unlike the traditional cloud systems that use PCIe links as the NIC I/O interface, we leverage memory-interconnectedFPGAs as networking devices to provide the efficiency, transparency, and programmability needed for fine-grained microservices. We showthat this considerably improves CPU utilization and performance for cloud RPCs. 
    more » « less
  3. Cloud-native microservice applications use different communication paradigms to network microservices, including both synchronous and asynchronous I/O for exchanging data. Existing solutions depend on kernel-based networking, incurring significant overheads. The interdependence between microservices for these applications involves considerable communication, including contention between multiple concurrent flows or user sessions. In this paper, we design X-IO, a high-performance unified I/O interface that is built on top of shared memory processing with lock-free producer/consumer rings, eliminating kernel networking overheads and contention. X-IO offers a feature-rich interface. X-IO’s zero-copy interface supports building provides truly zero-copy data transfers between microservices, achieving high performance. X-IO also provides a POSIX-like socket interface using HTTP/REST API to achieve seamless porting of microservices to X-IO, without any change to the application code. X-IO supports concurrent connections for microservices that require distinct user sessions operating in parallel. Our preliminary experimental results show that X-IO’s zero-copy interfaces achieve 2.8x-4.1x performance improvement compared to kernel-based interfaces. Its socket interfaces outperform kernel TCP sockets and achieve performance close to UNIX-domain sockets. The HTTP/REST APIs in X-IO perform 1.4 x-2.3 x better than kernel-based alternatives with concurrent connections. 
    more » « less
  4. Cloud applications are increasingly shifting to interactive and loosely-coupled microservices. Despite their advantages, microservices complicate resource management, due to inter-tier dependencies. We present Sinan, a cluster manager for interactive microservices that leverages easily-obtainable tracing data instead of empirical decisions, to infer the impact of a resource allocation on end-to-end performance, and allocate appropriate resources to each tier. In a preliminary evaluation of Sinan with an end-to-end social network built with microservices, we show that Sinan’s data-driven approach, allows the service to always meet its QoS without sacrificing resource efficiency. 
    more » « less
  5. Cloud applications are increasingly shifting to interactive and loosely-coupled microservices. Despite their advantages, microservices complicate resource management, due to inter-tier dependencies. We present Sinan and PuppetMaster, two cluster managers for interactive microservices that leverages easily-obtainable tracing data instead of empirical decisions, to infer the impact of a resource allocation on end-to-end performance, and allocate appropriate resources to each tier. In a preliminary evaluation of the system with an end-to-end social network built with microservices, we show that the cluster manager's data-driven approach allows the service to always meet its QoS without sacrificing resource efficiency. 
    more » « less