Microservice architectures have become the de facto paradigm for building scalable, service-oriented systems. Although their decentralized design promotes resilience and rapid development, the inherent complexity leads to subtle performance challenges. In particular, non-fatal errors (internal failures of remote procedure calls that do not cause the top-level request to fail) can accumulate along the critical path, inflating latency and wasting resources. In this work, we analyze over 11 billion RPCs across more than 6,000 microservices at Uber. Our study shows that nearly 29% of successful requests experience non-fatal errors that remain hidden in traditional monitoring. We propose a novel latency-reduction estimator (LR estimator) to quantify the potential benefit of eliminating these errors. Our contributions include a systematic study of RPC error patterns, a methodology to estimate latency reductions, and case studies demonstrating up to a 30% reduction in tail latency.
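To make the definition of a non-fatal error concrete, the following is a minimal sketch of how the share of successful requests that contain one could be counted over a set of traces. The `Span` type, its fields, and both function names are hypothetical and chosen for illustration; they are not taken from the paper or from any tracing library.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical trace node: one RPC and the RPCs it issued."""
    name: str
    ok: bool
    children: list["Span"] = field(default_factory=list)


def has_non_fatal_error(root: Span) -> bool:
    """True if the top-level request succeeded although some nested RPC failed."""
    if not root.ok:
        return False  # the error was fatal for this request
    stack = list(root.children)
    while stack:
        span = stack.pop()
        if not span.ok:
            return True
        stack.extend(span.children)
    return False


def non_fatal_rate(traces: list[Span]) -> float:
    """Fraction of successful top-level requests that hide a failed RPC."""
    successful = [t for t in traces if t.ok]
    if not successful:
        return 0.0
    return sum(has_non_fatal_error(t) for t in successful) / len(successful)
```

A request whose root span failed is excluded from the denominator, since by the definition above its errors are not non-fatal.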
The Tale of Errors in Microservices
Microservice architecture is the computing paradigm of choice for large, service-oriented software catering to real-time requests. Individual programs in such a system perform Remote Procedure Calls (RPCs) to other microservices to accomplish sub-tasks. Microservices are designed to be robust; top-level requests can succeed despite errors returned from RPC sub-tasks, referred to as non-fatal errors. Because of this design, the top-level microservices tend to "live with" non-fatal errors. Hence, a natural question to ask is "how prevalent are non-fatal errors and what impact do they have on the exposed latency of top-level requests?" In this paper, we present a large-scale study of errors in microservices. We answer this question by analyzing 11 billion RPCs covering 1,900 user-facing endpoints at Uber, which serve requests from hundreds of millions of active users. To assess the latency impact of non-fatal errors, we develop a methodology that projects the potential latency savings for a given request as if the time spent on failing APIs were eliminated. This estimator allows ranking and surfacing the APIs that are worth further investigation, where the non-fatal errors likely resulted in operational inefficiencies. Finally, we employ our error detection and impact estimation techniques to pinpoint operational inefficiencies, which a) result in a 30% tail latency reduction for a critical endpoint and b) offer insights into common inefficiency-introducing patterns.
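The idea of projecting latency savings "as if the time spent on failing APIs were eliminated" can be sketched as follows. This is an illustrative simplification, not the estimator from the paper: it assumes each RPC carries start and end timestamps, takes the outermost failed RPCs under a successful request, merges their time intervals so that overlapping failures are not counted twice, and reports the merged time as a fraction of the request latency. It ignores successful work running concurrently with a failed call, so the result is an optimistic upper bound. The `Rpc` type and all function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rpc:
    """Hypothetical timed trace node; times are in milliseconds."""
    name: str
    start_ms: float
    end_ms: float
    ok: bool
    children: list["Rpc"] = field(default_factory=list)


def failed_intervals(rpc: Rpc) -> list[tuple[float, float]]:
    """Collect (start, end) for the outermost failed RPCs beneath `rpc`."""
    out = []
    for child in rpc.children:
        if not child.ok:
            # Do not descend: nested calls are assumed to fall inside this interval.
            out.append((child.start_ms, child.end_ms))
        else:
            out.extend(failed_intervals(child))
    return out


def merged_length(intervals: list[tuple[float, float]]) -> float:
    """Total length of the union of intervals, counting overlaps once."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def estimate_latency_reduction(root: Rpc) -> float:
    """Optimistic fraction of request latency spent in failed RPCs."""
    duration = root.end_ms - root.start_ms
    if not root.ok or duration <= 0:
        return 0.0
    saved = merged_length(failed_intervals(root))
    return min(saved, duration) / duration
```

Sorting endpoints by this fraction, aggregated over many requests, is one plausible way to rank the APIs whose non-fatal errors merit a closer look; a more faithful estimate would restrict the intervals to the request's critical path.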
- PAR ID: 10662452
- Publisher / Repository: ACM
- Date Published:
- Journal Name: Proceedings of the ACM on Measurement and Analysis of Computing Systems
- Volume: 8
- Issue: 3
- ISSN: 2476-1249
- Page Range / eLocation ID: 1 to 36
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
-
Cloud applications are increasingly relying on hundreds of loosely-coupled microservices to complete user requests that meet an application’s end-to-end QoS requirements. Communication time between services accounts for a large fraction of the end-to-end latency and can introduce performance unpredictability and QoS violations. This work presents our early work on Dagger, a hardware acceleration platform for networking, designed specifically with the unique qualities of microservices in mind. The Dagger architecture relies on an FPGA-based NIC, closely coupled with the processor over a configurable memory interconnect, designed to offload and accelerate RPC stacks. Unlike the traditional cloud systems that use PCIe links as the NIC I/O interface, we leverage memory-interconnected FPGAs as networking devices to provide the efficiency, transparency, and programmability needed for fine-grained microservices. We show that this considerably improves CPU utilization and performance for cloud RPCs.
-
The microservice architecture is a popular software engineering approach for building flexible, large-scale online services. Serverless functions, or function as a service (FaaS), provide a simple programming model of stateless functions which are a natural substrate for implementing the stateless RPC handlers of microservices, as an alternative to containerized RPC servers. However, current serverless platforms have millisecond-scale runtime overheads, making them unable to meet the strict sub-millisecond latency targets required by existing interactive microservices. We present Nightcore, a serverless function runtime with microsecond-scale overheads that provides container-based isolation between functions. Nightcore’s design carefully considers various factors having microsecond-scale overheads, including scheduling of function requests, communication primitives, threading models for I/O, and concurrent function executions. Nightcore currently supports serverless functions written in C/C++, Go, Node.js, and Python. Our evaluation shows that when running latency-sensitive interactive microservices, Nightcore achieves 1.36×–2.93× higher throughput and up to 69% reduction in tail latency.
-
We report a series of shape‐persistent molecular nanotubes with top rim connectivity traversing from an all‐meta‐ (m4) to an all‐para‐phenylene (p4) bridged species, including all possible members in between them. Single‐crystal X‐ray diffraction (SCXRD) and microcrystal electron diffraction (MicroED) data show a large torsional angle for meta‐phenylenes relative to para‐phenylene rings. Density functional theory (DFT) calculations reproduce the experimental torsional angles and also establish a correlation indicating a gradual increase in strain energy from m4 (∼31 kcal mol⁻¹) to p4 (∼90 kcal mol⁻¹). Structural transitions from m4 to p4 lead to additional correlations such as a shift in the lowest absorption wavelength from 330 to 394 nm, a sizeable red shift in the maximum emission wavelength from 444 to 546 nm, and a decrease in fluorescence quantum yield from 0.76 to 0.20, respectively. Time‐dependent (TD)‐DFT analysis of the relaxed excited state (S1’) geometry shows a progression of exciton delocalization as para‐phenylenes are introduced into m4 en route to p4, while the overall molecular size remains constant. This effect is directly related to increased π‐conjugation within the nanotube's top‐segment and demonstrates how exciton trapping can take place without changing the nanotube's physical size, e.g., diameter and length.
-
User-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16× while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11×.

