Microservice architecture is the computing paradigm of choice for large, service-oriented software catering to real-time requests. Individual programs in such a system perform Remote Procedure Calls (RPCs) to other microservices to accomplish sub-tasks. Microservices are designed to be robust; top-level requests can succeed despite errors returned from RPC sub-tasks, referred to as non-fatal errors. Because of this design, the top-level microservices tend to "live with" non-fatal errors. Hence, a natural question to ask is "how prevalent are non-fatal errors and what impact do they have on the exposed latency of top-level requests?" In this paper, we present a large-scale study of errors in microservices. We answer the aforementioned question by analyzing 11 billion RPCs covering 1,900 user-facing endpoints at Uber, which serve requests from hundreds of millions of active users. To assess the latency impact of non-fatal errors, we develop a methodology that projects the potential latency savings for a given request as if the time spent on failing APIs were eliminated. This estimator allows ranking and surfacing the APIs that merit further investigation, where the non-fatal errors likely resulted in operational inefficiencies. Finally, we employ our error detection and impact estimation techniques to pinpoint operational inefficiencies, which (a) yield a 30% reduction in the tail latency of a critical endpoint and (b) offer insights into common inefficiency-introducing patterns.
The Tale of Errors in Microservices: Extended Abstract
Microservice architectures have become the de facto paradigm for building scalable, service-oriented systems. Although their decentralized design promotes resilience and rapid development, the inherent complexity leads to subtle performance challenges. In particular, non-fatal errors (internal failures of remote procedure calls that do not cause top-level request failures) can accumulate along the critical path, inflating latency and wasting resources. In this work, we analyze over 11 billion RPCs across more than 6,000 microservices at Uber. Our study shows that nearly 29% of successful requests experience non-fatal errors that remain hidden in traditional monitoring. We propose a novel latency-reduction estimator (LR estimator) to quantify the potential benefit of eliminating these errors. Our contributions include a systematic study of RPC error patterns, a methodology to estimate latency reductions, and case studies demonstrating up to a 30% reduction in tail latency.
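
The abstract does not spell out how the LR estimator is computed. As a rough illustration of the underlying idea (project the latency of a request as if the time spent on failing RPCs were eliminated), here is a minimal sketch under stated assumptions: the `Span` schema and the function names are hypothetical, only the direct child RPCs of one request are considered, and a failing call is assumed to save time only where no successful sibling call is running in parallel. It is not the authors' estimator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One child RPC of a top-level request (hypothetical schema)."""
    name: str
    start_ms: float
    end_ms: float
    failed: bool  # True if the RPC returned a (non-fatal) error


def _merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def _length(intervals):
    """Total time covered by a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def estimated_saving_ms(spans):
    """Time covered by failing RPCs but by no successful RPC.

    Assumption: removing a failing call shortens the request only where
    it is not hidden behind a successful call running at the same time.
    """
    failed = _merge((s.start_ms, s.end_ms) for s in spans if s.failed)
    ok = _merge((s.start_ms, s.end_ms) for s in spans if not s.failed)
    # |failed minus ok| equals |failed union ok| minus |ok|
    return _length(_merge(failed + ok)) - _length(ok)


# Hypothetical request with three child RPCs and a 100 ms total latency.
spans = [
    Span("get_profile", 0.0, 40.0, failed=False),
    Span("get_promo", 30.0, 60.0, failed=True),   # non-fatal error
    Span("render", 80.0, 100.0, failed=False),
]
saving = estimated_saving_ms(spans)
print(f"projected saving: {saving} ms of 100 ms")
```

In this example only the 40 to 60 ms window of the failing call is exposed, so the projected saving is smaller than the call's 30 ms duration. Sorting endpoints by such a projected saving is one way to rank candidates for further investigation.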
- PAR ID: 10662448
- Publisher / Repository: ACM
- Date Published:
- Journal Name: ACM SIGMETRICS Performance Evaluation Review
- Volume: 53
- Issue: 1
- ISSN: 0163-5999
- Page Range / eLocation ID: 61 to 63
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation

