The Tale of Errors in Microservices

Lee, I-Ting Angelina (ORCID:0000000206875508); Zhang, Zhizhou (ORCID:0000000255176308); Parwal, Abhishek (ORCID:0009000840213541); Chabbi, Milind (ORCID:0000000310217644)

doi:10.1145/3700436

Microservice architecture is the computing paradigm of choice for large, service-oriented software catering to real-time requests. Individual programs in such a system perform Remote Procedure Calls (RPCs) to other microservices to accomplish sub-tasks. Microservices are designed to be robust; top-level requests can succeed despite errors returned from RPC sub-tasks, referred to as non-fatal errors. Because of this design, the top-level microservices tend to "live with" non-fatal errors. Hence, a natural question to ask is "how prevalent are non-fatal errors and what impact do they have on the exposed latency of top-level requests?" In this paper, we present a large-scale study of errors in microservices. We answer the aforementioned question by analyzing 11 billion RPCs covering 1,900 user-facing endpoints at Uber, which serves requests from hundreds of millions of active users. To assess the latency impact of non-fatal errors, we develop a methodology that projects the potential latency savings for a given request as if the time spent on failing APIs were eliminated. This estimator allows ranking and bubbling up those APIs that are worthy of further investigation, where the non-fatal errors likely resulted in operational inefficiencies. Finally, we employ our error detection and impact estimation techniques to pinpoint operational inefficiencies, which a) result in a tail latency reduction of a critical endpoint by 30% and b) offer insights into common inefficiency-introducing patterns.
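To make the idea of projecting latency savings concrete, the following is a minimal sketch and not the estimator developed in the paper. It assumes a simplified model in which a request issues a flat list of RPCs, the caller does no work of its own while waiting, and the projected saving is the wall-clock time during which only failing RPCs were in flight. The names `RpcCall` and `projected_savings_ms` and the example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RpcCall:
    """One RPC issued while serving a request (hypothetical record shape)."""
    api: str
    start_ms: float
    end_ms: float
    failed: bool  # True if the RPC returned a non-fatal error


def _merge(intervals):
    """Merge overlapping (start, end) intervals into disjoint ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def _total(intervals):
    """Total length covered by a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def projected_savings_ms(calls):
    """Time covered by failing calls but by no successful call.

    Assumption: time hidden behind a concurrent successful call would
    not be saved by removing the failing call, so it is excluded.
    Uses |failed \\ ok| = |failed U ok| - |ok|.
    """
    failed = [(c.start_ms, c.end_ms) for c in calls if c.failed]
    ok = [(c.start_ms, c.end_ms) for c in calls if not c.failed]
    return _total(_merge(failed + ok)) - _total(_merge(ok))


# Made-up example: two failing calls partly overlap successful ones.
example = [
    RpcCall("GetProfile", 0, 40, failed=False),
    RpcCall("GetPromo", 30, 70, failed=True),
    RpcCall("GetBadge", 60, 90, failed=True),
    RpcCall("GetFare", 85, 120, failed=False),
]
print(projected_savings_ms(example))
```

In this example the failing calls span 30 ms to 90 ms, but only the window from 40 ms to 85 ms is not hidden behind a successful call. Summing such per-request projections by API would be one way to rank APIs for further investigation, again under the assumptions stated above.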
