Title: Detecting Ephemeral Optical Events with OpTel
Degradation or failure events in optical backbone networks affect the service-level agreements of cloud services. It is critical to detect and troubleshoot these events promptly to minimize their impact. Existing telemetry systems rely on arcane tools (e.g., SNMP) and vendor-specific controllers to collect optical data, which limits both their flexibility and their scale. As a result, they fail to collect the data required to detect and troubleshoot degradation or failure events in a timely fashion. This paper presents the design and implementation of OpTel, an optical telemetry system that uses a centralized, vendor-agnostic controller to collect optical data in a streaming fashion. More specifically, it offers flexible vendor-agnostic interfaces between the optical devices and the controller and offloads data-management tasks (e.g., creating a queryable database) from the devices to the controller. As a result, OpTel enables the collection of fine-grained optical telemetry data at one-second granularity. It has been running in Tencent's optical backbone network for the past six months. The fine-grained data collection enables the detection of short-lived (i.e., ephemeral) events. Compared to existing telemetry systems, OpTel accurately detects 2x more optical events. It also enables troubleshooting of these optical events in a few seconds, which is orders of magnitude faster than the state of the art.
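As a rough illustration of the collection model the abstract describes (a centralized controller pulling data from devices over vendor-agnostic interfaces once per second and keeping it in a queryable database), the following is a minimal sketch, not OpTel's actual implementation. The device endpoints, metric names, and the HTTP/JSON transport are assumptions made for the example.

```python
# Hypothetical sketch: a centralized controller polls vendor-agnostic device
# endpoints every second and stores samples in a queryable SQLite database.
import json
import sqlite3
import time
import urllib.request

DEVICES = ["http://10.0.0.1/metrics", "http://10.0.0.2/metrics"]  # hypothetical endpoints

def poll_device(url: str) -> dict:
    """Fetch one JSON sample of optical metrics (e.g., rx power, OSNR) from a device."""
    with urllib.request.urlopen(url, timeout=0.5) as resp:
        return json.load(resp)

def main() -> None:
    db = sqlite3.connect("optical_telemetry.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "ts REAL, device TEXT, rx_power_dbm REAL, osnr_db REAL)"
    )
    while True:
        start = time.time()
        for url in DEVICES:
            try:
                m = poll_device(url)
                db.execute(
                    "INSERT INTO samples VALUES (?, ?, ?, ?)",
                    (start, url, m.get("rx_power_dbm"), m.get("osnr_db")),
                )
            except OSError:
                pass  # missed sample; gaps themselves can indicate trouble
        db.commit()
        # Pace the loop to a one-second collection granularity.
        time.sleep(max(0.0, 1.0 - (time.time() - start)))

if __name__ == "__main__":
    main()
```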
Award ID(s): 2003257
PAR ID: 10346332
Journal Name: 19th USENIX Symposium on Networked Systems Design and Implementation
Page Range / eLocation ID: 339-353
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
1. Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As networks grow and telemetry gets more fine-grained, the amount of data that must be reported from switches to collectors to maintain a network-wide view grows tremendously. As a consequence, data collection systems are increasingly hard to scale. We introduce Direct Telemetry Access (DTA), a solution optimized for aggregating and moving hundreds of millions of reports per second from switches into queryable data structures in collectors' memory. DTA is lightweight and greatly reduces overheads at collectors. It is built on top of RDMA, and we propose novel, expressive reporting primitives that allow easy integration with existing state-of-the-art telemetry mechanisms such as INT or Marple. We show that DTA significantly improves telemetry collection rates. For example, when used with INT, it can collect and aggregate over 400M reports per second with a single server, improving over the Atomic MultiLog by up to 16x.
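The core idea of keeping per-key reports in fixed, directly addressable slots so the collector never touches each report individually can be illustrated with a toy host-side emulation. This is not DTA's RDMA data path, and the class and key names are invented for the example; the real system has switches write reports into collector memory over RDMA.

```python
# Toy emulation of keyed report aggregation: each report carries a key
# (e.g., a flow or switch ID) and overwrites a fixed slot indexed by a hash
# of that key, so the latest value is queryable without per-report processing.
import zlib

N_SLOTS = 1 << 20  # preallocated slots standing in for collector memory

class KeyWriteStore:
    def __init__(self, n_slots: int = N_SLOTS) -> None:
        self.keys = [None] * n_slots
        self.values = [None] * n_slots

    def _slot(self, key: bytes) -> int:
        return zlib.crc32(key) % len(self.keys)

    def write(self, key: bytes, value: int) -> None:
        """Apply one report: overwrite the slot for this key (last-writer-wins)."""
        i = self._slot(key)
        self.keys[i] = key
        self.values[i] = value

    def query(self, key: bytes):
        """Return the most recent value reported for this key, if still resident."""
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else None

store = KeyWriteStore()
store.write(b"switch7:queue3", 412)    # e.g., an INT-style queue-depth report
print(store.query(b"switch7:queue3"))  # -> 412
```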
2. Logging is a universal approach to recording important events in the system workflows of distributed systems. Current log analysis tools ignore the semantic knowledge that is key to workflow construction and analysis. In addition, they focus on infrastructure-level distributed systems; because of fundamental differences in log features, they are ineffective in distributed data analytics systems. This paper proposes IntelLog, a semantic-aware, non-intrusive workflow reconstruction tool for distributed data analytics systems. It builds hierarchical relationships between components and events from logs generated by the targeted systems with little or even no domain knowledge. Leveraging natural language processing, IntelLog automatically extracts and formats the semantic information in each log message, including system events, identifiers, locality information, and metric values. It builds a graph to represent the hierarchical relationships of components in the targeted system via nomenclature conventions. We implement IntelLog for Hadoop MapReduce, Spark, and Tez. Evaluation results show that IntelLog provides a fine-grained view of system workflows with semantics. It outperforms existing tools in automatically detecting anomalies caused by real-world problems, misconfigurations, and system bugs. Users can query the formatted semantic knowledge to understand and further troubleshoot the systems.
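To make the kind of extraction described above concrete, here is a simplified, hypothetical sketch: it pulls identifiers and metric values out of one Hadoop-style log line and recovers the component hierarchy from the identifier's naming convention. The regexes and the example log line are assumptions; IntelLog itself relies on NLP-based extraction rather than hand-written patterns.

```python
# Simplified illustration: extract identifiers and metrics from a log message,
# then derive the component hierarchy from the identifier's nomenclature.
import re

LOG = "INFO TaskAttempt attempt_1492_0003_m_000001_0 finished, bytesRead=1048576"

# Identifier whose prefix encodes its parent components (application, job, task).
ID_RE = re.compile(r"attempt_(\d+)_(\d+)_m_(\d+)_\d+")
METRIC_RE = re.compile(r"(\w+)=(\d+)")

def extract(line: str) -> dict:
    out = {"identifiers": {}, "metrics": {}}
    m = ID_RE.search(line)
    if m:
        app, job, task = m.groups()
        # The naming convention gives the hierarchy directly: app -> job -> task.
        out["identifiers"] = {"application": app, "job": job, "task": task}
    for name, value in METRIC_RE.findall(line):
        out["metrics"][name] = int(value)
    return out

print(extract(LOG))
# {'identifiers': {'application': '1492', 'job': '0003', 'task': '000001'},
#  'metrics': {'bytesRead': 1048576}}
```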
3. Debugging in production cloud systems (or live debugging) is a critical yet challenging task for on-call developers due to the financial impact of cloud service downtime and the inherent complexity of cloud systems. Unfortunately, how debugging is performed, and the unique challenges faced in the production cloud environment, have not been investigated in detail. In this paper, we perform the first fine-grained, observational study of 93 real-world debugging experiences of production cloud failures in 15 widely adopted open-source distributed systems, including distributed storage systems, databases, computing frameworks, message-passing systems, and container orchestration systems. We examine each debugging experience with a fine-grained lens and categorize over 1700 debugging steps across all incidents. Our study provides a detailed picture of how developers perform various diagnosis activities, including failure reproduction, anomaly analysis, program analysis, hypothesis formulation, information collection, and online experiments. Highlights of our study include: (1) analyses of the taxonomies and distributions of both live debugging activities and the underlying reasons for hypothesis forking, which confirm the presence of expert debugging strategies in production cloud systems and offer insights to guide the training of novice developers and the development of tools that emulate expert behavior; (2) the identification of the primary challenge in anomaly detection (or observability) for end-to-end debugging: the collection of system-specific data (17.1% of data collected), whereas nearly all (96%) invariants used to detect anomalies are already present in existing monitoring tools; (3) the identification of the importance of online interventions (i.e., in-production experiments that alter system execution) for live debugging - they are performed as frequently as information collection - together with an investigation of different types of interventions and their challenges; and (4) an examination of the novel debugging techniques developers use to overcome debugging challenges inherent to, or amplified in, cloud systems, which offer insights for the development of enhanced debugging tools.
4. Telemetry systems are widely used to collect data from distributed endpoints, analyze the data in conjunction to gain valuable insights, and store it for historical analytics. These systems consist of four stages (Figure 1): collection, transmission, analysis, and storage. Collectors at the endpoints gather various types of data, which are then transmitted to a central server for analysis. The data is used for multiple downstream tasks, such as dashboard monitoring and anomaly detection. Finally, it is stored in long-term storage to aid retrospective analytics and debugging.
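As a minimal sketch of the four stages just listed, the following toy pipeline collects one sample at an endpoint, "transmits" it as a serialized payload, analyzes it with a simple threshold check, and appends it to long-term storage. All function bodies, metric names, and the file path are stand-ins chosen for illustration, not part of any particular telemetry system.

```python
# Toy end-to-end pass through the four telemetry stages.
import json
import time

def collect() -> dict:                        # stage 1: collection at the endpoint
    return {"ts": time.time(), "cpu_util": 0.93}

def transmit(sample: dict) -> bytes:          # stage 2: transmission to the collector
    return json.dumps(sample).encode()        # stand-in for a network send

def analyze(payload: bytes) -> None:          # stage 3: analysis (e.g., anomaly check)
    sample = json.loads(payload)
    if sample["cpu_util"] > 0.9:
        print("alert: high CPU", sample)

def store(payload: bytes, log_path: str = "telemetry.log") -> None:
    with open(log_path, "ab") as f:           # stage 4: long-term storage
        f.write(payload + b"\n")

payload = transmit(collect())
analyze(payload)
store(payload)
```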