skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Detecting Performance Anomalies in Cloud Platform Applications
We present Roots, a full-stack monitoring and analysis system for performance anomaly detection and bottleneck identification in cloud platform-as-a-service (PaaS) systems. Roots facilitates application performance monitoring as a core capability of PaaS clouds, and relieves the developers from having to instrument application code. Roots tracks HTTP/S requests to hosted cloud applications and their use of PaaS services. To do so it employs lightweight monitoring of PaaS service interfaces. Roots processes this data in the background using multiple statistical techniques that in combination detect performance anomalies (i.e. violations of service-level objectives). For each anomaly, Roots determines whether the event was caused by a change in the request workload or by a performance bottleneck in a PaaS service. By correlating data collected across different layers of the PaaS, Roots is able to trace high-level performance anomalies to bottlenecks in specific components in the cloud platform. We implement Roots using the AppScale PaaS and evaluate its overhead and accuracy.  more » « less
Award ID(s):
1703560
PAR ID:
10057652
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE Transactions on Cloud Computing
ISSN:
2168-7161
Page Range / eLocation ID:
1 to 1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We introduce a novel methodology for anomaly detection in time-series data. The method uses persistence diagrams and bottleneck distances to identify anomalies. Specifically, we generate multiple predictors by randomly bagging the data (reference bags), then for each data point replacing the data point for a randomly chosen point in each bag (modified bags). The predictors then are the set of bottleneck distances for the reference/modified bag pairs. We prove the stability of the predictors as the number of bags increases. We apply our methodology to traffic data and measure the performance for identifying known incidents. 
    more » « less
  2. Today's serverless provides "function-as-a-service" with dynamic scaling and fine-grained resource charging, enabling new cloud applications. Serverless functions are invoked as a best-effort service. We propose an extension to serverless, called real-time serverless that provides an invocation rate guarantee, a service-level objective (SLO) specified by the application, and delivered by the underlying implementation. Real-time serverless allows applications to guarantee real-time performance. We study real-time serverless behavior analytically and empirically to characterize its ability to support bursty, real-time cloud and edge applications efficiently. Finally, we use a case study, traffic monitoring, to illustrate the use and benefits of real-time serverless, on our prototype implementation. 
    more » « less
  3. Abstract The CMS detector is a general-purpose apparatus that detects high-energy collisions produced at the LHC. Online data quality monitoring of the CMS electromagnetic calorimeter is a vital operational tool that allows detector experts to quickly identify, localize, and diagnose a broad range of detector issues that could affect the quality of physics data. A real-time autoencoder-based anomaly detection system using semi-supervised machine learning is presented enabling the detection of anomalies in the CMS electromagnetic calorimeter data. A novel method is introduced which maximizes the anomaly detection performance by exploiting the time-dependent evolution of anomalies as well as spatial variations in the detector response. The autoencoder-based system is able to efficiently detect anomalies, while maintaining a very low false discovery rate. The performance of the system is validated with anomalies found in 2018 and 2022 LHC collision data. In addition, the first results from deploying the autoencoder-based system in the CMS online data quality monitoring workflow during the beginning of Run 3 of the LHC are presented, showing its ability to detect issues missed by the existing system. 
    more » « less
  4. In the last few years, Cloud computing technology has benefited many organizations that have embraced it as a basis for revamping the IT infrastructure. Cloud computing utilizes Internet capabilities in order to use other computing resources. Amazon Web Services (AWS) is one of the most widely used cloud providers that leverages the endless computing capabilities that the cloud technology has to offer. AWS is continuously evolving to offer a variety of services, including but not limited to, infrastructure as a service (IaaS), platform as a service (PaaS) and packaged software as a service. Among the other important services offered by AWS is Video Surveillance as a Service (VSaaS) that is a hosted cloud-based video surveillance service. Even though this technology is complex and widely used, some security experts have pointed out that some of its vulnerabilities can be exploited in launching attacks aimed at cloud technologies. In this paper, we present a holistic security analysis of cloud-based video surveillance systems by examining the vulnerabilities, threats, and attacks that these technologies are susceptible to. We illustrate our findings by implementing several of these attacks on a test bed representing an AWS-based video surveillance system. The main contributions of our paper are: (1) we provided a holistic view of the security model of cloud based video surveillance summarizing the underlying threats, vulnerabilities and mitigation techniques (2) we proposed a novel taxonomy of attacks targeting such systems (3) we implemented several related attacks targeting cloud-based video surveillance system based on an AWS test environment and provide some guidelines for attack mitigation. The outcome of the conducted experiments showed that the vulnerabilities of the Internet Protocol (IP) and other protocols granted access to unauthorized VSaaS files. We aim that our proposed work on the security of cloud-based video surveillance systems will serve as a reference for cybersecurity researchers and practitioners who aim to conduct research in this field. 
    more » « less
  5. Debugging in production cloud systems (or live debugging) is a critical yet challenging task for on-call developers due to the financial impact of cloud service downtime and the inherent complexity of cloud systems. Unfortunately, how debugging is performed, and the unique challenges faced in the production cloud environment have not been investigated in detail. In this paper, we perform the first fine-grained, observational study of 93 real-world debugging experiences of production cloud failures in 15 widely adopted open-source distributed systems including distributed storage systems, databases, computing frameworks, message passing systems, and container orchestration systems. We examine each debugging experience with a fine-grained lens and categorize over 1700 debugging steps across all incidents. Our study provides a detailed picture of how developers perform various diagnosis activities including failure reproduction, anomaly analysis, program analysis, hypothesis formulation, information collection and online experiments. Highlights of our study include: (1) Analyses of the taxonomies and distributions of both live debugging activities and the underlying reasons for hypothesis forking, which confirm the presence of expert debugging strategies in production cloud systems, and offer insights to guide the training of novice developers and the development of tools that emulate expert behavior. (2) The identification of the primary challenge in anomaly detection (or, observability) for end-to-end debugging: the collection of system-specific data (17.1% of data collected). In comparison, nearly all (96%) invariants utilized to detect anomalies are already present in existing monitoring tools. (3) The identification of the importance of online interventions (i.e., in-production experiments that alter system execution) for live debugging - they are performed as frequently as information collection - with an investigation of different types of interventions and challenges. (4) An examination of novel debugging techniques developers utilized to overcome debugging challenges inherent to or amplified in cloud systems, which offer insights for the development of enhanced debugging tools. 
    more » « less