NSF PAR Search | NSF Public Access Repository

iPrism: Characterize and Mitigate Risk by Quantifying Change in Escape Routes

https://doi.org/10.1109/DSN58291.2024.00027

Cui, Shengkun; Jha, Saurabh; Chen, Ziheng; Kalbarczyk, Zbigniew T; Iyer, Ravishankar K (June 2024, Institute of Electrical and Electronics Engineers)
This paper addresses the challenge of ensuring the safety of autonomous vehicles (AVs, also called ego actors) in real-world scenarios where AVs are constantly interacting with other actors. To address this challenge, we introduce iPrism, which incorporates a new risk metric, the Safety-Threat Indicator (STI). Inspired by how experienced human drivers proactively mitigate hazardous situations, STI quantifies actor-related risks by measuring the changes in escape routes available to the ego actor. To actively mitigate the risk quantified by STI and avert accidents, iPrism also incorporates a reinforcement learning (RL) algorithm, referred to as the Safety-hazard Mitigation Controller (SMC), that learns and implements optimal risk mitigation policies. Our evaluation of the success of the SMC is based on over 4800 NHTSA-based safety-critical scenarios. The results show that (i) STI provides up to 4.9× longer lead time for mitigating accidents compared to widely used safety and planner-centric metrics, (ii) SMC significantly reduces accidents by 37% to 98% compared to a baseline Learning-by-Cheating (LBC) agent, and (iii) in comparison with available state-of-the-art safety hazard mitigation agents, SMC prevents up to 72.7% of accidents that the selected agents are unable to avoid. All code, model weights, and evaluation scenarios and pipelines used in this paper are available at: https://zenodo.org/doi/10.5281/zenodo.10279653.
Full Text Available
Mutiny! How Does Kubernetes Fail, and What Can We Do About It?

https://doi.org/10.1109/DSN58291.2024.00016

Barletta, Marco; Cinque, Marcello; Di Martino, Catello; Kalbarczyk, Zbigniew T; Iyer, Ravishankar K (April 2024, arXiv)

In this paper, we (i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), (ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and (iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of studies on systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the data stored can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/over provisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
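To make the "single bit-flip" failure mode in this abstract concrete, the following is a minimal sketch of how one inverted bit in a serialized record can change a stored value. It is not the paper's injection framework: the record layout, the field names (`replicas`, `owner`), and the helper `flip_bit` are all hypothetical and chosen only for illustration.

```python
import json


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    byte_pos, bit_pos = divmod(bit_index, 8)
    corrupted = bytearray(data)
    corrupted[byte_pos] ^= 1 << bit_pos
    return bytes(corrupted)


# Hypothetical stored object; the field names are illustrative assumptions.
record = json.dumps({"replicas": 3, "owner": "deploy-a"}).encode("utf-8")

# Locate the byte holding the ASCII digit '3' (0x33) and invert its bit 2.
# 0x33 XOR 0x04 = 0x37, which is the ASCII digit '7'.
target_byte = record.index(b"3")
corrupted = flip_bit(record, target_byte * 8 + 2)

before = json.loads(record)["replicas"]
after = json.loads(corrupted)["replicas"]
print(f"replicas before: {before}, after one bit-flip: {after}")
```

In this toy case the record still parses, so the corruption is silent: a consumer reading it would see a different replica count, which loosely mirrors the over-provisioning category mentioned in the abstract.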
Full Text Available
True Attacks, Attack Attempts, or Benign Triggers? An Empirical Measurement of Network Alerts in a Security Operations Center

Yang, Limin; Chen, Zhi; Wang, Chenkai; Zhang, Zhenning; Booma, Sushruth; Cao, Phuong; Adam, Constantin; Withers, Alex; Kalbarczyk, Zbigniew; Iyer, Ravishankar K; et al (August 2024, Proceedings of The 33rd USENIX Security Symposium (USENIX Security))

Full Text Available
AWARE: Automate Workload Autoscaling with Reinforcement Learning in Production Cloud Systems

Qiu, Haoran; Mao, Weichao; Wang, Chen; Franke, Hubertus; Youssef, Alaa; Kalbarczyk, Zbigniew T.; Iyer, Ravishankar K (July 2023, 2023 USENIX Annual Technical Conference (USENIX ATC 23))

Workload autoscaling is widely used in public and private cloud systems to maintain stable service performance and save resources. However, it remains challenging to set the optimal resource limits and dynamically scale each workload at runtime. Reinforcement learning (RL) has recently been proposed and applied in various systems tasks, including resource management. In this paper, we first characterize the state-of-the-art RL approaches for workload autoscaling in a public cloud and point out that there is still a large gap in taking the RL advances to production systems. We then propose AWARE, an extensible framework for deploying and managing RL-based agents in production systems. AWARE leverages meta-learning and bootstrapping to (a) automatically and quickly adapt to different workloads, and (b) provide safe and robust RL exploration. AWARE provides a common OpenAI Gym-like RL interface to agent developers for easy integration with different systems tasks. We illustrate the use of AWARE in the case of workload autoscaling. Our experiments show that AWARE adapts a learned autoscaling policy to new workloads 5.5x faster than the existing transfer-learning-based approach and provides stable online policy-serving performance with less than 3.6% reward degradation. With bootstrapping, AWARE helps achieve 47.5% and 39.2% higher CPU and memory utilization while reducing SLO violations by a factor of 16.9x during policy training.
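As a rough illustration of what a "Gym-like RL interface" for autoscaling could look like, here is a minimal sketch with `reset()` and `step()` methods. It is not AWARE's API: the class `ToyAutoscalingEnv`, its parameters, the utilization model, and the reward are all assumptions invented for this example, and the threshold rule stands in for a learned policy.

```python
from typing import Dict, Tuple


class ToyAutoscalingEnv:
    """Hypothetical environment exposing Gym-style reset() and step()."""

    def __init__(self, demand: float = 4.0, max_cpus: int = 8,
                 target_util: float = 0.8, horizon: int = 10) -> None:
        self.demand = demand            # CPU-equivalents of offered load (assumed constant)
        self.max_cpus = max_cpus        # upper bound on the CPU limit
        self.target_util = target_util  # utilization the reward steers toward
        self.horizon = horizon          # episode length in steps
        self.cpus = 1
        self.t = 0

    def _utilization(self) -> float:
        # Utilization saturates at 1.0 when demand exceeds the allocation.
        return min(1.0, self.demand / self.cpus)

    def reset(self) -> float:
        self.cpus = 1
        self.t = 0
        return self._utilization()

    def step(self, action: int) -> Tuple[float, float, bool, Dict]:
        # action is -1 (scale down), 0 (hold), or +1 (scale up).
        self.cpus = max(1, min(self.max_cpus, self.cpus + action))
        self.t += 1
        obs = self._utilization()
        reward = -abs(obs - self.target_util)  # closer to target is better
        done = self.t >= self.horizon
        return obs, reward, done, {}


def threshold_policy(obs: float, target: float = 0.8) -> int:
    """Stand-in for a learned policy: scale up while over target, else hold."""
    return 1 if obs > target else 0


env = ToyAutoscalingEnv()
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    obs, reward, done, _ = env.step(threshold_policy(obs))
    total_reward += reward
print(f"final CPU limit: {env.cpus}, utilization: {obs:.2f}")
```

The point of the common interface is that an agent only needs `reset()` and `step()`, so the same training loop could, in principle, drive different systems tasks by swapping the environment.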
Full Text Available
SIMPPO: a scalable and incremental online learning framework for serverless resource management

https://doi.org/10.1145/3542929.3563475

Qiu, Haoran; Mao, Weichao; Patke, Archit; Wang, Chen; Franke, Hubertus; Kalbarczyk, Zbigniew T.; Başar, Tamer; Iyer, Ravishankar K. (November 2022, Proceedings of the 13th ACM Symposium on Cloud Computing (SoCC 2022))

Serverless Function-as-a-Service (FaaS) offers improved programmability for customers, yet it is not server-“less” and comes at the cost of more complex infrastructure management (e.g., resource provisioning and scheduling) for cloud providers. To maintain function service-level objectives (SLOs) and improve resource utilization efficiency, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to rule-based solutions with heuristics, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. Despite the initial success of applying RL, we first show in this paper that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.8x higher p99 function latency degradation on multi-tenant serverless FaaS platforms compared to isolated environments and is unable to converge during training. We then design and implement a scalable and incremental multi-agent RL framework based on Proximal Policy Optimization (SIMPPO). Our experiments on widely used serverless benchmarks demonstrate that in multi-tenant environments, SIMPPO enables each RL agent to efficiently converge during training and provides online function latency performance comparable to that of S-RL trained in isolation (which we refer to as the baseline for assessing RL performance) with minor degradation (<9.2%). In addition, SIMPPO reduces the p99 function latency by 4.5x compared to S-RL in multi-tenant cases.
Full Text Available
Reinforcement learning for resource management in multi-tenant serverless platforms

https://doi.org/10.1145/3517207.3526971

Qiu, Haoran; Mao, Weichao; Patke, Archit; Wang, Chen; Franke, Hubertus; Kalbarczyk, Zbigniew T.; Başar, Tamer; Iyer, Ravishankar K. (April 2022, EuroMLSys 2022 - Proceedings of the 2nd European Workshop on Machine Learning and Systems)

Serverless Function-as-a-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-the-art single-agent RL algorithm (S-RL) suffers up to 4.6x higher function tail latency degradation on multi-tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-agent RL algorithm based on Proximal Policy Optimization, i.e., multi-agent PPO (MA-PPO). We show that in multi-tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-tenant cases with less than 10% degradation. In addition, MA-PPO provides a 4.4x improvement in S-RL performance (in terms of function tail latency) in multi-tenant cases.
Full Text Available
BayesPerf: minimizing performance monitoring errors using Bayesian statistics

https://doi.org/10.1145/3445814.3446739

Banerjee, Subho S.; Jha, Saurabh; Kalbarczyk, Zbigniew; Iyer, Ravishankar K. (April 2021, The 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21))
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to nondeterminism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling PCIe transfers.
Full Text Available
Is Function-as-a-Service a Good Fit for Latency-Critical Services?

https://doi.org/10.1145/3493651.3493666

Qiu, Haoran; Jha, Saurabh; Banerjee, Subho S.; Patke, Archit; Wang, Chen; Franke, Hubertus; Kalbarczyk, Zbigniew T.; Iyer, Ravishankar K. (December 2021, WoSC '21: Proceedings of the Seventh International Workshop on Serverless Computing (WoSC7) 2021)

Function-as-a-Service (FaaS) is becoming an increasingly popular cloud-deployment paradigm for serverless computing that frees application developers from managing the infrastructure. At the same time, it allows cloud providers to assert control in workload consolidation, i.e., co-locating multiple containers on the same server, thereby achieving higher server utilization, often at the cost of higher end-to-end function request latency. Interestingly, a key aspect of serverless latency management has not been well studied: the trade-off between application developers' latency goals and the FaaS providers' utilization goals. This paper presents a multi-faceted, measurement-driven study of latency variation in serverless platforms that elucidates this trade-off space. We obtained production measurements by executing FaaS benchmarks on IBM Cloud and a private cloud to study the impact of workload consolidation, queuing delay, and cold starts on the end-to-end function request latency. We draw several conclusions from the characterization results. For example, increasing a container's allocated memory limit from 128 MB to 256 MB reduces the tail latency by 2× but has 1.75× higher power consumption and 59% lower CPU utilization.
Full Text Available
Delay sensitivity-driven congestion mitigation for HPC systems

https://doi.org/10.1145/3447818.3460362

Patke, Archit; Jha, Saurabh; Qiu, Haoran; Brandt, Jim; Gentile, Ann; Greenseid, Joe; Kalbarczyk, Zbigniew; Iyer, Ravishankar K. (June 2021, Proceedings of The 35th ACM International Conference on Supercomputing (ICS '21))
Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. The delay sensitivity of an application quantifies the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism that selectively throttles applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%.
Full Text Available
