skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: It's not a Sprint, it's a Marathon: Stretching Multi-resource Burstable Performance in Public Clouds
During the past few years, all leading cloud providers introduced burstable instances that can sprint their performance for a limited period to address sudden workload variations. Despite the availability of burstable instances, there is no clear understanding of how to minimize the waste of resources by regulating their burst capacity to the workload requirements. This is especially true when it comes to non-CPU-intensive applications. In this paper, we investigate how to limit network and I/O usage to optimize the efficiency of the bursting process. We also study which resource shall be controlled to benefit both cloud providers and end-users. We design MRburst (Multi-Resource burstable performance scheduler) to automatically limit multiple resources (i.e., network, I/O, and CPU) and make the application comply with a user-defined service level objective (SLO) while minimizing wasted resources. MRburst is evaluated on Amazon EC2 using two multi-resource applications: an FTP server and a Ceph system. Experimental results show that MRburst outperforms state-of-the-art approaches by allowing instances to speed up their performance for up to 2.4 times longer period while meeting SLO.  more » « less
Award ID(s):
1838024 1756013
PAR ID:
10129636
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 20th International Middleware Conference (Middleware 2019)
Page Range / eLocation ID:
36 to 42
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Nearly all principal cloud providers now provide burstable instances in their offerings. The main attraction of this type of instance is that it can boost its performance for a limited time to cope with workload variations. Although burstable instances are widely adopted, it is not clear how to efficiently manage them to avoid waste of resources. In this paper, we use predictive data analytics to optimize the management of burstable instances. We design CEDULE+, a data-driven framework that enables efficient resource management for burstable cloud instances by analyzing the system workload and latency data. CEDULE+ selects the most profitable instance type to process incoming requests and controls CPU, I/O, and network usage to minimize the resource waste without violating Service Level Objectives (SLOs). CEDULE+ uses lightweight profiling and quantile regression to build a data-driven prediction model that estimates system performance for all combinations of instance type, resource type, and system workload. CEDULE+ is evaluated on Amazon EC2, and its efficiency and high accuracy are assessed through real-case scenarios. CEDULE+ predicts application latency with errors less than 10%, extends the maximum performance period of a burstable instance up to 2.4 times, and decreases deployment costs by more than 50%. 
    more » « less
  2. Infrastructure-as-a-Service cloud providers sell virtual machines that are only specified in terms of number of CPU cores, amount of memory, and I/O throughput. Performance-critical aspects such as cache sizes and memory latency are missing or reported in ways that make them hard to compare across cloud providers. It is difficult for users to adapt their application’s behavior to the available resources. In this work, we aim to increase the visibility that cloud users have into shared resources on public clouds. Specifically, we present CacheInspector , a lightweight runtime that determines the performance and allocated capacity of shared caches on multi-tenant public clouds. We validate CacheInspector ’s accuracy in a controlled environment, and use it to study the characteristics and variability of cache resources in the cloud, across time, instances, availability regions, and cloud providers. We show that CacheInspector ’s output allows cloud users to tailor their application’s behavior, including their output quality, to avoid suboptimal performance when resources are scarce. 
    more » « less
  3. Cloud computing has motivated renewed interest in resource allocation problems with new consumption models. A common goal is to share a resource, such as CPU or I/O bandwidth, among distinct users with different demand patterns as well as different quality of service requirements. To ensure these service requirements, cloud offerings often come with a service level agreement (SLA) between the provider and the users. A SLA specifies the amount of a resource a user is entitled to utilize. In many cloud settings, providers would like to operate resources at high utilization while simultaneously respecting individual SLAs. There is typically a trade-off between these two objectives; for example, utilization can be increased by shifting away resources from idle users to “scavenger” workload, but with the risk of the former then becoming active again. We study this fundamental tradeoff by formulating a resource allocation model that captures basic properties of cloud computing systems, including SLAs, highly limited feedback about the state of the system, and variable and unpredictable input sequences. Our main result is a simple and practical algorithm that achieves near-optimal performance on the above two objectives. First, we guarantee nearly optimal utilization of the resource even if compared with the omniscient offline dynamic optimum. Second, we simultaneously satisfy all individual SLAs up to a small error. The main algorithmic tool is a multiplicative weight update algorithm and a primal-dual argument to obtain its guarantees. We also provide numerical validation on real data to demonstrate the performance of our algorithm in practical applications. 
    more » « less
  4. High performance computing (HPC) is undergoing significant changes. The emerging HPC applications comprise both compute- and data-intensive applications. To meet the intense I/O demand from emerging data-intensive applications, burst buffers are deployed in production systems. Existing HPC schedulers are mainly CPU-centric. The extreme heterogeneity of hardware devices, combined with workload changes, forces the schedulers to consider multiple resources (e.g., burst buffers) beyond CPUs, in decision making. In this study, we present a multi-resource scheduling scheme named BBSched that schedules user jobs based on not only their CPU requirements, but also other schedulable resources such as burst buffer. BBSched formulates the scheduling problem into a multi-objective optimization (MOO) problem and rapidly solves the problem using a multi-objective genetic algorithm. The multiple solutions generated by BBSched enables system managers to explore potential tradeoffs among various resources, and therefore obtains better utilization of all the resources. The trace-driven simulations with real system workloads demonstrate that BBSched improves scheduling performance by up to 41% compared to existing methods, indicating that explicitly optimizing multiple resources beyond CPUs is essential for HPC scheduling. 
    more » « less
  5. null (Ed.)
    User-facing latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing of compute resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO violations via dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16Å~ while reducing the overall requested CPU limit by up to 62%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11Å~. 
    more » « less