Transient computing has become popular in public cloud environments for running delay-insensitive batch and data processing applications at low cost. Since transient cloud servers can be revoked at any time by the cloud provider, they are considered unsuitable for running interactive applications such as web services. In this paper, we present VM deflation as an alternative mechanism to server preemption for reclaiming resources from transient cloud servers under resource pressure. Using real traces from top-tier cloud providers, we show the feasibility of using VM deflation as a resource reclamation mechanism for interactive applications in public clouds. We show how current hypervisor mechanisms can be used to implement VM deflation and present cluster deflation policies for resource management of transient and on-demand cloud VMs. Experimental evaluation of our deflation system on a Linux cluster shows that microservice-based applications can be deflated by up to 50% with negligible performance overhead. Our cluster-level deflation policies allow overcommitment levels as high as 50%, with less than a 1% decrease in application throughput, and can enable cloud platforms to increase revenue by 30%.
Mitigating Large Response Time Fluctuations through Fast Concurrency Adapting in Clouds
Dynamically reallocating computing resources to handle bursty workloads is a common practice for web applications (e.g., e-commerce) in clouds. However, our empirical analysis of a standard n-tier benchmark application (RUBBoS) shows that simply scaling an n-tier application by reallocating hardware resources, without quickly adapting soft resources (e.g., server threads, connections), may lead to large response time fluctuations. This is because soft resources control the workload concurrency of component servers in the system: adding or removing hardware resources such as Virtual Machines (VMs) can implicitly change the workload concurrency of dependent servers, causing either under- or over-utilization of the critical hardware resource in the system. To quickly identify the optimal soft resource allocation of each server in the system and stabilize response time fluctuations, we propose a novel Scatter-Concurrency-Throughput (SCT) model based on monitoring each server's real-time concurrency and throughput. We then implement a Concurrency-aware system Scaling (ConScale) framework, which integrates the SCT model to quickly adapt the soft resource allocations of key servers during the system scaling process. Our experiments using six realistic bursty workload traces show that ConScale can effectively mitigate the response time fluctuations of the target web application compared to state-of-the-art cloud scaling strategies such as EC2-AutoScaling.
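As a rough illustration of the idea that a server's observed concurrency and throughput can guide its soft resource allocation, the following minimal sketch picks the smallest concurrency level whose average throughput is close to the observed peak. It is not the SCT model itself, whose details the abstract does not give; the function name `estimate_concurrency_limit`, the `tolerance` parameter, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def estimate_concurrency_limit(samples, tolerance=0.05):
    """Return the smallest concurrency whose mean throughput is near the peak.

    samples: iterable of (concurrency, throughput) pairs from monitoring.
    tolerance: hypothetical slack; a level qualifies if its mean throughput
               is at least (1 - tolerance) times the best mean throughput.
    """
    # Group throughput observations by the concurrency level they were seen at.
    by_level = defaultdict(list)
    for concurrency, throughput in samples:
        by_level[concurrency].append(throughput)
    if not by_level:
        raise ValueError("no samples")

    # Average each group, then find the lowest level that is close to the peak.
    avg = {level: mean(values) for level, values in by_level.items()}
    peak = max(avg.values())
    return min(level for level, t in avg.items() if t >= (1 - tolerance) * peak)


# Illustrative (made-up) measurements: throughput saturates around 8 threads
# and degrades at 32.
observed = [(1, 100), (2, 190), (4, 360), (8, 400), (16, 395), (32, 300)]
thread_pool_size = estimate_concurrency_limit(observed)
```

Under these made-up measurements, a thread pool capped near the returned level would avoid both under-utilizing the server (too few threads) and overloading it (too many), which is the trade-off the abstract attributes to soft resources.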
- Award ID(s): 2000681
- PAR ID: 10212864
- Date Published:
- Journal Name: 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Page Range / eLocation ID: 368 to 377
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
-
Deep learning models are increasingly used for end-user applications, supporting both novel features such as facial recognition and traditional features, e.g., web search. To accommodate high inference throughput, it is common to host a single pre-trained Convolutional Neural Network (CNN) in dedicated cloud-based servers with hardware accelerators such as Graphics Processing Units (GPUs). However, GPUs can be orders of magnitude more expensive than traditional Central Processing Unit (CPU) servers. These resources could also be under-utilized when facing dynamic workloads, which may result in inflated serving costs. One potential way to alleviate this problem is by allowing hosted models to share the underlying resources, which we refer to as multi-tenant inference serving. One of the key challenges is maximizing the resource efficiency for multi-tenant serving given hardware with diverse characteristics, models with unique response time Service Level Agreements (SLAs), and dynamic inference workloads. In this paper, we present PERSEUS, a measurement framework that provides the basis for understanding the performance and cost trade-offs of multi-tenant model serving. We implemented PERSEUS in Python atop a popular cloud inference server called Nvidia TensorRT Inference Server. Leveraging PERSEUS, we evaluated the inference throughput and cost for serving various models and demonstrated that multi-tenant model serving led to up to 12% cost reduction.
-
Data center workloads are composed of multiresource jobs requiring a variety of computational resources including CPU cores, memory, disk space, and hardware accelerators. Modern servers can run multiple jobs in parallel, but a set of jobs can only run in parallel if the server has sufficient resources to satisfy the demands of each job. It is generally hard to find sets of jobs that perfectly utilize all server resources, and choosing the wrong set of jobs can lead to low resource utilization. This raises the question of how to allocate resources across a stream of arriving multiresource jobs to minimize the mean response time across jobs, that is, the mean time from when a job arrives to the system until it is complete. Current policies for scheduling multiresource jobs are complex to analyze and hard to implement. We propose a class of simple policies, called Markovian Service Rate (MSR) policies. We show that the class of MSR policies is throughput-optimal, in that if a policy exists that can stabilize the system, then an MSR policy exists that stabilizes the system. We derive bounds on the mean response time under an MSR policy, and show how our bounds can be used to choose an MSR policy that minimizes mean response time.
-
Fast networks and the desire for high resource utilization in data centers and the cloud have driven disaggregation. Application compute is separated from storage, but this leads to high overheads when data must move over the network for simple operations on it. Alternatively, systems could allow applications to run application logic within storage via user-defined functions. Unfortunately, this ties provisioning and utilization of storage and compute resources together again. We present a new approach to executing storage-level functions in an in-memory key-value store that avoids this problem by dynamically deciding where to execute functions over data. Users write storage functions that are logically decoupled from storage, but storage servers choose where to run invocations of these functions physically. By using a server-internal cost model and observing function execution, servers choose to directly run inexpensive functions, while preferring to execute functions with high CPU cost at client machines. We show that with this approach storage servers can reduce network request processing costs, avoid server compute bottlenecks, and improve aggregate storage system throughput. We realize our approach on an in-memory key-value store that executes 3.2 million strict serializable user-defined storage functions per second with 100 µs response times. When running a mix of logic from different applications, it provides throughput better than running that logic purely at storage servers (85% more) or purely at clients (10% more). For our workloads, it also reduces latency (up to 2x) and transactional aborts (up to 33%) over pure client-side execution.
-
Serverless computing is a promising new event-driven programming model that was designed by cloud vendors to expedite the development and deployment of scalable web services on cloud computing systems. Using the model, developers write applications that consist of simple, independent, stateless functions that the cloud invokes on-demand (i.e., elastically), in response to system-wide events (data arrival, messages, web requests, etc.). In this work, we present STOIC (Serverless TeleOperable HybrId Cloud), an application scheduling and deployment system that extends the serverless model in two ways. First, it uses the model in a distributed setting and schedules application functions across multiple cloud systems. Second, STOIC supports serverless function execution using hardware acceleration (e.g., GPU resources) when available from the underlying cloud system. We overview the design and implementation of STOIC and empirically evaluate it using real-world machine learning applications and multi-tier (e.g., edge-cloud) deployments. We find that STOIC’s combined use of edge and cloud resources is able to outperform using either cloud in isolation for the applications and datasets that we consider.

