Title: The Price Is (Not) Right: Reflections on Pricing for Transient Cloud Servers
Amazon introduced spot instances in December 2009, enabling “customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price.” Amazon’s real-time computational spot market was novel in multiple respects. For example, it was the first (and to date only) large-scale public implementation of market-based resource allocation based on dynamic pricing after decades of research, and it provided users with useful information, control knobs, and options for optimizing the cost of running cloud applications. Spot instances also introduced the concept of transient cloud servers derived from variable idle capacity that cloud platforms could revoke at any time. Transient servers have since become central to efficient resource management of modern clusters and clouds. As a result, Amazon’s spot market motivated substantial research over the past decade. Yet, in November 2017, Amazon effectively ended its real-time spot market by announcing that users no longer needed to place bids and that spot prices would “...adjust more gradually, based on longer-term trends in supply and demand.” The changes made spot instances more similar to the fixed-price transient servers offered by other cloud platforms. Unfortunately, while these changes made spot instances less complex, they eliminated many benefits to sophisticated users in optimizing their applications. This paper provides a retrospective on Amazon’s real-time spot market, including its advantages and disadvantages for allocating transient servers compared to current fixed-price approaches. We also discuss some fundamental problems with Amazon’s spot market, identified in our prior work from 2016, that predicted its eventual end. We then discuss potential options for allocating transient servers that combine the advantages of Amazon’s real-time spot market while addressing the problems that likely led to its elimination.
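As a concrete illustration of the bidding mechanism quoted above, the sketch below simulates how long a spot instance survives, and what it costs, under a given bid and a hypothetical hourly price trace. The prices, the bid, and the function name are illustrative assumptions, not artifacts of the paper.

```python
# Illustrative sketch (not from the paper): cost and revocation under Amazon's
# original spot model, where an instance runs only while the user's bid
# exceeds the current spot price. Prices and bid below are made up.

def spot_run_cost(price_trace, bid):
    """Return (hours_run, total_cost, revoked) for a per-hour price trace.

    The instance runs each hour the spot price is at or below the bid and
    is revoked the first hour the price rises above it.
    """
    hours_run, total_cost = 0, 0.0
    for price in price_trace:
        if price > bid:           # bid no longer exceeds the spot price -> revocation
            return hours_run, total_cost, True
        hours_run += 1
        total_cost += price       # the original market charged the spot price, not the bid
    return hours_run, total_cost, False

# Hypothetical hourly spot prices
trace = [0.031, 0.033, 0.029, 0.045, 0.061, 0.038]
print(spot_run_cost(trace, bid=0.05))   # -> (4, 0.138, True)
```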
Award ID(s):
1802523
PAR ID:
10145403
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
2019 28th International Conference on Computer Communication and Networks (ICCCN)
Page Range / eLocation ID:
1 to 9
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Cloud platforms offer the same VMs under many purchasing options that specify different costs and time commitments, such as on-demand, reserved, sustained-use, scheduled reserve, transient, and spot block. In general, the stronger the commitment, i.e., longer and less flexible, the lower the price. However, longer and less flexible time commitments can increase cloud costs for users if future workloads cannot utilize the VMs they committed to buying. Large cloud customers often find it challenging to choose the right mix of purchasing options to reduce their long-term costs, while retaining the ability to adjust capacity up and down in response to workload variations. To address the problem, we design policies to optimize long-term cloud costs by selecting a mix of VM purchasing options based on short- and long-term expectations of workload utilization. We consider a batch trace spanning 4 years from a large shared cluster for a major state university system that includes 14k cores and 60 million job submissions, and evaluate how these jobs could be judiciously executed on cloud servers using our approach. Our results show that our policies incur a cost within 41% of an optimistic optimal offline approach, and 50% less than solely using on-demand VMs.
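The purchasing-mix problem described in the abstract above can be made concrete with a small sketch: given a hypothetical hourly VM-demand trace and assumed on-demand and reserved prices, brute-force the reservation level that minimizes total cost, paying on-demand for any overflow. All numbers and names here are illustrative and are not the paper's actual policies.

```python
# Minimal sketch of a purchasing-mix decision: reserve some VMs for every hour
# and cover any demand above the reserved pool with on-demand VMs.

ON_DEMAND_HOURLY = 0.10   # $/VM-hour, assumed
RESERVED_HOURLY  = 0.06   # effective $/VM-hour after the commitment, assumed

def mix_cost(utilization, reserved):
    """Total cost of serving an hourly VM-demand trace with `reserved`
    committed VMs (paid for every hour) plus on-demand overflow."""
    cost = 0.0
    for demand in utilization:
        cost += reserved * RESERVED_HOURLY
        cost += max(0, demand - reserved) * ON_DEMAND_HOURLY
    return cost

def best_reservation(utilization):
    """Brute-force the reservation level that minimizes long-term cost."""
    candidates = range(0, max(utilization) + 1)
    return min(candidates, key=lambda r: mix_cost(utilization, r))

demand = [10, 12, 8, 30, 11, 9, 10]      # hourly VM demand, made up
r = best_reservation(demand)
print(r, mix_cost(demand, r))
```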
  2.
    Cloud providers offer instances with similar compute capabilities (for example, instances with different generations of GPUs like K80s, P100s, and V100s) across many regions, availability zones, and on-demand and spot markets, with prices governed independently by individual supplies and demands. In this paper, using machine learning model training as an example application, we explore the potential cost reductions possible by leveraging this cross-cloud instance market. We present quantitative results on how the prices of cloud instances change with time, and how total costs can be decreased by considering this dynamic pricing market. Our preliminary experiments show that a) the optimal instance choice for a model is dependent on both the objective (e.g., cost, time, or a combination) and the model’s performance characteristics, b) the cost of moving training jobs between instances is cheap, c) jobs do not need to be preempted more frequently than once a day to leverage the benefits from spot instance price variations, and d) the cost of training a model can be decreased by as much as 3.5× compared to a static policy. We also look at contexts where users specify higher-level objectives over collections of jobs, show examples of policies for these contexts, and discuss additional challenges involved in making these cost reductions viable.
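A minimal sketch of the instance-selection idea above: choose the GPU offering that minimizes a combined cost-and-time objective for a training job. The offerings, prices, and throughputs below are assumed for illustration and are not taken from the paper.

```python
# Hedged sketch: pick the offering (GPU type x market) that minimizes a
# cost/time objective for a training job. Numbers are illustrative only.

offerings = [
    # (name,            $/hour, training throughput in steps/hour) -- assumed
    ("K80 on-demand",   0.90,   1_000),
    ("P100 spot",       0.55,   3_000),
    ("V100 on-demand",  3.06,   9_000),
    ("V100 spot",       0.95,   9_000),
]

def pick_instance(total_steps, alpha=1.0):
    """Minimize cost + alpha * hours; alpha trades money against time."""
    def objective(offering):
        name, price, throughput = offering
        hours = total_steps / throughput
        return price * hours + alpha * hours
    return min(offerings, key=objective)

print(pick_instance(total_steps=90_000, alpha=0.0))  # cost only
print(pick_instance(total_steps=90_000, alpha=5.0))  # cost plus a time penalty
```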
  3. During the past few years, all leading cloud providers have introduced burstable instances that can sprint their performance for a limited period to address sudden workload variations. Despite the availability of burstable instances, there is no clear understanding of how to minimize wasted resources by regulating their burst capacity to match workload requirements. This is especially true when it comes to non-CPU-intensive applications. In this paper, we investigate how to limit network and I/O usage to optimize the efficiency of the bursting process. We also study which resources should be controlled to benefit both cloud providers and end-users. We design MRburst (Multi-Resource burstable performance scheduler) to automatically limit multiple resources (i.e., network, I/O, and CPU) and make the application comply with a user-defined service level objective (SLO) while minimizing wasted resources. MRburst is evaluated on Amazon EC2 using two multi-resource applications: an FTP server and a Ceph system. Experimental results show that MRburst outperforms state-of-the-art approaches by allowing instances to speed up their performance for up to 2.4 times longer periods while meeting the SLO.
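The general idea of regulating burst capacity against an SLO can be sketched as a simple feedback loop that tightens a resource cap while the SLO has slack and relaxes it when the SLO is missed. This is a toy illustration under assumed latencies and caps, not MRburst's actual algorithm.

```python
# Toy illustration (not MRburst itself): trim a resource cap, e.g. a network
# bandwidth limit, down toward the smallest value that still meets a
# user-defined SLO, so burst capacity is not wasted.

def adjust_cap(cap, measured_latency, slo_latency, step=0.1):
    """Tighten the cap while the SLO has slack; relax it when the SLO is missed."""
    if measured_latency > slo_latency:       # SLO violated: give the resource back
        return cap * (1 + step)
    return cap * (1 - step)                  # SLO met: reclaim unused burst capacity

cap = 100.0   # Mbit/s, hypothetical starting limit
for latency in [20, 22, 25, 31, 28, 27]:    # observed request latencies (ms), made up
    cap = adjust_cap(cap, latency, slo_latency=30)
    print(round(cap, 1))
```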
  4. Harguess, Joshua D; Bastian, Nathaniel D; Pace, Teresa L (Ed.)
    Outsourcing computational tasks to the cloud offers numerous advantages, such as availability, scalability, and elasticity. These advantages are evident when outsourcing resource-demanding Machine Learning (ML) applications. However, cloud computing presents security challenges. For instance, allocating Virtual Machines (VMs) with varying security levels onto commonly shared servers creates cybersecurity and privacy risks. Researchers have proposed several cryptographic methods to protect privacy, such as Multi-party Computation (MPC). Unfortunately, attackers can still gain unauthorized access to users’ data if they successfully compromise a specific number of the participating MPC nodes. Cloud Service Providers (CSPs) can mitigate the risk of such attacks by distributing the MPC protocol over VMs allocated to separate physical servers (i.e., hypervisors). On the other hand, underutilizing cloud servers increases operational and resource costs, and worsens the overhead of MPC protocols. In this ongoing work, we address the security, communication and computation overheads, and performance limitations of MPC. We model this multi-objective optimization problem using several approaches, including, but not limited to, zero-sum and non-zero-sum games. For example, we investigate Nash Equilibrium (NE) allocation strategies that reduce potential security risks, while minimizing response time and performance overhead, and/or maximizing resource usage.
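One way to make the game-theoretic framing above concrete is a toy zero-sum game between a defender, who spreads or consolidates MPC VMs across hypervisors, and an attacker, who compromises one hypervisor. The payoffs below are invented solely for illustration and do not come from this work.

```python
# Toy zero-sum allocation game: the entry is the defender's loss
# (MPC shares exposed) if the attacker compromises the chosen hypervisor.

# rows: defender strategy, cols: attacker target
payoff = {
    ("spread",      "hypervisor_A"): 1,
    ("spread",      "hypervisor_B"): 1,
    ("consolidate", "hypervisor_A"): 3,
    ("consolidate", "hypervisor_B"): 0,
}
defender = ["spread", "consolidate"]
attacker = ["hypervisor_A", "hypervisor_B"]

# Pure-strategy security level: the defender minimizes the worst-case loss (minimax).
best = min(defender, key=lambda d: max(payoff[d, a] for a in attacker))
print(best, max(payoff[best, a] for a in attacker))   # -> spread 1
```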
  5.
    Cloud users can significantly reduce their cost (by up to 60%) by reserving virtual machines (VMs) for long periods (1 or 3 years) rather than acquiring them on demand. Unfortunately, reserving VMs exposes users to demand risk that can increase cost if their expected future demand does not materialize. Since accurately forecasting demand over long periods is challenging, users often limit their use of reserved VMs. To mitigate demand risk, Amazon operates a Reserved Instance Marketplace (RIM) where users may publicly list the remaining time on their VM reservations for sale at a price they set. The RIM enables users to limit demand risk by either selling VM reservations if their demand changes, or purchasing variable- and shorter-term VM reservations that better match their demand forecast horizon. Clearly, the RIM’s potential to mitigate demand risk is a function of its price characteristics. However, to the best of our knowledge, historical RIM prices have neither been made publicly available nor analyzed. To address the problem, we have been monitoring and archiving RIM prices for 1.75 years across all 69 availability zones and 22 regions in Amazon’s Elastic Compute Cloud (EC2). This paper provides a first look at this data and its implications for cost-effectively provisioning cloud infrastructure. 
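A back-of-the-envelope sketch of the demand-risk trade-off studied above: with an assumed 60% reserved discount, a 3-year reservation breaks even only if it is used for roughly 40% of the term, and the option to resell the remainder on the RIM (here at an assumed 50% of its reserved value) lowers that break-even point. All numbers are hypothetical.

```python
# Hypothetical break-even arithmetic for reserving vs. buying on demand,
# with and without reselling the unused remainder on the RIM.

ON_DEMAND = 1.00        # normalized on-demand price per hour
RESERVED  = 0.40        # effective reserved price per hour (a 60% discount), assumed
TERM      = 3 * 8760    # 3-year term in hours

def reservation_cost(hours_used, resale_fraction=0.0):
    """Cost of a reservation used for `hours_used` hours, selling the unused
    remainder on the RIM for `resale_fraction` of its reserved value."""
    unused = TERM - hours_used
    return RESERVED * TERM - resale_fraction * RESERVED * unused

def on_demand_cost(hours_used):
    return ON_DEMAND * hours_used

# Without resale, the reservation breaks even at 40% utilization of the term;
# a RIM resale at 50 cents on the dollar pushes that break-even lower.
for used in [0.2, 0.4, 0.6]:
    h = int(used * TERM)
    print(used,
          reservation_cost(h) <= on_demand_cost(h),
          reservation_cost(h, resale_fraction=0.5) <= on_demand_cost(h))
```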