

Title: Analysis and Exploitation of Dynamic Pricing in the Public Cloud for ML Training
Cloud providers offer instances with similar compute capabilities (for example, instances with different generations of GPUs like K80s, P100s, V100s) across many regions, availability zones, and on-demand and spot markets, with prices governed independently by individual supplies and demands. In this paper, using machine learning model training as an example application, we explore the potential cost reductions possible by leveraging this cross-cloud instance market. We present quantitative results on how the prices of cloud instances change with time, and how total costs can be decreased by considering this dynamic pricing market. Our preliminary experiments show that a) the optimal instance choice for a model is dependent on both the objective (e.g., cost, time, or combination) and the model’s performance characteristics, b) the cost of moving training jobs between instances is low, c) jobs do not need to be preempted more frequently than once a day to leverage the benefits from spot instance price variations, and d) the cost of training a model can be decreased by as much as 3.5× compared to a static policy. We also look at contexts where users specify higher-level objectives over collections of jobs, show examples of policies for these contexts, and discuss additional challenges involved in making these cost reductions viable.
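Finding (a) in the abstract can be illustrated with a minimal sketch. The instance names, prices, and throughputs below are hypothetical placeholders, not measurements from the paper, and the selection rule is only one simple way to express "cost" versus "time" objectives:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    name: str              # hypothetical instance label
    price_per_hour: float  # hypothetical price, USD per hour
    throughput: float      # hypothetical training throughput, samples per second


def training_hours(offer: Offer, total_samples: int) -> float:
    """Hours needed to process total_samples at the offer's throughput."""
    return total_samples / offer.throughput / 3600.0


def training_cost(offer: Offer, total_samples: int) -> float:
    """Total cost in USD: hourly price times hours needed."""
    return offer.price_per_hour * training_hours(offer, total_samples)


def best_offer(offers, total_samples: int, objective: str) -> Offer:
    """Pick the offer minimizing the chosen objective ('cost' or 'time')."""
    metric = {"cost": training_cost, "time": training_hours}[objective]
    return min(offers, key=lambda o: metric(o, total_samples))


# Illustrative numbers only (assumptions, not data from the paper).
OFFERS = [
    Offer("slow-cheap", 0.30, 100.0),
    Offer("fast-pricey", 2.00, 400.0),
]
TOTAL_SAMPLES = 3_600_000

if __name__ == "__main__":
    for objective in ("cost", "time"):
        choice = best_offer(OFFERS, TOTAL_SAMPLES, objective)
        print(objective, "->", choice.name)
```

With these placeholder numbers the cheaper, slower instance minimizes cost while the faster, pricier one minimizes time, so the choice changes with the objective and with the model's throughput on each instance.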
Award ID(s):
1651570
NSF-PAR ID:
10213411
Journal Name:
VLDB DISPA Workshop 2020
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Amazon introduced spot instances in December 2009, enabling “customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price.” Amazon’s real-time computational spot market was novel in multiple respects. For example, it was the first (and to date only) large-scale public implementation of market-based resource allocation based on dynamic pricing after decades of research, and it provided users with useful information, control knobs, and options for optimizing the cost of running cloud applications. Spot instances also introduced the concept of transient cloud servers derived from variable idle capacity that cloud platforms could revoke at any time. Transient servers have since become central to efficient resource management of modern clusters and clouds. As a result, Amazon’s spot market was the motivation for substantial research over the past decade. Yet, in November 2017, Amazon effectively ended its real-time spot market by announcing that users no longer needed to place bids and that spot prices will “...adjust more gradually, based on longer-term trends in supply and demand.” The changes made spot instances more similar to the fixed-price transient servers offered by other cloud platforms. Unfortunately, while these changes made spot instances less complex, they eliminated many benefits to sophisticated users in optimizing their applications. This paper provides a retrospective on Amazon’s real-time spot market, including its advantages and disadvantages for allocating transient servers compared to current fixed-price approaches. We also discuss some fundamental problems with Amazon’s spot market, which we identified in prior work (from 2016), that predicted its eventual end. We then discuss potential options for allocating transient servers that combine the advantages of Amazon’s real-time spot market, while also addressing the problems that likely led to its elimination. 
  2. Distributed Deep Neural Network (DDNN) training on cloud spot instances is increasingly compelling as it can significantly reduce user costs. To handle unexpected instance revocations, provisioning a heterogeneous cluster using the asynchronous parallel mechanism has become the dominant method for DDNN training with spot instances. However, blindly provisioning a cluster of spot instances can easily result in unpredictable DDNN training performance, mainly because bottlenecks occur on the parameter server network bandwidth and PCIe bandwidth resources, as well as from inadequate cluster heterogeneity. To address the challenges above, we propose spotDNN, a heterogeneity-aware spot instance provisioning framework that provides predictable performance for DDNN training in the cloud. By explicitly considering the contention for bottleneck resources, we first build an analytical performance model of DDNN training in heterogeneous clusters. It leverages the weighted average batch size and convergence coefficient to quantify the DDNN training loss in heterogeneous clusters. Through lightweight workload profiling, we further design a cost-efficient instance provisioning strategy which incorporates the bounds calculation and sliding window techniques to effectively guarantee the training performance service level objectives (SLOs). We have implemented a prototype of spotDNN and conducted extensive experiments on Amazon EC2. Experiment results show that spotDNN can deliver predictable DDNN training performance while reducing the monetary cost by up to 68.1% compared to the existing solutions, yet with acceptable runtime overhead.
  3. Abstract

    Cool materials and rooftop vegetation help mitigate urban heating because they can reduce building cooling demands. This study assesses the cooling potential of different mitigation technologies using the Weather Research and Forecasting (WRF) model, taking the case of a tropical coastal climate in the Kolkata Metropolitan Area. The model was validated using data from six meteorological sites. The cooling potential of eight mitigation scenarios was evaluated: three cool roofs, four green roofs, and their combination (cool-city). The sensible heat, latent heat, heat storage, 2-m ambient temperature, surface temperature, air temperature, roof temperature, and urban canopy temperature were calculated. The effects on the urban boundary layer were also investigated.

    The different scenarios reduced the daytime temperature of various urban components, and the effect varied nearly linearly with increasing albedo and green roof fractions. For example, the maximum ambient temperature decreased by 3.6 °C, 0.9 °C, and 1.4 °C for a cool roof with 85% albedo, 100% rooftop vegetation, and their combination.

    The cost of different mitigation scenarios was assumed to depend on the construction options, location, and market prices. The price per square meter and the corresponding temperature decrease were related to one another. Recognizing the complex relationship between scenarios and construction options, the reductions in the maximum and minimum temperature across different cool and green roof cases were used to develop the cost estimates. This estimate thus attempted a summary of the price per degree of cooling for the different potential technologies.

    Higher green fraction, cool materials, and their combination generally reduced winds and enhanced buoyancy. The surface changes alter lower atmospheric dynamics, affecting low-level vertical mixing and producing a shallower boundary layer and weakened horizontal convective rolls during afternoon hours. Although cool materials offer the highest temperature reductions, the cooling resulting from their combination with a green roof strategy could mitigate or reverse the summertime heat island effect. The results highlight the possibilities for heat mitigation and offer insight into the different strategies and costs for mitigating urban heating and cooling demands.

     
  4. Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity coupled with the use of larger volumes of training data (to achieve acceptable accuracy) has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, with users having to pay a high upfront cost to acquire them. For infrequent use, users can, instead, leverage the public cloud to mitigate the high acquisition cost. However, with the wide diversity of hardware instances (particularly GPU instances) available in the public cloud, it becomes challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we try to address this problem by (i) introducing a comprehensive distributed deep learning (DDL) profiler Stash, which determines the various execution stalls that DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, it estimates two types of communication stalls, namely, interconnect and network stalls, that play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-analyzer, that computes only the CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances to help users make informed decisions. Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time and the network-connected instances can suffer from up to 5× slowdown compared to training on a single instance. Furthermore, (iii) we also model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
  5. Enabling participation of demand-side flexibility in electricity markets is key to improving power system resilience and increasing the penetration of renewable generation. In this work we are motivated by the curtailment of near-zero-marginal-cost renewable resources during periods of oversupply, a particularly important cause of inefficient generation dispatch. Focusing on shiftable load in a multi-interval economic dispatch setting, we show that incompatible incentives arise for loads in the standard market formulation. While the system's overall efficiency increases from dispatching flexible demand, the overall welfare of loads can decrease as a result of higher spot prices. We propose a market design to address this incentive issue. Specifically, by imposing a small number of additional constraints on the economic dispatch problem, we obtain a mechanism that guarantees individual rationality for all market participants while simultaneously obtaining a more efficient dispatch. Our formulation leads to a natural definition of a uniform, time-varying flexibility price that is paid to loads to incentivize flexible bidding. We provide theoretical guarantees and empirically validate our model with simulations on real-world generation data from California Independent System Operator (CAISO). 