spotDNN: Provisioning Spot Instances for Predictable Distributed DNN Training in the Cloud

Ruitao Shang, Zhuoyan Bai

Distributed Deep Neural Network (DDNN) training on cloud spot instances is increasingly compelling as it can significantly save the user budget. To handle unexpected instance revocations, provisioning a heterogeneous cluster using the asynchronous parallel mechanism becomes the dominant method for DDNN training with spot instances. However, blindly provisioning a cluster of spot instances can easily result in unpredictable DDNN training performance, mainly because bottlenecks occur on the parameter server network bandwidth and PCIe bandwidth resources, as well as the inadequate cluster heterogeneity. To address the challenges above, we propose spotDNN, a heterogeneity-aware spot instance provisioning framework that provides predictable performance for DDNN training in the cloud. By explicitly considering the contention for bottle-neck resources, we first build an analytical performance model of DDNN training in heterogeneous clusters. It leverages the weighted average batch size and convergence coefficient to quantify the DDNN training loss in heterogeneous clusters. Through a lightweight workload profiling, we further design a cost-efficient instance provisioning strategy which incorporates the bounds calculation and sliding window techniques to effectively guarantee the training performance service level objectives (SLOs). We have implemented a prototype of spotDNN and conducted extensive experiments on Amazon EC2. Experiment results show that spotDNN can deliver predictable DDNN training performance while reducing the monetary cost by up to 68:1% compared to the existing solutions, yet with acceptable runtime overhead.

More Like this