Title: A Practical Algorithm Design and Evaluation for Heterogeneous Elastic Computing with Stragglers
Our extensive real measurements on Amazon EC2 show that virtual instances often have different computing speeds even when they share the same configuration. This motivates us to study heterogeneous Coded Storage Elastic Computing (CSEC) systems, where machines with different computing speeds join and leave the network arbitrarily over different computing steps. In CSEC systems, a Maximum Distance Separable (MDS) code is used for coded storage so that the file placement does not have to be redefined after each elastic event. Computation assignment algorithms are then used to minimize the computation time given the computation speeds of the different machines. While previous studies of heterogeneous CSEC do not account for stragglers (machines that are slow during the computation), we develop a new framework for heterogeneous CSEC that introduces straggler tolerance. Based on this framework, we design a novel algorithm, building on our previously proposed approach for heterogeneous CSEC, such that the system can handle any subset of stragglers of a specified size while minimizing the computation time. Furthermore, we establish a trade-off between computation time and straggler tolerance. Another major limitation of existing CSEC designs is the lack of practical evaluations using real applications. In this paper, we evaluate the performance of our designs on Amazon EC2 for power iteration and linear regression applications. Evaluation results show that the proposed heterogeneous CSEC algorithms outperform state-of-the-art designs by more than 30%.
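The paper's computation assignment algorithm is not reproduced in the abstract; the minimal Python sketch below only illustrates the MDS coded-storage idea it builds on: the data matrix is encoded into n coded row blocks, and the product A x can be recovered from the partial results of any k machines, so elastic departures or up to n - k stragglers do not force the data placement to change. The Vandermonde generator, block sizes, and all names are illustrative assumptions rather than the paper's specific construction.

```python
import numpy as np

def mds_encode(A, n, k):
    """Encode A into n coded row blocks with a Vandermonde-based MDS code
    (illustrative; not the paper's specific construction)."""
    blocks = np.split(A, k)                                   # k uncoded row blocks
    G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return coded, G

def recover_product(partial_results, G, responding_ids):
    """Recover A @ x from the partial results of any k responding machines."""
    Gk = G[responding_ids, :]                                 # k x k, invertible for distinct ids
    Y = np.stack(partial_results)                             # k coded partial products
    return np.concatenate(list(np.linalg.solve(Gk, Y)))       # uncoded blocks of A @ x, in order

# toy run: 6 machines store coded blocks, any 4 suffice (2 stragglers/preemptions tolerated)
rng = np.random.default_rng(0)
A, x = rng.standard_normal((8, 5)), rng.standard_normal(5)
coded, G = mds_encode(A, n=6, k=4)
responding = [0, 2, 3, 5]                                     # machines that finished in time
partial = [coded[i] @ x for i in responding]
assert np.allclose(recover_product(partial, G, responding), A @ x)
```

In an application such as power iteration, this coded matrix-vector product would simply be repeated each round with the updated vector.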
Award ID(s):
1817154
PAR ID:
10393717
Author(s) / Creator(s):
Date Published:
Journal Name:
2021 IEEE Global Communications Conference (GLOBECOM)
Page Range / eLocation ID:
1 to 6
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Elasticity is an important feature of modern cloud computing systems and can result in computation failures or significantly increased computing time. Such elasticity means that virtual machines in the cloud can be preempted on short notice (e.g., hours or minutes) if a high-priority job appears; on the other hand, new virtual machines may become available over time to compensate for the lost computing resources. Coded Storage Elastic Computing (CSEC), introduced by Yang et al. in 2018, is an effective and efficient approach to overcoming elasticity at a relatively low storage and computation cost. However, one limitation of CSEC is that it may only apply to certain types of computations (e.g., linear) and may be challenging to apply to more involved computations, because coded data storage and approximation are often needed. Hence, it may be preferable to use uncoded storage by directly copying data onto the virtual machines. In addition, based on our own measurements, virtual machines on Amazon EC2 clusters often have heterogeneous computation speeds even if they have exactly the same configurations (e.g., CPU, RAM, I/O cost). In this paper, we introduce a new optimization framework for Uncoded Storage Elastic Computing (USEC) systems with heterogeneous computing speeds to minimize the overall computation time. Under this framework, we propose optimal solutions for USEC systems with or without straggler tolerance using different storage placements. Our proposed algorithms are evaluated using power iteration applications on Amazon EC2.
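The USEC optimization framework and its optimal assignment algorithms are developed in the paper itself; the hedged sketch below only illustrates the core heterogeneity idea of giving each machine a share of the rows proportional to its measured speed, so that all machines finish at roughly the same time. The speed values and the proportional rounding rule are illustrative assumptions, not the paper's algorithm, which also accounts for storage placement and straggler tolerance.

```python
import numpy as np

def proportional_assignment(num_rows, speeds):
    """Assign contiguous row ranges so each machine's share is roughly
    proportional to its measured speed (rows per second).
    Illustrative sketch only."""
    speeds = np.asarray(speeds, dtype=float)
    # ideal fractional split, rounded to whole rows via cumulative boundaries
    boundaries = np.round(np.cumsum(speeds) / speeds.sum() * num_rows).astype(int)
    starts = np.concatenate(([0], boundaries[:-1]))
    return list(zip(starts, boundaries))          # (start_row, end_row) per machine

# toy example: 4 machines whose measured speeds differ by up to 2x
speeds = [1.0, 1.5, 2.0, 1.2]
ranges = proportional_assignment(num_rows=1000, speeds=speeds)
loads = [end - start for start, end in ranges]
finish_times = [load / s for load, s in zip(loads, speeds)]
print(ranges, loads, finish_times)                # near-equal finish times
```

In a power-iteration workload, each machine would apply its assigned rows to the current iterate in every round, so a near-equal per-round finish time directly reduces the overall computation time.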
  2. Coded elastic computing enables virtual machines to be preempted for high-priority tasks while allowing new virtual machines to join ongoing computation seamlessly. This paper addresses coded elastic computing for matrix-matrix multiplications with straggler tolerance by encoding both storage and download using Lagrange codes. In 2018, Yang et al. introduced the first coded elastic computing scheme for matrix-matrix multiplications, achieving a lower computational load requirement. However, this scheme lacks straggler tolerance and suffers from a high upload cost. Zhong et al. (2023) later tackled these shortcomings by employing uncoded storage and Lagrange-coded download; however, their approach requires each machine to store the entire dataset. This paper introduces a new class of elastic computing schemes that utilize Lagrange codes to encode both storage and download, achieving a reduced storage size. The proposed schemes efficiently mitigate both elasticity and straggler effects, with the storage size reduced to a fraction 1/L of that of Zhong et al.'s approach, at the expense of doubling the download cost. Moreover, we evaluate the proposed schemes on AWS EC2 by measuring computation time under two different task allocations: heterogeneous and cyclic assignments. Both assignments minimize the computation redundancy of the system while distributing varying computation loads across machines.
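The abstract above mentions two task allocations used in the AWS EC2 evaluation, a heterogeneous assignment and a cyclic assignment. The sketch below shows one common way to build a cyclic assignment with redundancy: each task is given to a fixed number of consecutive machines in cyclic order, so loads stay balanced and a task is missed only if all of its assigned machines straggle or are preempted. This is an illustrative assumption about the allocation's structure, not the paper's exact scheme.

```python
def cyclic_assignment(num_tasks, num_machines, redundancy):
    """Assign each task to `redundancy` machines in cyclic order, so every
    machine receives the same number of tasks and each task survives up to
    redundancy - 1 missing machines. Illustrative sketch only."""
    assignment = {m: [] for m in range(num_machines)}
    for task in range(num_tasks):
        for offset in range(redundancy):
            machine = (task + offset) % num_machines
            assignment[machine].append(task)
    return assignment

# toy example: 12 coded tasks, 6 machines, each task replicated on 2 machines
assignment = cyclic_assignment(num_tasks=12, num_machines=6, redundancy=2)
for machine, tasks in assignment.items():
    print(machine, tasks)                 # every machine gets 12 * 2 / 6 = 4 tasks
```

A speed-proportional split, like the one sketched after the first related abstract, would play the role of the heterogeneous assignment.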
  3. In cloud computing systems, elastic events and stragglers increase the uncertainty of the system, leading to computation delays. Coded elastic computing (CEC), introduced by Yang et al. in 2018, is a framework that mitigates the impact of elastic events using Maximum Distance Separable (MDS) coded storage. That work proposed CEC schemes for both matrix-vector multiplication and general matrix-matrix multiplication applications. However, in these applications, the proposed CEC scheme cannot tolerate stragglers due to the limitations imposed by MDS codes. In this paper, we propose a new elastic computing scheme using uncoded storage and Lagrange coded computing approaches. The proposed scheme can effectively mitigate the effects of both elasticity and stragglers. Moreover, it achieves lower complexity and a smaller recovery threshold than existing coded-storage-based schemes.
  4. Coded elastic computing, introduced by Yang et al. in 2018, is a technique designed to mitigate the impact of elasticity in cloud computing systems, where machines can be preempted or added during computing rounds. This approach utilizes maximum distance separable (MDS) coding for both storage and download in matrix-matrix multiplications. However, the original scheme is unable to tolerate stragglers and has high encoding complexity and upload cost. In 2023, we addressed these limitations by employing uncoded storage and Lagrange-coded download; however, that approach results in a large storage size. To address the challenges of storage size and upload cost, in this paper we focus on Lagrange-coded elastic computing with uncoded download. We propose a new class of elastic computing schemes using Lagrange-coded storage with uncoded download (LCSUD). Our proposed schemes address both elasticity and straggler challenges while achieving a lower storage size, reduced encoding complexity, and lower upload cost compared to existing methods.
  5. We consider a scenario involving computations over a massive dataset stored distributedly across multiple workers, which is at the core of distributed learning algorithms. We propose Lagrange Coded Computing (LCC), a new framework to simultaneously provide (1) resiliency against stragglers that may prolong computations; (2) security against Byzantine (or malicious) workers that deliberately modify the computation for their benefit; and (3) (information-theoretic) privacy of the dataset amidst possible collusion of workers. LCC, which leverages the well-known Lagrange polynomial to create computation redundancy in a novel coded form across workers, can be applied to any computation scenario in which the function of interest is an arbitrary multivariate polynomial of the input dataset, hence covering many computations of interest in machine learning. LCC significantly generalizes prior works to go beyond linear computations. It also enables secure and private computing in distributed settings, improving the computation and communication efficiency of the state-of-the-art. Furthermore, we prove the optimality of LCC by showing that it achieves the optimal tradeoff between resiliency, security, and privacy, i.e., in terms of tolerating the maximum number of stragglers and adversaries, and providing data privacy against the maximum number of colluding workers. Finally, we show via experiments on Amazon EC2 that LCC speeds up the conventional uncoded implementation of distributed least-squares linear regression by up to 13.43×, and also achieves a 2.36×-12.65× speedup over the state-of-the-art straggler mitigation strategies. 
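As a hedged, concrete illustration of the Lagrange-encoding idea described above: encode K data blocks as evaluations of the Lagrange interpolating polynomial u(z), have each worker apply the polynomial function f of interest to its coded block, and recover f on every data block by interpolating any deg(f)·(K-1)+1 worker results. The sketch below uses exact rational arithmetic, scalar blocks, and a quadratic f for readability; the actual LCC construction operates over finite fields and additionally provides the security and privacy guarantees, which are omitted here.

```python
from fractions import Fraction

def lagrange_interpolate(points, values, z):
    """Evaluate at z the unique polynomial through (points[i], values[i])."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(points, values)):
        term = Fraction(yi)
        for j, xj in enumerate(points):
            if j != i:
                term *= Fraction(z - xj, xi - xj)
        total += term
    return total

# K = 3 data blocks (scalars for simplicity), computation f(x) = x**2 (degree 2)
X = [Fraction(2), Fraction(-1), Fraction(3)]
alphas = [0, 1, 2]                    # points where the encoding polynomial u satisfies u(alpha_i) = X[i]
betas = [3, 4, 5, 6, 7, 8]            # one evaluation point per worker (N = 6)
f = lambda x: x * x

# encoder: worker j stores the coded block u(beta_j)
coded = [lagrange_interpolate(alphas, X, b) for b in betas]
# each worker applies f to its coded block; suppose the worker at beta = 7 straggles
results = {b: f(c) for b, c in zip(betas, coded) if b != 7}

# deg f(u(z)) = 2 * (K - 1) = 4, so any 5 results suffice (the recovery threshold)
pts, vals = zip(*results.items())
decoded = [lagrange_interpolate(pts, vals, a) for a in alphas]
assert decoded == [f(x) for x in X]   # recovers f(X_i) for every data block
```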