This content will become publicly available on May 28, 2024

Title: Matrix Multiplication with Straggler Tolerance in Coded Elastic Computing via Lagrange Code
In cloud computing systems, elastic events and stragglers increase the uncertainty of the system, leading to computation delays. Coded elastic computing (CEC), introduced by Yang et al. in 2018, is a framework that mitigates the impact of elastic events using Maximum Distance Separable (MDS) coded storage. Yang et al. proposed a CEC scheme for both matrix-vector multiplication and general matrix-matrix multiplication. However, in these applications the CEC scheme cannot tolerate stragglers because of the limitations imposed by MDS codes. In this paper, we propose a new elastic computing scheme that combines uncoded storage with Lagrange coded computing. The proposed scheme effectively mitigates the effects of both elasticity and stragglers. Moreover, it achieves lower complexity and a smaller recovery threshold than existing schemes based on coded storage.
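As a rough illustration of the Lagrange coded computing idea mentioned in the abstract, the following minimal sketch encodes row blocks of a matrix with a Lagrange polynomial and recovers a matrix-vector product from any K of N worker results. It is a generic textbook-style example over floating-point numbers, not the scheme proposed in the paper; the function names, the evaluation points `betas` and `alphas`, and the sizes K = 3 and N = 5 are all assumptions chosen for illustration.

```python
import numpy as np

def lagrange_basis(points, j, z):
    """Value at z of the j-th Lagrange basis polynomial defined over `points`."""
    return np.prod([(z - p) / (points[j] - p)
                    for l, p in enumerate(points) if l != j])

def encode(blocks, betas, alphas):
    """Evaluate u(z) = sum_j blocks[j] * l_j(z) at every worker point in `alphas`."""
    return [sum(lagrange_basis(betas, j, a) * blocks[j] for j in range(len(blocks)))
            for a in alphas]

def decode(results, used_alphas, betas):
    """Interpolate g(z) = u(z) @ x from K results and evaluate it at each beta."""
    return [sum(lagrange_basis(used_alphas, i, b) * results[i]
                for i in range(len(results)))
            for b in betas]

# Hypothetical sizes: K data blocks, N workers, so up to N - K stragglers are tolerated.
K, N = 3, 5
rng = np.random.default_rng(0)
A = rng.integers(-5, 5, size=(6, 4)).astype(float)
x = rng.integers(-5, 5, size=4).astype(float)

blocks = np.split(A, K)               # K row blocks of A
betas = [0.0, 1.0, 2.0]               # points where u(beta_j) equals block j
alphas = [3.0, 4.0, 5.0, 6.0, 7.0]    # one evaluation point per worker

coded = encode(blocks, betas, alphas)
worker_out = [c @ x for c in coded]   # each worker multiplies its coded block by x

finished = [0, 2, 4]                  # assume workers 1 and 3 are stragglers
recovered = decode([worker_out[i] for i in finished],
                   [alphas[i] for i in finished], betas)
y = np.concatenate(recovered)
assert np.allclose(y, A @ x)
```

Because the worker computation is linear in the coded block, the results lie on a polynomial of degree K - 1, so any K finished workers suffice to interpolate it. A practical system would typically work over a finite field rather than with floats to avoid numerical conditioning issues.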
Award ID(s):
2145835
NSF-PAR ID:
10490351
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE
Date Published:
Journal Name:
ICC 2023 - IEEE International Conference on Communications
ISSN:
1938-1883
Page Range / eLocation ID:
136 to 141
Format(s):
Medium: X
Location:
Rome, Italy
Sponsoring Org:
National Science Foundation
More Like this
  1. Our extensive real measurements over Amazon EC2 show that virtual instances often have different computing speeds even if they share the same configuration. This motivates us to study heterogeneous Coded Storage Elastic Computing (CSEC) systems, where machines with different computing speeds join and leave the network arbitrarily over different computing steps. In CSEC systems, a Maximum Distance Separable (MDS) code is used for coded storage so that the file placement does not have to be redefined with each elastic event. Computation assignment algorithms are used to minimize the computation time given the computation speeds of the different machines. While previous studies of heterogeneous CSEC do not account for stragglers (machines that are slow during the computation), we develop a new framework for heterogeneous CSEC that introduces straggler tolerance. Based on this framework, we design a novel algorithm, building on our previously proposed approach for heterogeneous CSEC, such that the system can handle any subset of stragglers of a specified size while minimizing the computation time. Furthermore, we establish a trade-off between computation time and straggler tolerance. Another major limitation of existing CSEC designs is the lack of practical evaluations using real applications. In this paper, we evaluate the performance of our designs on Amazon EC2 for power iteration and linear regression applications. Evaluation results show that the proposed heterogeneous CSEC algorithms outperform the state-of-the-art designs by more than 30%.
  2. We study the optimal design of a heterogeneous coded elastic computing (CEC) network where machines have varying relative computation speeds. CEC, introduced by Yang et al., is a framework that mitigates the impact of elastic events, where machines join and leave the network. A set of data is distributed among storage-constrained machines using a Maximum Distance Separable (MDS) code such that any subset of machines of a specific size can perform the desired computations. This design eliminates the need to redistribute the data after each elastic event. In this work, we develop a process for an arbitrary heterogeneous computing network to minimize the overall computation time by defining an optimal computation load, that is, the number of computations assigned to each machine. We then present an algorithm to define a specific computation assignment among the machines that makes use of the MDS code and meets the optimal computation load.
  3. Matrix multiplication is a fundamental building block for large-scale computations arising in various applications, including machine learning. There has been significant recent interest in using coding to speed up distributed matrix multiplication in ways that are robust to stragglers (i.e., machines that may perform slower computations). In many scenarios, approximate matrix multiplication (i.e., allowing for a tolerable error) is sufficient in place of exact computation. Such approximate schemes make use of randomization techniques to speed up the computation process. In this paper, we initiate the study of approximate coded matrix multiplication and investigate the joint synergies offered by randomization and coding. Specifically, we propose two coded randomized sampling schemes that use (a) codes to achieve a desired recovery threshold and (b) random sampling to obtain an approximation of the matrix multiplication. Tradeoffs between the recovery threshold and the approximation error obtained through random sampling are investigated for a class of coded matrix multiplication schemes.
  4. There has been growing interest, in both theory and practice, in using the available redundancy in storage systems to mitigate stragglers in content download. This paper is concerned with MDS coded storage systems and studies the (n, k) data access model. When k = n, the system is equivalent to a fork-join queue, which is known to be notoriously hard to analyze, while the system with k = 1 has previously been shown to be equivalent to an M/G/1 queue. We argue here that the system with k = 2 is of practical interest, and then present a method that approximates the system as an M/G/1 queue. The approximated download time is shown to be more accurate than the bounds available in the literature. We also note that the presented method can be used to approximate systems that employ other newly designed and deployed storage codes.
  5. Elasticity is an important feature of modern cloud computing systems and can result in computation failure or significantly increased computing time. Elasticity means that virtual machines in the cloud can be preempted on short notice (e.g., hours or minutes) if a high-priority job appears; on the other hand, new virtual machines may become available over time to compensate for the lost computing resources. Coded Storage Elastic Computing (CSEC), introduced by Yang et al. in 2018, is an effective and efficient approach to overcoming elasticity, with relatively low storage and computation load. However, one limitation of CSEC is that it may only be applied to certain types of computations (e.g., linear) and may be challenging to apply to more involved computations, because coded data storage and approximation are often needed. Hence, it may be preferable to use uncoded storage by directly copying data onto the virtual machines. In addition, based on our own measurements, virtual machines on Amazon EC2 clusters often have heterogeneous computation speeds even if they have exactly the same configuration (e.g., CPU, RAM, I/O cost). In this paper, we introduce a new optimization framework for Uncoded Storage Elastic Computing (USEC) systems with heterogeneous computing speeds to minimize the overall computation time. Under this framework, we propose optimal solutions for USEC systems with or without straggler tolerance using different storage placements. Our proposed algorithms are evaluated using power iteration applications on Amazon EC2.