We study the optimal design of a heterogeneous coded elastic computing (CEC) network where machines have varying relative computation speeds. CEC introduced by Yang et al. is a framework which mitigates the impact of elastic events, where machines join and leave the network. A set of data is distributed among storage constrained machines using a Maximum Distance Separable (MDS) code such that any subset of machines of a specific size can perform the desired computations. This design eliminates the need to re-distribute the data after each elastic event. In this work, we develop a process for an arbitrary heterogeneous computing network to minimize the overall computation time by defining an optimal computation load, or number of computations assigned to each machine. We then present an algorithm to define a specific computation assignment among the machines that makes use of the MDS code and meets the optimal computation load.
more »
« less
Topological Coded Distributed Computing
This paper considers the MapReduce-like coded distributed computing framework originally proposed by Li et al., which uses coding techniques when distributed computing servers exchange their computed intermediate values, in order to reduce the overall traffic load. In their original model, servers are connected via an error-free common communication bus allowing broadcast transmissions. However, this assumption is one of the major limitations for practical implementations since real-world data centers may have network topologies far more involved than a single broadcast bus. We formulate a topological coded distributed computing problem, where the computing servers communicate with each other through some switch network. By using a special instance of fat-tree topologies, referred to as t-ary fat-tree proposed by Al-Fares et al. which can be built by some inexpensive switches, we propose a coded distributed computing scheme to achieve the optimal max-link communication load (defined as the maximum load over all links) over any network topology.
more »
« less
- Award ID(s):
- 1817154
- NSF-PAR ID:
- 10297775
- Date Published:
- Journal Name:
- GLOBECOM 2020 - 2020 IEEE Global Communications Conference
- Page Range / eLocation ID:
- 1 to 6
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The widely-studied radio network model [Chlamtac and Kutten, 1985] is a graph-based description that captures the inherent impact of collisions in wireless communication. In this model, the strong assumption is made that node v receives a message from a neighbor if and only if exactly one of its neighbors broadcasts. We relax this assumption by introducing a new noisy radio network model in which random faults occur at senders or receivers. Specifically, for a constant noise parameter p ∈ [0,1), either every sender has probability p of transmitting noise or every receiver of a single transmission in its neighborhood has probability p of receiving noise. We first study single-message broadcast algorithms in noisy radio networks and show that the Decay algorithm [Bar-Yehuda et al., 1992] remains robust in the noisy model while the diameter-linear algorithm of Gasieniec et al., 2007 does not. We give a modified version of the algorithm of Gasieniec et al., 2007 that is robust to sender and receiver faults, and extend both this modified algorithm and the Decay algorithm to robust multi-message broadcast algorithms, broadcasting Ω(1/log n log log n) and Ω(1/log n) messages per round, respectively. We next investigate the extent to which (network) coding improves throughput in noisy radio networks. In particular, we study the coding cap -- the ratio of the throughput of coding to that of routing -- in noisy radio networks. We address the previously perplexing result of Alon et al. 2014 that worst case coding throughput is no better than worst case routing throughput up to constants: we show that the worst case throughput performance of coding is, in fact, superior to that of routing -- by a Θ(log(n)) gap -- provided receiver faults are introduced. However, we show that sender faults have little effect on throughput. In particular, we show that any coding or routing scheme for the noiseless setting can be transformed to be robust to sender faults with only a constant throughput overhead. These transformations imply that the results of Alon et al., 2014 carry over to noisy radio networks with sender faults as well. As a result, if sender faults are introduced then there exist topologies for which there is a Θ(log log n) gap, but the worst case throughput across all topologies is Θ(1/log n) for both coding and routing.more » « less
-
Stragglers, Byzantine workers, and data privacy are the main bottlenecks in distributed cloud computing. Some prior works proposed coded computing strategies to jointly address all three challenges. They require either a large number of workers, a significant communication cost or a significant computational complexity to tolerate Byzantine workers. Much of the overhead in prior schemes comes from the fact that they tightly couple coding for all three problems into a single framework. In this paper, we propose Adaptive Verifiable Coded Computing (AVCC) framework that decouples the Byzantine node detection challenge from the straggler tolerance. AVCC leverages coded computing just for handling stragglers and privacy, and then uses an orthogonal approach that leverages verifiable computing to mitigate Byzantine workers. Furthermore, AVCC dynamically adapts its coding scheme to trade-off straggler tolerance with Byzantine protection. We evaluate AVCC on a compute-intensive distributed logistic regression application. Our experiments show that AVCC achieves up to 4.2× speedup and up to 5.1% accuracy improvement over the state-of-the-art Lagrange coded computing approach (LCC). AVCC also speeds up the conventional uncoded implementation of distributed logistic regression by up to 7.6×, and improves the test accuracy by up to 12.1%.more » « less
-
In cloud computing systems, elastic events and stragglers increase the uncertainty of the system, leading to computation delays. Coded elastic computing (CEC) introduced by Yang et al. in 2018 is a framework which mitigates the impact of elastic events using Maximum Distance Separable (MDS) coded storage. It proposed a CEC scheme for both matrix-vector multiplication and general matrix-matrix multiplication applications. However, in these applications, the proposed CEC scheme cannot tolerate stragglers due to the limitations imposed by MDS codes. In this paper we propose a new elastic computing scheme using uncoded storage and Lagrange coded computing approaches. The proposed scheme can effectively mitigate the effects of both elasticity and stragglers. Moreover, it produces a lower complexity and smaller recovery threshold compared to existing coded storage based schemes.more » « less
-
Machine learning today involves massive distributed computations running on cloud servers, which are highly susceptible to slowdown or straggling. Recent work has demonstrated the effectiveness of erasure codes in mitigating such slowdown for linear computations, by adding redundant computations such that the entire computation can be recovered as long as a subset of nodes finish their assigned tasks. However, most machine learning algorithms typically involve non-linear computations that cannot be directly handled by these coded computing approaches. In this work, we propose a coded computing strategy for mitigating the effect of stragglers on non-linear distributed computations. Our strategy relies on the observation that many expensive non-linear functions can be decomposed into sums of cheap non-linear functions. We show that erasure codes, specifically rateless codes can be used to generate and compute random linear combinations of these functions at the nodes such that the original function can be computed as long as a subset of nodes return their computations. Simulations and experiments on AWS Lambda demonstrate the superiority of our approach over various uncoded baselines.more » « less