Machine learning today involves massive distributed computations running on cloud servers, which are highly susceptible to slowdown or straggling. Recent work has demonstrated the effectiveness of erasure codes in mitigating such slowdown for linear computations, by adding redundant computations such that the entire computation can be recovered as long as a subset of nodes finish their assigned tasks. However, most machine learning algorithms typically involve non-linear computations that cannot be directly handled by these coded computing approaches. In this work, we propose a coded computing strategy for mitigating the effect of stragglers on non-linear distributed computations. Our strategy relies on the observation that many expensive non-linear functions can be decomposed into sums of cheap non-linear functions. We show that erasure codes, specifically rateless codes can be used to generate and compute random linear combinations of these functions at the nodes such that the original function can be computed as long as a subset of nodes return their computations. Simulations and experiments on AWS Lambda demonstrate the superiority of our approach over various uncoded baselines.
more »
« less
Rateless Sum-Recovery Codes For Distributed Non-Linear Computations
We address the problem of slowdown caused by straggling nodes in distributed non-linear computations. Many common non-linear computations can be written as a sum of inexpensive non-linear functions (e.g. Taylor series). Based on this observation, we propose a new class of rateless codes called rateless sum-recovery codes whose aim is to recover the sum of source symbols, without necessarily recovering individual symbols. Source symbols correspond to individual inexpensive functions and each encoded symbol is the sum of a subset of source symbols. Encoded symbols are computed in a distributed fashion and for a computation that can be written as a sum of m inexpensive functions, successful sum-recovery is possible with high probability as long as slightly more than m encoded symbols are received. Our code is rateless, systematic and has sparse parities. Moreover, encoded symbols are constructed by sampling without replacement at individual nodes, thereby making decoding superfluous if the encoded symbols from any node cover all source symbols. We validate our claims through a range of simulations and also discuss open questions for future works.
more »
« less
- Award ID(s):
- 2045694
- PAR ID:
- 10389236
- Date Published:
- Journal Name:
- Proceedings of the IEEE Information Theory Workshop (ITW)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We propose novel non-linear graph-based analog codes that directly encode k real-valued source samples into n real-valued samples by using (non-linear) sample-by-sample soft quantization of the input samples followed by a linear transformation on the soft-quantized values. Different from existing analog coding schemes, the proposed analog codes are able to produce additional output symbols in a rateless manner and can be decoded utilizing message passing algorithms dealing with real-valued nodes, achieving a performance close to the theoretical limits.more » « less
-
We consider the problem of coded distributed computing where a large linear computational job, such as a matrix multiplication, is divided into $$k$$ smaller tasks, encoded using an $(n,k)$ linear code, and performed over $$n$$ distributed nodes. The goal is to reduce the average execution time of the computational job. We provide a connection between the problem of characterizing the average execution time of a coded distributed computing system and the problem of analyzing the error probability of codes of length $$n$$ used over erasure channels. Accordingly, we present closed-form expressions for the execution time using binary random linear codes and the best execution time any linear-coded distributed computing system can achieve. It is also shown that there exist good binary linear codes that attain, asymptotically, the best performance any linear code, not necessarily binary, can achieve. We also investigate the performance of coded distributed computing systems using polar and Reed-Muller (RM) codes that can benefit from low-complexity decoding, and superior performance, respectively, as well as explicit constructions. The proposed framework in this paper can enable efficient designs of distributed computing systems given the rich literature in the channel coding theory.more » « less
-
The A -channel is a noiseless multiple access channel in which users simultaneously transmit Q-ary symbols and the receiver observes the set of transmitted symbols, but not their multiplicities. An A-channel is said to be unsourced if, additionally, users' transmissions are encoded across time using a common codebook and decoding of the transmitted messages is done without regard to the identities of the active users. An interesting variant of the unsourced A -channel is the unsourced A-channel with erasures (UACE), in which transmitted symbols are erased with a given independent and identically distributed probability. In this paper, we focus on designing a code that enables a list of transmitted codewords to be recovered despite the erasures of some of the transmitted symbols. To this end, we propose the linked-loop code (LLC), which uses parity bits to link each symbol to the previous M symbols in a tail-biting manner, i.e., the first symbols of the transmission are linked to the last ones. The decoding process occurs in two phases: the first phase decodes the codewords that do not suffer from any erasures, and the second phase attempts to recover the erased symbols using the available parities. We compare the performance of the LLC over the UACE with other codes in the literature and argue for the effectiveness of the construction. Our motivation for studying the UACE comes from its relevance in machine-type communication and coded compressed sensing.more » « less
-
Abstract A distributed sensing protocol uses a network of local sensing nodes to estimate a global feature of the network, such as a weighted average of locally detectable parameters. In the noiseless case, continuous-variable (CV) multipartite entanglement shared by the nodes can improve the precision of parameter estimation relative to the precision attainable by a network without shared entanglement; for an entangled protocol, the root mean square estimation error scales like 1/Mwith the numberMof sensing nodes, the so-called Heisenberg scaling, while for protocols without entanglement, the error scales like . However, in the presence of loss and other noise sources, although multipartite entanglement still has some advantages for sensing displacements and phases, the scaling of the precision withMis less favorable. In this paper, we show that using CV error correction codes can enhance the robustness of sensing protocols against imperfections and reinstate Heisenberg scaling up to moderate values ofM. Furthermore, while previous distributed sensing protocols could measure only a single quadrature, we construct a protocol in which both quadratures can be sensed simultaneously. Our work demonstrates the value of CV error correction codes in realistic sensing scenarios.more » « less