skip to main content


Title: Lightweight Projective Derivative Codes for Compressed Asynchronous Gradient Descent
Coded distributed computation has become common practice for performing gradient descent on large datasets to mitigate stragglers and other faults. This paper proposes a novel algorithm that encodes the partial derivatives themselves and furthermore optimizes the codes by performing lossy compression on the derivative codewords by maximizing the information contained in the codewords while minimizing the information between the codewords. The utility of this application of coding theory is a geometrical consequence of the observed fact in optimization research that noise is tolerable, sometimes even helpful, in gradient descent based learning algorithms since it helps avoid overfitting and local minima. This stands in contrast with much current conventional work on distributed coded computation which focuses on recovering all of the data from the workers. A second further contribution is that the low-weight nature of the coding scheme allows for asynchronous gradient updates since the code can be iteratively decoded; i.e., a worker’s task can immediately be updated into the larger gradient. The directional derivative is always a linear function of the direction vectors; thus, our framework is robust since it can apply linear coding techniques to general machine learning frameworks such as deep neural networks.  more » « less
Award ID(s):
2101388
NSF-PAR ID:
10385312
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 39th International Conference on Machine Learning
Volume:
PMLR 162
Page Range / eLocation ID:
20444 - 20458
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A new class of structured codes called quasi group codes (QGCs) is introduced. A QGC is a subset of a group code. In contrast with the group codes, QGCs are not closed under group addition. The parameters of the QGC can be chosen, such that the size of C C is equal to any number between C and C 2 . We analyze the performance of a specific class of QGCs. This class of QGCs is constructed by assigning single-letter distributions to the indices of the codewords in a group code. Then, the QGC is defined as the set of codewords whose index is in the typical set corresponding to these singleletter distributions. The asymptotic performance limits of this class of QGCs are characterized using single-letter information quantities. Corresponding covering and packing bounds are derived. It is shown that the point-to-point channel capacity and optimal rate-distortion function are achievable using QGCs. Coding strategies based on QGCs are introduced for three fundamental multi-terminal problems: the Körner-Marton problem for modulo prime-power sums, computation over the multiple access channel (MAC), and MAC with distributed states. For each problem, a single-letter achievable rate-region is derived. It is shown, through examples, that the coding strategies improve upon the previous strategies based on the unstructured codes, linear codes, and group codes. Index Terms— Quasi structure 
    more » « less
  2. The optic nerve transmits visual information to the brain as trains of discrete events, a low-power, low-bandwidth communication channel also exploited by silicon retina cameras. Extracting highfidelity visual input from retinal event trains is thus a key challenge for both computational neuroscience and neuromorphic engineering. Here, we investigate whether sparse coding can enable the reconstruction of high-fidelity images and video from retinal event trains. Our approach is analogous to compressive sensing, in which only a random subset of pixels are transmitted and the missing information is estimated via inference. We employed a variant of the Locally Competitive Algorithm to infer sparse representations from retinal event trains, using a dictionary of convolutional features optimized via stochastic gradient descent and trained in an unsupervised manner using a local Hebbian learning rule with momentum. We used an anatomically realistic retinal model with stochastic graded release from cones and bipolar cells to encode thumbnail images as spike trains arising from ON and OFF retinal ganglion cells. The spikes from each model ganglion cell were summed over a 32 msec time window, yielding a noisy rate-coded image. Analogous to how the primary visual cortex is postulated to infer features from noisy spike trains arising from the optic nerve, we inferred a higher-fidelity sparse reconstruction from the noisy rate-coded image using a convolutional dictionary trained on the original CIFAR10 database. To investigate whether a similar approachworks on non-stochastic data, we demonstrate that the same procedure can be used to reconstruct high-frequency video from the asynchronous events arising from a silicon retina camera moving through a laboratory environment. 
    more » « less
  3. We consider distributed gradient computation, where both data and computation are distributed among m worker machines, t of which can be Byzantine adversaries, and a designated (master) node computes the model/parameter vector for generalized linear models, iteratively, using proximal gradient descent (PGD), of which gradient descent (GD) is a special case. The Byzantine adversaries can (collaboratively) deviate arbitrarily from their gradient computation. To solve this, we propose a method based on data encoding and (real) error correction to combat the adversarial behavior. We can tolerate up to t <= (m−1)/2 corrupt worker nodes, which is 2 information-theoretically optimal. Our method does not assume any probability distribution on the data. We develop a sparse encoding scheme which enables computationally efficient data encoding. We demonstrate a trade-off between the number of adversaries tolerated and the resource requirement (storage and computational complexity). As an example, our scheme incurs a constant overhead (storage and computational complexity) over that required by the distributed PGD algorithm, without adversaries, for t <= m . Our encoding works as efficiently in the streaming data setting as it does in the offline setting, in which all the data is available beforehand. 
    more » « less
  4. We consider a scenario involving computations over a massive dataset stored distributedly across multiple workers, which is at the core of distributed learning algorithms. We propose Lagrange Coded Computing (LCC), a new framework to simultaneously provide (1) resiliency against stragglers that may prolong computations; (2) security against Byzantine (or malicious) workers that deliberately modify the computation for their benefit; and (3) (information-theoretic) privacy of the dataset amidst possible collusion of workers. LCC, which leverages the well-known Lagrange polynomial to create computation redundancy in a novel coded form across workers, can be applied to any computation scenario in which the function of interest is an arbitrary multivariate polynomial of the input dataset, hence covering many computations of interest in machine learning. LCC significantly generalizes prior works to go beyond linear computations. It also enables secure and private computing in distributed settings, improving the computation and communication efficiency of the state-of-the-art. Furthermore, we prove the optimality of LCC by showing that it achieves the optimal tradeoff between resiliency, security, and privacy, i.e., in terms of tolerating the maximum number of stragglers and adversaries, and providing data privacy against the maximum number of colluding workers. Finally, we show via experiments on Amazon EC2 that LCC speeds up the conventional uncoded implementation of distributed least-squares linear regression by up to 13.43×, and also achieves a 2.36×-12.65× speedup over the state-of-the-art straggler mitigation strategies. 
    more » « less
  5. Distributed learning platforms for processing large scale data-sets are becoming increasingly prevalent. In typical distributed implementations, a centralized master node breaks the data-set into smaller batches for parallel processing across distributed workers to achieve speed-up and efficiency. Several computational tasks are of sequential nature, and involve multiple passes over the data. At each iteration over the data, it is common practice to randomly re-shuffle the data at the master node, assigning different batches for each worker to process. This random re-shuffling operation comes at the cost of extra communication overhead, since at each shuffle, new data points need to be delivered to the distributed workers. In this paper, we focus on characterizing the information theoretically optimal communication overhead for the distributed data shuffling problem. We propose a novel coded data delivery scheme for the case of no excess storage, where every worker can only store the assigned data batches under processing. Our scheme exploits a new type of coding opportunity and is applicable to any arbitrary shuffle, and for any number of workers. We also present information theoretic lower bounds on the minimum communication overhead for data shuffling, and show that the proposed scheme matches this lower bound for the worst-case communication overhead. 
    more » « less