skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Adaptive Verifiable Coded Computing: Towards Fast, Secure and Private Distributed Machine Learning
Stragglers, Byzantine workers, and data privacy are the main bottlenecks in distributed cloud computing. Some prior works proposed coded computing strategies to jointly address all three challenges. They require either a large number of workers, a significant communication cost or a significant computational complexity to tolerate Byzantine workers. Much of the overhead in prior schemes comes from the fact that they tightly couple coding for all three problems into a single framework. In this paper, we propose Adaptive Verifiable Coded Computing (AVCC) framework that decouples the Byzantine node detection challenge from the straggler tolerance. AVCC leverages coded computing just for handling stragglers and privacy, and then uses an orthogonal approach that leverages verifiable computing to mitigate Byzantine workers. Furthermore, AVCC dynamically adapts its coding scheme to trade-off straggler tolerance with Byzantine protection. We evaluate AVCC on a compute-intensive distributed logistic regression application. Our experiments show that AVCC achieves up to 4.2× speedup and up to 5.1% accuracy improvement over the state-of-the-art Lagrange coded computing approach (LCC). AVCC also speeds up the conventional uncoded implementation of distributed logistic regression by up to 7.6×, and improves the test accuracy by up to 12.1%.  more » « less
Award ID(s):
2002874
PAR ID:
10426382
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Page Range / eLocation ID:
628 to 638
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider a scenario involving computations over a massive dataset stored distributedly across multiple workers, which is at the core of distributed learning algorithms. We propose Lagrange Coded Computing (LCC), a new framework to simultaneously provide (1) resiliency against stragglers that may prolong computations; (2) security against Byzantine (or malicious) workers that deliberately modify the computation for their benefit; and (3) (information-theoretic) privacy of the dataset amidst possible collusion of workers. LCC, which leverages the well-known Lagrange polynomial to create computation redundancy in a novel coded form across workers, can be applied to any computation scenario in which the function of interest is an arbitrary multivariate polynomial of the input dataset, hence covering many computations of interest in machine learning. LCC significantly generalizes prior works to go beyond linear computations. It also enables secure and private computing in distributed settings, improving the computation and communication efficiency of the state-of-the-art. Furthermore, we prove the optimality of LCC by showing that it achieves the optimal tradeoff between resiliency, security, and privacy, i.e., in terms of tolerating the maximum number of stragglers and adversaries, and providing data privacy against the maximum number of colluding workers. Finally, we show via experiments on Amazon EC2 that LCC speeds up the conventional uncoded implementation of distributed least-squares linear regression by up to 13.43×, and also achieves a 2.36×-12.65× speedup over the state-of-the-art straggler mitigation strategies. 
    more » « less
  2. Our extensive real measurements over Amazon EC2 show that the virtual instances often have different computing speeds even if they share the same configurations. This motivates us to study heterogeneous Coded Storage Elastic Computing (CSEC) systems where machines, with different computing speeds, join and leave the network arbitrarily over different computing steps. In CSEC systems, a Maximum Distance Separable (MDS) code is used for coded storage such that the file placement does not have to be re-defined with each elastic event. Computation assignment algorithms are used to minimize the computation time given computation speeds of different machines. While previous studies of heterogeneous CSEC do not include stragglers - the slow machines during the computation, we develop a new framework in heterogeneous CSEC that introduces straggler tolerance. Based on this framework, we design a novel algorithm using our previously proposed approach for heterogeneous CSEC such that the system can handle any subset of stragglers of a specified size while minimizing the computation time. Furthermore, we establish a trade-off in computation time and straggler tolerance. Another major limitation of existing CSEC designs is the lack of practical evaluations using real applications. In this paper, we evaluate the performance of our designs on Amazon EC2 for applications of the power iteration and linear regression. Evaluation results show that the proposed heterogeneous CSEC algorithms outperform the state-of-the-art designs by more than 30%. 
    more » « less
  3. Due to the surge of cloud-assisted AI services, the problem of designing resilient prediction serving systems that can effectively cope with stragglers and minimize response delays has attracted much interest. The common approach for tackling this problem is replication which assigns the same prediction task to multiple workers. This approach, however, is inefficient and incurs significant resource overheads. Hence, a learning-based approach known as parity model (ParM) has been recently proposed which learns models that can generate ``parities’’ for a group of predictions to reconstruct the predictions of the slow/failed workers. While this learning-based approach is more resource-efficient than replication, it is tailored to the specific model hosted by the cloud and is particularly suitable for a small number of queries (typically less than four) and tolerating very few stragglers (mostly one). Moreover, ParM does not handle Byzantine adversarial workers. We propose a different approach, named Approximate Coded Inference (ApproxIFER), that does not require training any parity models, hence it is agnostic to the model hosted by the cloud and can be readily applied to different data domains and model architectures. Compared with earlier works, ApproxIFER can handle a general number of stragglers and scales significantly better with the number of queries. Furthermore, ApproxIFER is robust against Byzantine workers. Our extensive experiments on a large number of datasets and model architectures show significant degraded mode accuracy improvement by up to 58% over ParM. 
    more » « less
  4. A communication-efficient protocol is introduced over a many-to-one quantum network for Q-E-B-MDS-X-TPIR, i.e., quantum private information retrieval with MDS-X-secure storage and T-private queries. The protocol is resilient to any set of up to E unresponsive servers (erased servers or stragglers) and any set of up to B Byzantine servers. The underlying coding scheme incorporates an enhanced version of a Cross Subspace Alignment (CSA) code, namely a Modified CSA (MCSA) code, into the framework of CSS codes. The error- correcting capabilities of CSS codes are leveraged to encode the dimensions that carry desired computation results from the MCSA code into the error space of the CSS code, while the undesired interference terms are aligned into the stabilized code space. The challenge is to do this efficiently while also correcting quantum erasures and Byzantine errors. The protocol achieves superdense coding gain over comparable classical baselines for Q-E-B-MDS-X-TPIR, recovers as special cases the state of art results for various other quantum PIR settings previously studied in the literature, and paves the way for applications in quantum coded distributed computation, where CSA code structures are important for communication efficiency, while security and resilience to stragglers and Byzantine servers are critical. 
    more » « less
  5. Coded elastic computing, introduced by Yang et al. in 2018, is a technique designed to mitigate the impact of elasticity in cloud computing systems, where machines can be preempted or be added during computing rounds. This approach utilizes maximum distance separable (MDS) coding for both storage and download in matrix-matrix multiplications. The proposed scheme is unable to tolerate stragglers and has high encoding complexity and upload cost. In 2023, we addressed these limitations by employing uncoded storage and Lagrange-coded download. However, it results in a large storage size. To address the challenges of storage size and upload cost, in this paper, we focus on Lagrange-coded elastic computing based on uncoded download. We propose a new class of elastic computing schemes, using Lagrange-coded storage with uncoded download (LCSUD). Our proposed schemes address both elasticity and straggler challenges while achieving lower storage size, reduced encoding complexity, and upload cost compared to existing methods. 
    more » « less