skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Distributed robust statistical learning: Byzantine mirror descent
We consider the distributed statistical learning problem in a high-dimensional adversarial scenario. At each iteration, $$m$$ worker machines compute stochastic gradients and send them to a master machine. However, an $$\alpha$$-fraction of $$m$$ worker machines, called Byzantine machines, may act adversarially and send faulty gradients. To guard against faulty information sharing, we develop a distributed robust learning algorithm based on Nesterov's dual averaging. This algorithms is provably robust against Byzantine machines whenever $$\alpha\in[0, 1/2)$$. For smooth convex functions, we show that running the proposed algorithm for $$T$$ iterations achieves a statistical error bound $$\tilde{O}\big(1/\sqrt{mT}+\alpha/\sqrt{T}\big)$$. This result holds for a large class of normed spaces and it matches the known statistical error bound for Byzantine stochastic gradient in the Euclidean space setting. A key feature of the algorithm is that the dimension dependence of the bound scales with the dual norm of the gradient; in particular, for probability simplex, we show that it depends logarithmically on the problem dimension $$d$$. Such a weak dependence on the dimension is desirable in high-dimensional statistical learning and it has been known to hold for the classical mirror descent but it appears to be new for the Byzantine gradient scenario.  more » « less
Award ID(s):
1708906 1809833
PAR ID:
10128754
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2019 IEEE Conference on Decision and Control (CDC)
Page Range / eLocation ID:
1822-1827
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study distributed stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks. We consider the heterogeneous data model, where different workers may have different local datasets, and we do not make any probabilistic assumptions on data generation. At the core of our algorithm, we use the polynomial-time outlier-filtering procedure for robust mean estimation proposed by Steinhardt et al. (ITCS 2018) to filter-out corrupt gradients. In order to be able to apply their filtering procedure in our heterogeneous data setting where workers compute stochastic gradients, we derive a new matrix concentration result, which may be of independent interest. We provide convergence analyses for smooth strongly-convex and non-convex objectives and show that our convergence rates match that of vanilla SGD in the Byzantine-free setting. In order to bound the heterogeneity, we assume that the gradients at different workers have bounded deviation from each other, and we also provide concrete bounds on this deviation in the statistical heterogeneous data model. 
    more » « less
  2. We study distributed stochastic gradient descent (SGD) in the master-worker architecture under Byzantine at- tacks. We consider the heterogeneous data model, where different workers may have different local datasets, and we do not make any probabilistic assumptions on data generation. At the core of our algorithm, we use the polynomial-time outlier-filtering procedure for robust mean estimation proposed by Steinhardt et al. (ITCS 2018) to filter-out corrupt gradients. In order to be able to apply their filtering procedure in our heterogeneous data setting where workers compute stochastic gradients, we derive a new matrix concentration result, which may be of independent interest. We provide convergence analyses for smooth strongly- convex and non-convex objectives and show that our convergence rates match that of vanilla SGD in the Byzantine-free setting. In order to bound the heterogeneity, we assume that the gradients at different workers have bounded deviation from each other, and we also provide concrete bounds on this deviation in the statistical heterogeneous data model. 
    more » « less
  3. We consider distributed gradient computation, where both data and computation are distributed among m worker machines, t of which can be Byzantine adversaries, and a designated (master) node computes the model/parameter vector for generalized linear models, iteratively, using proximal gradient descent (PGD), of which gradient descent (GD) is a special case. The Byzantine adversaries can (collaboratively) deviate arbitrarily from their gradient computation. To solve this, we propose a method based on data encoding and (real) error correction to combat the adversarial behavior. We can tolerate up to t <= (m−1)/2 corrupt worker nodes, which is 2 information-theoretically optimal. Our method does not assume any probability distribution on the data. We develop a sparse encoding scheme which enables computationally efficient data encoding. We demonstrate a trade-off between the number of adversaries tolerated and the resource requirement (storage and computational complexity). As an example, our scheme incurs a constant overhead (storage and computational complexity) over that required by the distributed PGD algorithm, without adversaries, for t <= m . Our encoding works as efficiently in the streaming data setting as it does in the offline setting, in which all the data is available beforehand. 
    more » « less
  4. We study stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks. Building upon the recent advances in algorithmic high-dimensional robust statistics, in each SGD iteration, master employs a non-trivial decoding to estimate the true gradient from the unbiased stochastic gradients received from workers, some of which may be corrupt. We provide convergence analyses for both strongly-convex and non-convex smooth objectives under standard SGD assumptions. We can control the approximation error of our solution in both these settings by the mini-batch size of stochastic gradients; and we can make the approximation error as small as we want, provided that workers use a sufficiently large mini-batch size. Our algorithm can tolerate less than 1/3 fraction of Byzantine workers. It can approximately find the optimal parameters in the strongly-convex setting exponentially fast, and reaches to an approximate stationary point in the non-convex setting with linear speed, i.e., with a rate of 1/T, thus, matching the convergence rates of vanilla SGD in the Byzantine-free setting. 
    more » « less
  5. We study a distributed policy evaluation problem in which a group of agents with jointly observed states and private local actions and rewards collaborate to learn the value function of a given policy via local computation and communication. This problem arises in various large-scale multi-agent systems, including power grids, intelligent transportation systems, wireless sensor networks, and multi-agent robotics. We develop and analyze a new distributed temporal-difference learning algorithm that minimizes the mean-square projected Bellman error. Our approach is based on a stochastic primal-dual method and we improve the best-known convergence rate from $$O(1/\sqrt{T})$$ to $O(1/T)$, where $$T$$ is the total number of iterations. Our analysis explicitly takes into account the Markovian nature of the sampling and addresses a broader class of problems than the commonly-used i.i.d. sampling scenario. 
    more » « less