SGD and Hogwild! Convergence Without the Bounded Gradients Assumption
Stochastic gradient descent (SGD) is the optimization algorithm of choice in many machine learning applications such as regularized empirical risk minimization and training deep neural networks. The classical analysis of convergence of SGD is carried out under the assumption that the norm of the stochastic gradient is uniformly bounded. While this might hold for some loss functions, it is always violated for cases where the objective function is strongly convex. In (Bottou et al., 2016) a new analysis of convergence of SGD is performed under the assumption that stochastic gradients are bounded with respect to the true gradient norm. Here we show that for stochastic problems arising in machine learning such a bound always holds. Moreover, we propose an alternative convergence analysis of SGD with a diminishing learning rate regime, which results in more relaxed conditions than those in (Bottou et al., 2016). We then move on to the asynchronous parallel setting, and prove convergence of the Hogwild! algorithm in the same regime, obtaining the first convergence results for this method in the case of a diminishing learning rate.
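A minimal illustrative sketch of the diminishing learning-rate regime discussed above, on a strongly convex least-squares problem (the exact schedule and constants here are my own choices, not the paper's): for strongly convex objectives the gradient norm grows with the distance to the minimizer, so no uniform bound on the stochastic gradient can hold globally, yet SGD with a decaying step size still converges.

```python
import numpy as np

# Sketch (not the paper's exact scheme): SGD with a diminishing learning
# rate eta_t = eta_0 / (1 + c*t) on f(w) = (1/2n) * ||A w - b||^2, which
# is strongly convex, so ||grad f(w)|| grows linearly in ||w - w*||.
rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w_star = np.linalg.lstsq(A, b, rcond=None)[0]  # exact minimizer for reference

w = np.zeros(d)
eta0 = 0.05
for t in range(5000):
    i = rng.integers(n)                # sample one data point uniformly
    g = (A[i] @ w - b[i]) * A[i]       # unbiased stochastic gradient
    w -= eta0 / (1 + 0.01 * t) * g     # diminishing step size

print(np.linalg.norm(w - w_star))      # distance to the minimizer
```

The schedule `eta0 / (1 + 0.01 * t)` is one common choice of a diminishing step size; any schedule that decays while its partial sums diverge behaves similarly here.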
Publication Date:
NSF-PAR ID: 10110877
Journal Name: Proceedings of Machine Learning Research
Volume: 80
Page Range or eLocation-ID: 3750-3758
ISSN: 2640-3498
Sponsoring Org: National Science Foundation
More Like this


We study stochastic gradient descent (SGD) with local iterations in the presence of malicious/Byzantine clients, motivated by federated learning. The clients, instead of communicating with the central server in every iteration, maintain their local models, which they update by taking several SGD iterations based on their own datasets and then communicate the net update with the server, thereby achieving communication efficiency. Furthermore, only a subset of clients communicate with the server, and this subset may be different at different synchronization times. The Byzantine clients may collaborate and send arbitrary vectors to the server to disrupt the learning process. To combat the adversary, we employ an efficient high-dimensional robust mean estimation algorithm from Steinhardt et al. (ITCS 2018) at the server to filter out corrupt vectors; and to analyze the outlier-filtering procedure, we develop a novel matrix concentration result that may be of independent interest. We provide convergence analyses for strongly convex and nonconvex smooth objectives in the heterogeneous data setting, where different clients may have different local datasets, and we do not make any probabilistic assumptions on data generation. We believe that ours is the first Byzantine-resilient algorithm and analysis with local iterations. We derive our convergence results under minimal assumptions of bounded variance for …
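The communication pattern described above — sampled clients running several local SGD steps and sending only the net update — can be sketched as follows. All names and constants are illustrative, and the plain mean at the server is a stand-in for the robust (outlier-filtering) aggregation the paper actually uses.

```python
import numpy as np

# Hedged sketch of local-iteration SGD with client subsampling (constants
# are mine, not the paper's). Each client c minimizes its own quadratic
# ||w - m_c||^2 / 2, modeling heterogeneous local datasets.
rng = np.random.default_rng(1)
d, n_clients, local_steps, rounds, eta = 4, 10, 5, 50, 0.1
client_minima = rng.normal(size=(n_clients, d))

w = np.zeros(d)
for _ in range(rounds):
    sampled = rng.choice(n_clients, size=5, replace=False)  # varying subset
    updates = []
    for c in sampled:
        w_local = w.copy()
        for _ in range(local_steps):
            g = w_local - client_minima[c]   # local gradient
            w_local -= eta * g               # local SGD iteration
        updates.append(w_local - w)          # only the net update is sent
    # The paper replaces this plain mean with a robust mean estimate
    # to filter out Byzantine updates.
    w += np.mean(updates, axis=0)

print(np.linalg.norm(w - client_minima.mean(axis=0)))
```

Under this toy heterogeneity, the server model drifts toward the average of the clients' minimizers; Byzantine clients would inject arbitrary vectors into `updates`, which is what the filtering step guards against.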

We study distributed stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks. We consider the heterogeneous data model, where different workers may have different local datasets, and we do not make any probabilistic assumptions on data generation. At the core of our algorithm, we use the polynomial-time outlier-filtering procedure for robust mean estimation proposed by Steinhardt et al. (ITCS 2018) to filter out corrupt gradients. In order to be able to apply their filtering procedure in our heterogeneous data setting where workers compute stochastic gradients, we derive a new matrix concentration result, which may be of independent interest. We provide convergence analyses for smooth strongly convex and nonconvex objectives and show that our convergence rates match that of vanilla SGD in the Byzantine-free setting. In order to bound the heterogeneity, we assume that the gradients at different workers have bounded deviation from each other, and we also provide concrete bounds on this deviation in the statistical heterogeneous data model.
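A simplified spectral-filtering sketch in the spirit of the robust mean estimation cited above (this is not the paper's exact procedure, and the threshold is illustrative): repeatedly find the top principal direction of the centered samples and drop the point with the largest projection along it, until the top eigenvalue of the empirical covariance is small.

```python
import numpy as np

def filtered_mean(x, threshold=5.0, max_drops=None):
    # Simplified outlier-filtering mean estimate (illustrative variant,
    # not the exact algorithm of Steinhardt et al.).
    x = np.asarray(x, dtype=float).copy()
    if max_drops is None:
        max_drops = len(x) // 4              # cap the number of removals
    for _ in range(max_drops):
        mu = x.mean(axis=0)
        cov = (x - mu).T @ (x - mu) / len(x)
        eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
        v, lam = eigvecs[:, -1], eigvals[-1]
        if lam <= threshold:                 # covariance already small: done
            break
        proj = np.abs((x - mu) @ v)          # scores along worst direction
        x = np.delete(x, np.argmax(proj), axis=0)
    return x.mean(axis=0)

rng = np.random.default_rng(2)
good = rng.normal(size=(90, 3))              # honest stochastic gradients
bad = rng.normal(loc=50.0, size=(10, 3))     # Byzantine vectors
est = filtered_mean(np.vstack([good, bad]))
print(np.linalg.norm(est - good.mean(axis=0)))
```

Because the Byzantine vectors inflate the covariance along one direction, they are removed first, and the returned mean stays close to the honest mean; the plain mean of all 100 vectors would be pulled far off.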


Structured nonconvex learning problems, for which critical points have favorable statistical properties, arise frequently in statistical machine learning. Algorithmic convergence and statistical estimation rates are well-understood for such problems. However, quantifying the uncertainty associated with the underlying training algorithm is not well-studied in the nonconvex setting. In order to address this shortcoming, in this work, we establish an asymptotic normality result for the constant step size stochastic gradient descent (SGD) algorithm, a widely used algorithm in practice. Specifically, based on the relationship between SGD and Markov Chains [1], we show that the average of SGD iterates is asymptotically normally distributed around the expected value of their unique invariant distribution, as long as the nonconvex and nonsmooth objective function satisfies a dissipativity property. We also characterize the bias between this expected value and the critical points of the objective function under various local regularity conditions. Together, the above two results could be leveraged to construct confidence intervals for nonconvex problems that are trained using the SGD algorithm.
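A toy sketch of how such a normality result could be used in practice (the setup, a one-dimensional quadratic with a burn-in, is my own simplification, not the paper's dissipative nonconvex setting): run constant step-size SGD, average the iterates, and form a normal confidence interval for the expected value of the invariant distribution from independent replications.

```python
import numpy as np

# Illustrative sketch: averaged constant step-size SGD on a quadratic
# with minimizer 1.0, plus a replication-based normal confidence interval.
rng = np.random.default_rng(3)

def averaged_sgd(eta=0.05, steps=2000, burn_in=200):
    w = 0.0
    for _ in range(burn_in):                 # discard transient iterates
        w -= eta * ((w - 1.0) + rng.normal())
    total = 0.0
    for _ in range(steps):
        w -= eta * ((w - 1.0) + rng.normal())  # noisy gradient of (w-1)^2/2
        total += w
    return total / steps                     # average of SGD iterates

reps = np.array([averaged_sgd() for _ in range(50)])
center = reps.mean()
half = 1.96 * reps.std(ddof=1) / np.sqrt(len(reps))
print(center - half, center + half)          # approximate 95% CI
```

On this quadratic the invariant distribution is centered at the minimizer, so the interval concentrates near 1.0; in the nonconvex setting the abstract describes, the interval would instead center on the invariant-distribution mean, which can be biased away from a critical point.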