When training machine learning models using stochastic gradient descent (SGD) with a large number of nodes or massive edge devices, the communication cost of synchronizing gradients at every iteration is a key bottleneck that limits the scalability of the system and hinders the benefit of parallel computation. Local-update SGD algorithms, where worker nodes perform local iterations of SGD and periodically synchronize their local models, can effectively reduce the communication frequency and save the communication delay. In this paper, we propose a powerful framework, named Cooperative SGD, that subsumes a variety of local-update SGD algorithms (such as local SGD, elastic averaging SGD, and decentralized parallel SGD) and provides a unified convergence analysis. Notably, special cases of the unified convergence analysis provided by the cooperative SGD framework yield 1) the first convergence analysis of elastic averaging SGD for general non-convex objectives, and 2) improvements upon previous analyses of local SGD and decentralized parallel SGD. Moreover, we design new algorithms such as elastic averaging SGD with overlapped computation and communication, and decentralized periodic averaging which are shown to be 4x or more faster than the baseline in reaching the same training loss.
more »
« less
Accelerated consensus in multi-agent networks via memory of local averages
Classical mathematical models of information
sharing and updating in multi-agent networks use linear
operators. In the paradigmatic DeGroot model, for example,
agents update their states with linear combinations of their
neighbors’ current states. In prior work, an accelerated averaging
model employing the use of memory has been suggested
to accelerate convergence to a consensus state for undirected
networks. There, the DeGroot update on the current states is
followed by a linear combination with the previous states. We
propose a modification where the DeGroot update is applied
to the current and previous states and is then followed by a
linear combination step. We show that this simple modification
applied to undirected networks permits convergence even for
periodic networks. Further, it allows for faster convergence
than DeGroot and accelerated averaging models for suitable
networks and model parameters.
more »
« less
- Award ID(s):
- 1634664
- NSF-PAR ID:
- 10309551
- Date Published:
- Journal Name:
- Proceedings of the 60th IEEE Conference on Decision and Control (CDC 2021)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The effective resistance between a pair of nodes in a weighted undirected graph is defined as the potential difference induced between them when a unit current is injected at the first node and extracted at the second node, treating edge weights as the conductance values of edges. The effective resistance is a key quantity of interest in many applications and fields including solving linear systems, Markov Chains and continuous-time averaging networks. We develop an efficient linearly convergent distributed algorithm for computing effective resistances and demonstrate its performance through numerical studies. We also apply our algorithm to the consensus problem where the aim is to compute the average of node values in a distributed manner. We show that the distributed algorithm we developed for effective resistances can be used to accelerate the convergence of the classical consensus iterations considerably by a factor depending on the network structure.more » « less
-
We consider a certain class of nonlinear maps that preserve the probability simplex, i.e., stochastic maps, that are inspired by the DeGroot-Friedkin model of belief/opinion propagation over influence networks. The corresponding dynamical models describe the evolution of the probability distribution of interacting species. Such models where the probability transition mechanism depends nonlinearly on the current state are often referred to as nonlinear Markov chains. In this paper we develop stability results and study the behavior of representative opinion models. The stability certificates are based on the contractivity of the nonlinear evolution in the l1-metric. We apply the theory to two types of opinion models where the adaptation of the transition probabilities to the current state is exponential and linear, respectively–both of these can display a wide range of behaviors. We discuss continuous-time and other generalizationsmore » « less
-
null (Ed.)The significance of the water-side gas transfer velocity for air–sea CO2 gas exchange (k) and its non-linear dependence on wind speed (U) is well accepted. What remains a subject of inquiry are biases associated with the form of the non-linear relation linking k to U (hereafter labeled as f(U), where f(.) stands for an arbitrary function of U), the distributional properties of U (treated as a random variable) along with other external factors influencing k, and the time-averaging period used to determine k from U. To address the latter issue, a Taylor series expansion is applied to separate f(U) into a term derived from time-averaging wind speed (labeled as ⟨U⟩, where ⟨.⟩ indicates averaging over a monthly time scale) as currently employed in climate models and additive bias corrections that vary with the statistics of U. The method was explored for nine widely used f(U) parameterizations based on remotely-sensed 6-hourly global wind products at 10 m above the sea-surface. The bias in k of monthly estimates compared to the reference 6-hourly product was shown to be mainly associated with wind variability captured by the standard deviation σσU around ⟨U⟩ or, more preferably, a dimensionless coefficient of variation Iu= σσU/⟨U⟩. The proposed correction outperforms previous methodologies that adjusted k when using ⟨U⟩ only. An unexpected outcome was that upon setting Iu2 = 0.15 to correct biases when using monthly wind speed averages, the new model produced superior results at the global and regional scale compared to prior correction methodologies. Finally, an equation relating Iu2 to the time-averaging interval (spanning from 6 h to a month) is presented to enable other sub-monthly averaging periods to be used. While the focus here is on CO2, the theoretical tactic employed can be applied to other slightly soluble gases. As monthly and climatological wind data are often used in climate models for gas transfer estimates, the proposed approach provides a robust scheme that can be readily implemented in current climate models.more » « less
-
Stochastic optimization algorithms update models with cheap per-iteration costs sequentially, which makes them amenable for large-scale data analysis. Such algorithms have been widely studied for structured sparse models where the sparsity information is very specific, e.g., convex sparsity-inducing norms or ℓ0-norm. However, these norms cannot be directly applied to the problem of complex (non-convex) graph-structured sparsity models, which have important application in disease outbreak and social networks, etc. In this paper, we propose a stochastic gradient-based method for solving graph-structured sparsity constraint problems, not restricted to the least square loss. We prove that our algorithm enjoys a linear convergence up to a constant error, which is competitive with the counterparts in the batch learning setting. We conduct extensive experiments to show the efficiency and effectiveness of the proposed algorithms.more » « less