We study the problem of communicationefficient distributed vector mean estimation, which is a commonly used subroutine in distributed optimization and Federated Learning (FL). Randk sparsification is a commonly used technique to reduce communication cost, where each client sends of its coordinates to the server. However, Randk is agnostic to any correlations, that might exist between clients in practical scenarios. The recently proposed RandkSpatial estimator leverages the crossclient correlation information at the server to improve Randk's performance. Yet, the performance of RandkSpatial is suboptimal, and improving mean estimation is key to faster convergence in distributed optimization. We propose the RandProjSpatial estimator with a more flexible encodingdecoding procedure, which generalizes the encoding of Rand
by projecting the client vectors to a random kdimensional subspace. We utilize Subsampled Randomized Hadamard Transform (SRHT) as the projection matrix and show that RandProjSpatial with SRHT outperforms RandkSpatial, using the correlation information more efficiently. Furthermore, we propose an approach to incorporate varying degrees of correlation and suggest a practical variant of RandProjSpatial when the correlation information is not available to the server. Finally, experiments on realworld distributed optimization tasks showcase the superior performance of RandProjSpatial compared to RandkSpatial and other more sophisticated sparsification techniques.
more »
« less
Leveraging Spatial and Temporal Correlations in Sparsified Mean Estimation
We study the problem of estimating at a central server the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are highdimensional, the communication cost of sending entire vectors may be prohibitive, and it may be imperative for them to use sparsification techniques. While most existing work on sparsified mean estimation is agnostic to the characteristics of the data vectors, in many practical applications such as federated learning, there may be spatial correlations (similarities in the vectors sent by different nodes) or temporal correlations (similarities in the data sent by a single node over different iterations of the algorithm) in the data vectors. We leverage these correlations by simply modifying the decoding method used by the server to estimate the mean. We provide an analysis of the resulting estimation error as well as experiments for PCA, KMeans and Logistic Regression, which show that our estimators consistently outperform more sophisticated and expensive sparsification methods.
more »
« less
 Award ID(s):
 2045694
 NSFPAR ID:
 10315454
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We study stochastic gradient descent (SGD) with local iterations in the presence of malicious/Byzantine clients, motivated by the federated learning. The clients, instead of communicating with the central server in every iteration, maintain their local models, which they update by taking several SGD iterations based on their own datasets and then communicate the net update with the server, thereby achieving communicationefficiency. Furthermore, only a subset of clients communicate with the server, and this subset may be different at different synchronization times. The Byzantine clients may collaborate and send arbitrary vectors to the server to disrupt the learning process. To combat the adversary, we employ an efficient highdimensional robust mean estimation algorithm from Steinhardt et al.~i̧te[ITCS 2018]Resilience_SCV18 at the server to filterout corrupt vectors; and to analyze the outlierfiltering procedure, we develop a novel matrix concentration result that may be of independent interest. We provide convergence analyses for stronglyconvex and nonconvex smooth objectives in the heterogeneous data setting, where different clients may have different local datasets, and we do not make any probabilistic assumptions on data generation. We believe that ours is the first Byzantineresilient algorithm and analysis with local iterations. We derive our convergence results under minimal assumptions of bounded variance for SGD and bounded gradient dissimilarity (which captures heterogeneity among local datasets). We also extend our results to the case when clients compute fullbatch gradients.more » « less

We study stochastic gradient descent (SGD) with local iterations in the presence of malicious/Byzantine clients, motivated by the federated learning. The clients, instead of communicating with the central server in every iteration, maintain their local models, which they update by taking several SGD iterations based on their own datasets and then communicate the net update with the server, thereby achieving communicationefficiency. Furthermore, only a subset of clients communicate with the server, and this subset may be different at different synchronization times. The Byzantine clients may collaborate and send arbitrary vectors to the server to disrupt the learning process. To combat the adversary, we employ an efficient highdimensional robust mean estimation algorithm from Steinhardt et al.at the server to filterout corrupt vectors; and to analyze the outlierfiltering procedure, we develop a novel matrix concentration result that may be of independent interest. We provide convergence analyses for stronglyconvex and nonconvex smooth objectives in the heterogeneous data setting, where different clients may have different local datasets, and we do not make any probabilistic assumptions on data generation. We believe that ours is the first Byzantineresilient algorithm and analysis with local iterations. We derive our convergence results under minimal assumptions of bounded variance for SGD and bounded gradient dissimilarity (which captures heterogeneity among local datasets). We also extend our results to the case when clients compute fullbatch gradients.more » « less

The popular Random Dot Product Graph (RDPG) generative model postulates that each node has an associated (latent) vector, and the probability of existence of an edge between two nodes is their innerproduct (with variants to consider directed and weighted graphs). In any case, the latent vectors may be estimated through a spectral decomposition of the adjacency matrix, the socalled Adjacency Spectral Embedding (ASE). Assume we are monitoring a stream of graphs and the objective is to track the latent vectors. Examples include recommender systems or monitoring of a wireless network. It is clear that performing the ASE of each graph separately may result in a prohibitive computation load. Furthermore, the invariance to rotations of the inner product complicates comparing the latent vectors at different timesteps. By considering the minimization problem underlying ASE, we develop an iterative algorithm that updates the latent vectors' estimation as new graphs from the stream arrive. Differently to other proposals, our method does not accumulate errors and thus does not requires periodically recomputing the spectral decomposition. Furthermore, the pragmatic setting where nodes leave or join the graph (e.g. a new product in the recommender system) can be accommodated as well. Our code is available at https://github.com/marfiori/efficientASEmore » « less

Many networked systems such as electric networks, the brain, and social networks of opinion dynamics are known to obey conservation laws. Examples of this phenomenon include the Kirchoff laws in electric networks and opinion consensus in social networks. Conservation laws in networked systems are modeled as balance equations of the form X=B*Y, where the sparsity pattern of B*∈R^{p×p} captures the connectivity of the network on p nodes, and Y, X ∈ R^p are vectors of ''potentials'' and ''injected flows'' at the nodes respectively. The node potentials Y cause flows across edges which aim to balance out the potential difference, and the flows X injected at the nodes are extraneous to the network dynamics. In several practical systems, the network structure is often unknown and needs to be estimated from data to facilitate modeling, management, and control. To this end, one has access to samples of the node potentials Y, but only the statistics of the node injections X. Motivated by this important problem, we study the estimation of the sparsity structure of the matrix B* from n samples of Y under the assumption that the node injections X follow a Gaussian distribution with a known covariance Σ_X. We propose a new ℓ1regularized maximum likelihood estimator for tackling this problem in the highdimensional regime where the size of the network may be vastly larger than the number of samples n. We show that this optimization problem is convex in the objective and admits a unique solution. Under a new mutual incoherence condition, we establish sufficient conditions on the triple (n,p,d) for which exact sparsity recovery of B* is possible with high probability; d is the degree of the underlying graph. We also establish guarantees for the recovery of B* in the elementwise maximum, Frobenius, and operator norms. Finally, we complement these theoretical results with experimental validation of the performance of the proposed estimator on synthetic and realworld data.more » « less