We study robust testing and estimation of discrete distributions in the strong contamination model. Our results cover both centralized setting and distributed setting with general local information constraints including communication and LDP constraints. Our technique relates the strength of manipulation attacks to the earth-mover distance using Hamming distance as the metric between messages (samples) from the users. In the centralized setting, we provide optimal error bounds for both learning and testing. Our lower bounds under local information constraints build on the recent lower bound methods in distributed inference. In the communication constrained setting, we develop novel algorithms based on random hashing and an L1-L1 isometry.
more »
« less
Collaborative Learning of Discrete Distributions under Heterogeneity and Communication Constraints
In modern machine learning, users often have to collaborate to learn distributions that generate the data. Communication can be a significant bottleneck. Prior work has studied homogeneous users—i.e., whose data follow the same discrete distribution—and has provided optimal communication-efficient methods. How- ever, these methods rely heavily on homogeneity, and are less applicable in the common case when users’ discrete distributions are heterogeneous. Here we consider a natural and tractable model of heterogeneity, where users’ discrete distributions only vary sparsely, on a small number of entries. We propose a novel two-stage method named SHIFT: First, the users collaborate by communicating with the server to learn a central distribution; relying on methods from robust statistics. Then, the learned central distribution is fine-tuned to estimate the indi- vidual distributions of users. We show that our method is minimax optimal in our model of heterogeneity and under communication constraints. Further, we provide experimental results using both synthetic data and n-gram frequency estimation in the text domain, which corroborate its efficiency.
more »
« less
- Award ID(s):
- 1943064
- PAR ID:
- 10434250
- Date Published:
- Journal Name:
- Advances in neural information processing systems
- ISSN:
- 1049-5258
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
A central theme in federated learning (FL) is the fact that client data distributions are often not independent and identically distributed (IID), which has strong implications on the training process. While most existing FL algorithms focus on the conventional non-IID setting of class imbalance or missing classes across clients, in practice, the distribution differences could be more complex, e.g., changes in class conditional (domain) distributions. In this paper, we consider this complex case in FL wherein each client has access to only one domain distribution. For tasks such as domain generalization, most existing learning algorithms require access to data from multiple clients (i.e., from multiple domains) during training, which is prohibitive in FL. To address this challenge, we propose a federated domain translation method that generates pseudodata for each client which could be useful for multiple downstream learning tasks. We empirically demonstrate that our translation model is more resource-efficient (in terms of both communication and computation) and easier to train in an FL setting than standard domain translation methods. Furthermore, we demonstrate that the learned translation model enables use of state-of-the-art domain generalization methods in a federated setting, which enhances accuracy and robustness to increases in the synchronization period compared to existing methodology.more » « less
-
The next-generation spectrum access system (SAS) for the Citizens Broadband Radio Service band is equipped with environmental sensors (ESCs) to detect the presence of noninformed incumbent users, which allows the SAS to dynamically reassign spectrum resource for low privilege users to avoid interference. However, the performance of existing single-node detection model is limited by the sensor’s geo-locations; whereas a naive distributed sensing network with improved detection accuracy introduces a high bandwidth overhead due to the frequent communication of spectrum data. In addition, many existing coherent spectrum sensing methods are not feasible for CBRS band due to the unknown operational characteristics of incumbent military wireless applications. To address these issues, we propose a machine learning based non-coherent spectrum sensing system: (F)eder(a)ted (I)ncumbent Detection in CB(R)S (FaIR). FaIR leverages a communication-efficient distributed learning framework, federated learning, for ESCs to collaborate and train a data-driven machine learning model for incumbent detection under minimal communication bandwidth. Our preliminary results show that the federated learning method can exploit the spatial diversity of ESCs and obtain an improved detection model comparing to a naive distributed sensing and centralized model framework. We evaluate the FaIR model with a variety of spectrum waveforms at varying SNRs. Our experiments showed that FaIR improves the average detection accuracy compared to the single-node method, using a fraction of the bandwidth compared to the naive multinode method.more » « less
-
Decentralized learning has emerged as an alternative method to the popular parameter-server framework which suffers from high communication burden, single-point failure and scalability issues due to the need of a central server. However, most existing works focus on a single shared model for all workers regardless of the data heterogeneity problem, rendering the resulting model performing poorly on individual workers. In this work, we propose a novel personalized decentralized learning algorithm named DePRL via shared representations. Our algorithm relies on ideas from representation learning theory to learn a low-dimensional global representation collaboratively among all workers in a fully decentralized manner, as well as a user-specific low-dimensional local head leading to a personalized solution for each worker. We show that DePRL achieves, for the first time, a provable \textit{linear speedup for convergence} with general non-linear representations (i.e., the convergence rate is improved linearly with respect to the number of workers). Experimental results support our theoretical findings showing the superiority of our method in data heterogeneous environments.more » « less
-
We study discrete panel data methods where unobserved heterogeneity is revealed in a first step, in environments where population heterogeneity is not discrete. We focus on two‐step grouped fixed‐effects (GFE) estimators, where individuals are first classified into groups using kmeans clustering, and the model is then estimated allowing for group‐specific heterogeneity. Our framework relies on two key properties: heterogeneity is a function—possibly nonlinear and time‐varying—of a low‐dimensional continuous latent type, and informative moments are available for classification. We illustrate the method in a model of wages and labor market participation, and in a probit model with time‐varying heterogeneity. We derive asymptotic expansions of two‐step GFE estimators as the number of groups grows with the two dimensions of the panel. We propose a data‐driven rule for the number of groups, and discuss bias reduction and inference.more » « less
An official website of the United States government

