We consider the problem of privately estimating a parameter 𝔼[h(X1,…,Xk)], where X1, X2, …, Xk are i.i.d. data from some distribution and h is a permutation-invariant function. Without privacy constraints, standard estimators are U-statistics, which commonly arise in a wide range of problems, including nonparametric signed rank tests, symmetry testing, uniformity testing, and subgraph counts in random networks, and can be shown to be minimum variance unbiased estimators under mild conditions. Despite the recent outpouring of interest in private mean estimation, privatizing U-statistics has received little attention. While existing private mean estimation algorithms can be applied to obtain confidence intervals, we show that they can lead to suboptimal private error, e.g., constant-factor inflation in the leading term, or even Θ(1/n) rather than O(1/n²) in degenerate settings. To remedy this, we propose a new thresholding-based approach using local Hájek projections to reweight different subsets of the data. This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.
Differentially private estimation of U statistics
In this paper, we consider the problem of private estimation of U statistics. U statistics are widely used estimators that naturally arise in a broad class of problems, from nonparametric signed rank tests to subgraph counts in random networks. A U statistic is simply the average of an appropriate kernel applied to all subsets of a given size (known as the degree) drawn from a sample of size n. However, despite the recent outpouring of interest in private mean estimation, private algorithms for more general U statistics have received little attention. We propose a framework where, for a broad class of U statistics, one can use existing tools in private mean estimation to obtain confidence intervals in which the private error does not overwhelm the irreducible error resulting from the variance of the U statistic. However, in specific cases that arise when the U statistic is degenerate or has vanishing moments, the private error may be of a larger order than the non-private error. To remedy this, we propose a new thresholding-based approach that uses Hájek projections to re-weight different subsets. As we show, this leads to more accurate inference in certain settings.
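To make the definition concrete, here is a minimal sketch of the ordinary, non-private U statistic: the average of a symmetric kernel over all size-k subsets of the sample. It is not the private algorithm described in the abstract. The names `u_statistic` and `variance_kernel` and the sample values are illustrative assumptions; the kernel h(x, y) = (x − y)²/2 is the standard degree-2 kernel whose U statistic is the unbiased sample variance.

```python
from itertools import combinations
from math import comb


def u_statistic(data, kernel, degree):
    """Average a permutation-invariant kernel over all size-`degree` subsets.

    Non-private baseline only; names and signature are illustrative.
    """
    n = len(data)
    if degree < 1 or degree > n:
        raise ValueError("degree must be between 1 and len(data)")
    total = sum(kernel(*subset) for subset in combinations(data, degree))
    return total / comb(n, degree)


def variance_kernel(x, y):
    """Degree-2 kernel whose U statistic is the unbiased sample variance."""
    return 0.5 * (x - y) ** 2


# Hypothetical sample, chosen only for illustration.
sample = [1.0, 2.0, 4.0, 7.0]
estimate = u_statistic(sample, variance_kernel, degree=2)
print(estimate)
```

Enumerating all subsets costs on the order of n^k kernel evaluations, so this direct form is only practical for small samples or low degree.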
- Award ID(s):
- 2019844
- PAR ID:
- 10503105
- Editor(s):
- Under Review for COLT 2024
- Publisher / Repository:
- Proceedings of Machine Learning Research
- Date Published:
- Journal Name:
- Proceedings of Machine Learning Research
- Volume:
- 196
- ISSN:
- 2640-3498
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We study the problem of locally private mean estimation of high-dimensional vectors in the Euclidean ball. Existing algorithms for this problem either incur suboptimal error or have high communication and/or run-time complexity. We propose a new algorithmic framework, ProjUnit, for private mean estimation that yields algorithms that are computationally efficient, have low communication complexity, and incur optimal error up to a 1 + o(1)-factor. Our framework is deceptively simple: each randomizer projects its input to a random low-dimensional subspace, normalizes the result, and then runs an optimal algorithm such as PrivUnitG in the lower-dimensional space. In addition, we show that, by appropriately correlating the random projection matrices across devices, we can achieve fast server run-time. We mathematically analyze the error of the algorithm in terms of properties of the random projections, and study two instantiations. Lastly, our experiments for private mean estimation and private federated learning demonstrate that our algorithms empirically obtain nearly the same utility as optimal ones while having significantly lower communication and computational cost.
-
Robust statistics aims to compute quantities that represent data of which a fraction may be arbitrarily corrupted. The most essential statistic is the mean, and in recent years, there has been a flurry of theoretical advancement for efficiently estimating the mean in high dimensions on corrupted data. While several algorithms have been proposed that achieve near-optimal error, they all require a large data size as a function of the dimension. In this paper, we perform extensive experiments with various mean estimation techniques in the high-dimensional setting, where the data size might not meet this requirement. For data with inliers generated from a Gaussian with known covariance, we find experimentally that several robust mean estimation techniques can practically improve upon the sample mean, with the quantum entropy scaling approach from Dong et al. (NeurIPS 2019) consistently performing the best. However, this consistent improvement is conditioned on a couple of simple modifications to how the steps to prune outliers work in the high-dimension low-data setting, and when the inliers deviate significantly from Gaussianity. In fact, with these modifications, they are typically able to achieve roughly the same error as taking the sample mean of the uncorrupted inlier data, even with very low data size. In addition to controlled experiments on synthetic data, we also explore these methods on large language models, deep pretrained image models, and non-contextual word embedding models that do not necessarily have an inherent Gaussian distribution. Yet, in these settings, a mean point of a set of embedded objects is a desirable quantity to learn, and the data exhibits the high-dimension low-data setting studied in this paper. We show both the challenges of achieving this goal, and that our updated robust mean estimation methods can provide significant improvement over using just the sample mean.
We additionally publish a library of Python implementations of robust mean estimation algorithms, allowing practitioners and researchers to apply these techniques and to perform further experimentation.
-
We study the problem of private vector mean estimation in the shuffle model of privacy where n users each have a unit vector v^(i) ∈ ℝ^d. We propose a new multi-message protocol that achieves the optimal error using Õ(min(nε², d)) messages per user. Moreover, we show that any (unbiased) protocol that achieves optimal error requires each user to send Ω(min(nε², d)/log(n)) messages, demonstrating the optimality of our message complexity up to logarithmic factors. Additionally, we study the single-message setting and design a protocol that achieves mean squared error O(d n^{d/(d+2)} ε^{-4/(d+2)}). Moreover, we show that any single-message protocol must incur mean squared error Ω(d n^{d/(d+2)}), showing that our protocol is optimal in the standard setting where ε = Θ(1). Finally, we study robustness to malicious users and show that malicious users can incur large additive error with a single shuffler.

