We present a fast, differentially private algorithm for highdimensional covarianceaware mean estimation with nearly optimal sample complexity. Only exponentialtime estimators were previously known to achieve this guarantee. Given n samples from a (sub)Gaussian distribution with unknown mean μ and covariance Σ, our (ε,δ)differentially private estimator produces μ~ such that ∥μ−μ~∥Σ≤α as long as n≳dα2+dlog1/δ√αε+dlog1/δε. The Mahalanobis error metric ∥μ−μ^∥Σ measures the distance between μ^ and μ relative to Σ; it characterizes the error of the sample mean. Our algorithm runs in time O~(ndω−1+nd/ε), where ω<2.38 is the matrix multiplication exponent.
We adapt an exponentialtime approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above.
Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With n≳d3/2 samples, our estimate is accurate in spectral norm. This is the first such algorithm using n=o(d2) samples, answering an open question posed by Alabi et al. (2022). With n≳d2 samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance.
Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.
more »
« less
How can classical multidimensional scaling go wrong?
Given a matrix D describing the pairwise dissimilarities of a data set, a common task is to embed the data points into Euclidean space. The classical multidimensional scaling (cMDS) algorithm is a widespread method to do this. However, theoretical analysis of the robustness of the algorithm and an indepth analysis of its performance on nonEuclidean metrics is lacking. In this paper, we derive a formula, based on the eigenvalues of a matrix obtained from D, for the Frobenius norm of the difference between D and the metric Dcmds returned by cMDS. This error analysis leads us to the conclusion that when the derived matrix has a significant number of negative eigenvalues, then ∥D−Dcmds∥F, after initially decreasing, willeventually increase as we increase the dimension. Hence, counterintuitively, the quality of the embedding degrades as we increase the dimension. We empirically verify that the Frobenius norm increases as we increase the dimension for a variety of nonEuclidean metrics. We also show on several benchmark datasets that this degradation in the embedding results in the classification accuracy of both simple (e.g., 1nearest neighbor) and complex (e.g., multilayer neural nets) classifiers decreasing as we increase the embedding dimension.Finally, our analysis leads us to a new efficiently computable algorithm that returns a matrix Dl that is at least as close to the original distances as Dt (the Euclidean metric closest in ℓ2 distance). While Dl is not metric, when given as input to cMDS instead of D, it empirically results in solutions whose distance to D does not increase when we increase the dimension and the classification accuracy degrades less than the cMDS solution.
more »
« less
 Award ID(s):
 1750780
 NSFPAR ID:
 10344286
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Page Range / eLocation ID:
 1230412315
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We present a fast, differentially private algorithm for highdimensional covarianceaware mean estimation with nearly optimal sample complexity. Only exponentialtime estimators were previously known to achieve this guarantee. Given n samples from a (sub)Gaussian distribution with unknown mean μ and covariance Σ, our (ϵ,δ)differentially private estimator produces μ~ such that ∥μ−μ~∥Σ≤α as long as n≳dα2+dlog1/δ√αϵ+dlog1/δϵ. The Mahalanobis error metric ∥μ−μ^∥Σ measures the distance between μ^ and μ relative to Σ; it characterizes the error of the sample mean. Our algorithm runs in time O~(ndω−1+nd/\eps), where ω<2.38 is the matrix multiplication exponent.We adapt an exponentialtime approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above.Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With n≳d3/2 samples, our estimate is accurate in spectral norm. This is the first such algorithm using n=o(d2) samples, answering an open question posed by Alabi et al. (2022). With n≳d2 samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance.Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.more » « less

Banerjee, Arindam ; Zhou, ZhiHua (Ed.)To characterize the location (mean, median) of a set of graphs, one needs a notion of centrality that is adapted to metric spaces, since graph sets are not Euclidean spaces. A standard approach is to consider the Fr ́echet mean. In this work, we equip a set of graph with the pseudometric defined by the l2 norm between the eigenvalues of their respective adjacency matrix. Unlike the edit distance, this pseudometric reveals structural changes at multiple scales, and is well adapted to studying various statistical problems for graphvalued data. We describe an algorithm to compute an approximation to the sample Fr ́echet mean of a set of undirected unweighted graphs with a fixed size using this pseudometric.more » « less

Summary Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multisample approach to estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real datasets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. Theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback–Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset.more » « less

This article presents a new method to compute a selfintersection free highdimensional Euclidean embedding (SIFHDE^2) for surfaces and volumes equipped with an arbitrary Riemannian metric. It is already known that given a highdimensional (highd) embedding, one can easily compute an anisotropic Voronoi diagram by backmapping it to 3D space. We show here how to solve the inverse problem, i.e., given an input metric, compute a smooth intersectionfree highd embedding of the input such that the pullback metric of the embedding matches the input metric. Our numerical solution mechanism matches the deformation gradient of the 3D > higherd mapping with the given Riemannian metric. We demonstrate the applicability of our method, by using it to construct anisotropic Restricted Voronoi Diagram (RVD) and anisotropic meshing, that are otherwise extremely difficult to compute. In SIFHDE^2 space constructed by our algorithm, difficult 3D anisotropic computations are replaced with simple Euclidean computations, resulting in an isotropic RVD and its dual mesh on this highd embedding. Results are compared with the stateoftheart in anisotropic surface and volume meshings using several examples and evaluation metrics.more » « less