Title: Statistical Approximating Distributions Under Differential Privacy
Statistics computed from data are viewed as random variables. When they are used for tasks like hypothesis testing and confidence intervals, their true finite-sample distributions are often replaced by approximating distributions that are easier to work with (for example, the Gaussian, which results from approximations justified by the Central Limit Theorem). When data are perturbed by differential privacy, the approximating distributions also need to be modified. Prior work provided various competing methods for creating such approximating distributions, with little formal justification beyond the fact that they worked well empirically. In this paper, we study the question of how to generate statistical approximating distributions for differentially private statistics and provide finite-sample guarantees for the quality of the approximations.
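To make the setting concrete, here is a minimal Python sketch that is not the construction from this paper: it assumes a sample mean released with the standard Laplace mechanism and builds a CLT-style Gaussian approximating distribution whose variance is inflated by the privacy noise. The helper name private_mean, the clipping bounds, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def private_mean(x, epsilon, lo, hi):
    """Release the mean of x under epsilon-differential privacy (Laplace mechanism).

    Values are clipped to [lo, hi], so the sensitivity of the mean is (hi - lo) / n.
    """
    x = np.clip(x, lo, hi)
    n = len(x)
    scale = (hi - lo) / (n * epsilon)          # Laplace noise scale for the mean
    return x.mean() + rng.laplace(0.0, scale), scale

# Hypothetical data and privacy parameters.
n, epsilon, lo, hi = 500, 1.0, 0.0, 1.0
x = rng.beta(2, 5, size=n)
m_priv, lap_scale = private_mean(x, epsilon, lo, hi)

# Approximating distribution for the private mean: the usual CLT variance of the
# sample mean plus the variance of the added Laplace noise (2 * scale^2).
sampling_var = x.var(ddof=1) / n               # note: not itself privatized here
privacy_var = 2.0 * lap_scale**2
approx = stats.norm(loc=m_priv, scale=np.sqrt(sampling_var + privacy_var))

# Rough 95% interval for the population mean based on this approximation.
print(approx.interval(0.95))
```

The sample variance plugged into the Gaussian above is computed non-privately and is only there to make the example run end to end.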
Award ID(s):
1702760 1228669 1947919
PAR ID:
10085740
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of Privacy and Confidentiality
Volume:
8
Issue:
1
ISSN:
2575-8527
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Abstract Estimating the mean of a probability distribution using i.i.d. samples is a classical problem in statistics, wherein finite-sample optimal estimators are sought under various distributional assumptions. In this paper, we consider the problem of mean estimation when independent samples are drawn from $$d$$-dimensional non-identical distributions possessing a common mean. When the distributions are radially symmetric and unimodal, we propose a novel estimator, which is a hybrid of the modal interval, shorth and median estimators and whose performance adapts to the level of heterogeneity in the data. We show that our estimator is near optimal when data are i.i.d. and when the fraction of ‘low-noise’ distributions is as small as $$\varOmega \left (\frac{d \log n}{n}\right )$$, where $$n$$ is the number of samples. We also derive minimax lower bounds on the expected error of any estimator that is agnostic to the scales of individual data points. Finally, we extend our theory to linear regression. In both the mean estimation and regression settings, we present computationally feasible versions of our estimators that run in time polynomial in the number of data points. (A minimal sketch of one classical ingredient, the shorth estimator, appears after this list.)
  2. Abstract Approximating a function with a finite series, e.g., involving polynomials or trigonometric functions, is a critical tool in computing and data analysis. The construction of such approximations via now-standard approaches like least squares or compressive sampling does not ensure that the approximation adheres to certain convex linear structural constraints, such as positivity or monotonicity. Existing approaches that ensure such structure are norm-dissipative, and this can have a deleterious impact when applying these approaches, e.g., when numerically solving partial differential equations. We present a new framework that enforces such structure on approximations via optimization and is simultaneously norm-preserving. This results in a conceptually simple convex optimization problem on the sphere, but the feasible set for such problems can be very complex. We establish well-posedness of the optimization problem through results on spherical convexity and design several spherical-projection-based algorithms to numerically compute the solution. Finally, we demonstrate the effectiveness of this approach through several numerical examples. (A minimal sketch of the two projections involved appears after this list.)
  3. Abstract Obtaining lightweight and accurate approximations of discretized objective functional Hessians in inverse problems governed by partial differential equations (PDEs) is essential to make both deterministic and Bayesian statistical large-scale inverse problems computationally tractable. The cubic computational complexity of dense linear algebraic tasks, such as Cholesky factorization, that provide a means to sample Gaussian distributions and determine solutions of Newton linear systems is a computational bottleneck at large scale. These tasks can be reduced to log-linear complexity by utilizing hierarchical off-diagonal low-rank (HODLR) matrix approximations. In this work, we show that a class of Hessians that arise from inverse problems governed by PDEs are well approximated by the HODLR matrix format. In particular, we study inverse problems governed by PDEs that model the instantaneous viscous flow of ice sheets. In these problems, we seek a spatially distributed basal sliding parameter field such that the flow predicted by the ice sheet model is consistent with ice sheet surface velocity observations. We demonstrate the use of HODLR Hessian approximation to efficiently sample the Laplace approximation of the posterior distribution, with the covariance further approximated by HODLR matrix compression. Computational studies are performed which illustrate ice sheet problem regimes for which the Gauss–Newton data-misfit Hessian is more efficiently approximated by the HODLR matrix format than by the low-rank (LR) format. We then demonstrate that HODLR approximations can be favorable, when compared to global LR approximations, for large-scale problems by studying the data-misfit Hessian associated with inverse problems governed by the first-order Stokes flow model on the Humboldt glacier and Greenland ice sheet. (A minimal sketch of generic HODLR compression appears after this list.)
  4. We consider the problem of efficiently approximating and encoding high-dimensional data sampled from a probability distribution ρ in ℝ^D that is nearly supported on a d-dimensional set, for example a d-dimensional manifold. Geometric Multi-Resolution Analysis (GMRA) provides a robust and computationally efficient procedure to construct low-dimensional geometric approximations of such a set at varying resolutions. We introduce GMRA approximations that adapt to the unknown regularity of the underlying set by introducing a thresholding algorithm on the geometric wavelet coefficients. We show that these data-driven, empirical geometric approximations perform well, when the threshold is chosen as a suitable universal function of the number of samples n, on a large class of measures ρ that are allowed to exhibit different regularity at different scales and locations, thereby efficiently encoding data from more complex measures than those supported on manifolds. These GMRA approximations are associated to a dictionary, together with a fast transform mapping data to d-dimensional coefficients, and an inverse of such a map, all of which are data-driven. The algorithms for both the dictionary construction and the transforms have complexity C D n log n, with the constant C exponential in d. Our work therefore establishes Adaptive GMRA as a fast dictionary learning algorithm, with approximation guarantees, for intrinsically low-dimensional data. We include several numerical experiments on both synthetic and real data, confirming our theoretical results and demonstrating the effectiveness of Adaptive GMRA. (A minimal sketch of a single-scale piecewise-PCA encoding appears after this list.)
  5. Megow, Nicole; Smith, Adam (Ed.)
    We revisit the classic Pandora's Box (PB) problem under correlated distributions on the box values. Recent work of [Shuchi Chawla et al., 2020] obtained constant-factor approximation algorithms for a restricted class of policies that visit boxes in a fixed order. In this work, we study the complexity of approximating the optimal policy, which may adaptively choose which box to visit next based on the values seen so far. Our main result establishes an approximation-preserving equivalence of PB to the well-studied Uniform Decision Tree (UDT) problem from stochastic optimization and a variant of the Min-Sum Set Cover (MSSC_f) problem. For distributions of support m, UDT admits a log m approximation, and while a constant-factor approximation in polynomial time is a long-standing open problem, constant-factor approximations are achievable in subexponential time [Ray Li et al., 2020]. Our main result implies that the same properties hold for PB and MSSC_f. We also study the case where the distribution over values is given more succinctly as a mixture of m product distributions. This problem is again related to a noisy variant of the Optimal Decision Tree, which is significantly more challenging. We give a constant-factor approximation that runs in time n^Õ(m²/ε²) when the mixture components on every box are either identical or separated in TV distance by ε. (A minimal sketch of a fixed-order policy on a tiny correlated instance appears after this list.)
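For item 1 above, the following sketch does not reproduce the paper's hybrid estimator; it only shows the classical one-dimensional shorth estimator (midpoint of the shortest interval covering half the samples), one of the ingredients the abstract names, on hypothetical data whose noise scales vary wildly across samples.

```python
import numpy as np

def shorth_estimate(samples):
    """Midpoint of the shortest interval containing ceil(n/2) of the samples.

    A classical robust location estimate; the paper's estimator is a more
    refined hybrid of modal-interval, shorth, and median ideas.
    """
    x = np.sort(np.asarray(samples))
    n = len(x)
    h = (n + 1) // 2                      # number of points the interval must cover
    widths = x[h - 1:] - x[: n - h + 1]   # width of every window of h consecutive points
    i = int(np.argmin(widths))
    return 0.5 * (x[i] + x[i + h - 1])

# Heterogeneous data: common mean 2.0, very different noise scales per sample.
rng = np.random.default_rng(1)
scales = np.where(rng.random(2000) < 0.1, 0.05, 5.0)   # 10% "low-noise" samples
samples = 2.0 + scales * rng.standard_normal(2000)

print(shorth_estimate(samples), np.median(samples), samples.mean())
```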
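For item 2 above, this is not the paper's spherical optimization; it only makes the two constraint sets concrete on a hypothetical grid discretization: the Euclidean projection onto pointwise nonnegativity (clipping) and the projection onto the sphere of a prescribed norm (rescaling), applied once to a plain least-squares Chebyshev fit.

```python
import numpy as np

def project_nonnegative(g):
    """Euclidean projection onto the pointwise constraint g >= 0."""
    return np.maximum(g, 0.0)

def project_sphere(g, radius):
    """Euclidean projection onto the sphere of the given norm (radius > 0)."""
    nrm = np.linalg.norm(g)
    return g * (radius / nrm) if nrm > 0 else g

# A least-squares Chebyshev fit can dip below zero even when the target is >= 0.
x = np.linspace(-1.0, 1.0, 401)
target = np.maximum(np.cos(4 * x), 0.0)                      # nonnegative target
coeffs = np.polynomial.chebyshev.chebfit(x, target, deg=6)
approx = np.polynomial.chebyshev.chebval(x, coeffs)
print("min of plain least squares:", approx.min())           # typically negative

# Crude structure repair: clip, then restore the original (grid) norm.
repaired = project_sphere(project_nonnegative(approx), np.linalg.norm(approx))
print(repaired.min() >= 0.0, np.isclose(np.linalg.norm(repaired), np.linalg.norm(approx)))
```

The paper's algorithms combine these projections inside a well-posed optimization on the sphere; the single clip-and-rescale step above only produces a feasible point, not an optimal one.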
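For item 3 above, the ice-sheet Hessians are far outside snippet scope; the sketch only shows the generic HODLR idea on a synthetic kernel matrix: keep (and recurse on) diagonal blocks, replace off-diagonal blocks by truncated-SVD low-rank factors, and use the compressed tree for matrix-vector products. Block sizes, rank, and the test kernel are arbitrary choices.

```python
import numpy as np

def hodlr_compress(A, rank, leaf_size=64):
    """Recursively compress a square matrix into HODLR form.

    Diagonal blocks are kept (and recursed on); each off-diagonal block is
    replaced by a rank-`rank` truncated SVD factorization (U, V) with block ≈ U @ V.
    """
    n = A.shape[0]
    if n <= leaf_size:
        return {"dense": A.copy()}
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]

    def low_rank(B):
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        k = min(rank, len(s))
        return U[:, :k] * s[:k], Vt[:k, :]        # B ≈ (U diag(s)) @ Vt

    return {
        "diag": (hodlr_compress(A11, rank, leaf_size),
                 hodlr_compress(A22, rank, leaf_size)),
        "off": (low_rank(A12), low_rank(A21)),
    }

def hodlr_matvec(node, x):
    """Multiply a HODLR-compressed matrix with a vector."""
    if "dense" in node:
        return node["dense"] @ x
    (top, bottom), ((U12, V12), (U21, V21)) = node["diag"], node["off"]
    m = U12.shape[0]                              # size of the top block
    x1, x2 = x[:m], x[m:]
    y1 = hodlr_matvec(top, x1) + U12 @ (V12 @ x2)
    y2 = hodlr_matvec(bottom, x2) + U21 @ (V21 @ x1)
    return np.concatenate([y1, y2])

# Synthetic test: a kernel matrix whose off-diagonal blocks are low rank.
pts = np.linspace(0.0, 1.0, 512)
A = np.exp(-np.abs(pts[:, None] - pts[None, :]) / 0.2)
tree = hodlr_compress(A, rank=8)
x = np.random.default_rng(2).standard_normal(512)
print(np.linalg.norm(hodlr_matvec(tree, x) - A @ x) / np.linalg.norm(A @ x))
```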
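For item 4 above, the adaptive multiscale construction and the coefficient thresholding are not reproduced; the sketch only shows the single-scale building block GMRA rests on: partition the data into cells, fit a rank-d affine (local PCA) chart per cell, and encode each point by its d local coefficients. The cell count, d, and the synthetic data are arbitrary, and sklearn's KMeans stands in for the multiscale partition.

```python
import numpy as np
from sklearn.cluster import KMeans

def piecewise_pca_encode(X, n_cells=16, d=2, seed=0):
    """Partition X into cells and fit a rank-d affine (local PCA) chart per cell.

    Returns cell labels, a per-cell (center, basis) dictionary, and the
    d-dimensional coefficients of every point in its own cell's chart.
    """
    labels = KMeans(n_clusters=n_cells, n_init=10, random_state=seed).fit_predict(X)
    charts, coeffs = {}, np.zeros((len(X), d))
    for c in range(n_cells):
        idx = np.where(labels == c)[0]
        center = X[idx].mean(axis=0)
        # Top-d right singular vectors of the centered cell = local PCA basis.
        _, _, Vt = np.linalg.svd(X[idx] - center, full_matrices=False)
        basis = Vt[:d]
        charts[c] = (center, basis)
        coeffs[idx] = (X[idx] - center) @ basis.T
    return labels, charts, coeffs

def piecewise_pca_decode(labels, charts, coeffs):
    """Map the d-dimensional coefficients back to the ambient space."""
    X_hat = np.zeros((len(labels), charts[0][1].shape[1]))
    for c, (center, basis) in charts.items():
        idx = np.where(labels == c)[0]
        X_hat[idx] = center + coeffs[idx] @ basis
    return X_hat

# Noisy 2-dimensional surface embedded in R^10.
rng = np.random.default_rng(3)
t, s = rng.uniform(0, 3 * np.pi, 4000), rng.uniform(0, 5, 4000)
surface = np.column_stack([t * np.cos(t), s, t * np.sin(t)])
X = np.hstack([surface, np.zeros((4000, 7))]) + 0.01 * rng.standard_normal((4000, 10))

labels, charts, coeffs = piecewise_pca_encode(X, n_cells=32, d=2)
X_hat = piecewise_pca_decode(labels, charts, coeffs)
print("relative reconstruction error:", np.linalg.norm(X_hat - X) / np.linalg.norm(X))
```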
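For item 5 above, neither the UDT nor the MSSC_f reduction fits in a snippet; the sketch only makes the objective (in its minimization form: pay the opening costs plus the smallest value found) and the restricted fixed-order threshold policies concrete, brute-forcing the best such policy on a tiny hypothetical correlated instance.

```python
import numpy as np
from itertools import permutations

def policy_cost(order, threshold, scenarios, probs, open_cost):
    """Expected cost of a fixed-order threshold policy for (min-version) Pandora's Box.

    In each scenario, boxes are opened in `order`; we stop as soon as the best
    (smallest) value seen is <= threshold, paying the opening costs incurred
    plus the best value seen.
    """
    total = 0.0
    for values, p in zip(scenarios, probs):
        paid, best = 0.0, np.inf
        for b in order:
            paid += open_cost[b]
            best = min(best, values[b])
            if best <= threshold:
                break
        total += p * (paid + best)
    return total

# Tiny correlated instance: 4 boxes, 3 equally likely scenarios of box values.
scenarios = np.array([[9.0, 1.0, 5.0, 7.0],
                      [2.0, 8.0, 6.0, 7.0],
                      [9.0, 8.0, 1.5, 0.5]])
probs = np.full(3, 1.0 / 3.0)
open_cost = np.array([1.0, 1.0, 1.0, 1.0])

# Brute-force the best fixed order and threshold (feasible only for tiny n).
thresholds = np.unique(scenarios)
best = min((policy_cost(list(o), t, scenarios, probs, open_cost), o, t)
           for o in permutations(range(4)) for t in thresholds)
print("best fixed-order policy:", best)
```

The fully adaptive policies studied in the paper may reorder boxes based on observed values; the brute force above is only over the fixed-order class that the earlier constant-factor results address.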