skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Efficient and Private Marginal Reconstruction with Local Non-Negativity
Differential privacy is the dominant standard for formal and quantifiable privacy and has been used in major deployments that impact millions of people. Many differentially private algorithms for query release and synthetic data contain steps that reconstruct answers to queries from answers to other queries that have been measured privately. Reconstruction is an important subproblem for such mecha- nisms to economize the privacy budget, minimize error on reconstructed answers, and allow for scalability to high-dimensional datasets. In this paper, we introduce a principled and efficient postprocessing method ReM (Residuals-to-Marginals) for reconstructing answers to marginal queries. Our method builds on recent work on efficient mechanisms for marginal query release, based on making measurements using a residual query basis that admits efficient pseudoinversion, which is an important primitive used in reconstruction. An extension GReM-LNN (Gaussian Residuals-to-Marginals with Local Non-negativity) reconstructs marginals under Gaussian noise satisfying consistency and non-negativity, which often reduces error on reconstructed answers. We demonstrate the utility of ReM and GReM-LNN by applying them to improve existing private query answering mechanisms.  more » « less
Award ID(s):
2317232 1931686
PAR ID:
10574594
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Date Published:
ISSN:
0000-0000
Format(s):
Medium: X
Location:
38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Sponsoring Org:
National Science Foundation
More Like this
  1. Privately releasing marginals of a tabular dataset is a foundational problem in differential privacy. However, state-of-the-art mechanisms suffer from a computational bottleneck when marginal estimates are reconstructed from noisy measurements. Recently, residual queries were introduced and shown to lead to highly efficient reconstruction in the batch query answering setting. We introduce new techniques to integrate residual queries into state-of-the-art adaptive mechanisms such as AIM. Our contributions include a novel conceptual framework for residual queries using multi-dimensional arrays, lazy updating strategies, and adaptive optimization of the per-round privacy budget allocation. Together these contributions reduce error, improve speed, and simplify residual query operations. We integrate these innovations into a new mechanism (AIM+GReM), which improves AIM by using fast residual-based reconstruction instead of a graphical model approach. Our mechanism is orders of magnitude faster than the original framework and demonstrates competitive error and greatly improved scalability. 
    more » « less
  2. Noisy marginals are a common form of confi dentiality protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Matrix mechanisms, an important class of privacy mechanisms that provide unbiased noisy an swers to linear queries, are often used to answer marginal queries. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlan ner is both optimal (among matrix mechanisms that use Gaussian noise) and scalable for answering marginal queries, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large scale settings in seconds, even when the previ ous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more com plex workloads that combine marginal and range/prefix sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that an swers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not nec essarily optimal, however it is still extremely scalable and outperforms the prior state-of-the-art (HDMM) on prefix-sum queries both in terms of accuracy and speed. 
    more » « less
  3. null (Ed.)
    In practice, differentially private data releases are designed to support a variety of applications. A data release is fit for use if it meets target accuracy requirements for each application. In this paper, we consider the problem of answering linear queries under differential privacy subject to per-query accuracy constraints. Existing practical frameworks like the matrix mechanism do not provide such fine-grained control (they optimize total error, which allows some query answers to be more accurate than necessary, at the expense of other queries that become no longer useful). Thus, we design a fitness-for-use strategy that adds privacy-preserving Gaussian noise to query answers. The covariance structure of the noise is optimized to meet the fine-grained accuracy requirements while minimizing the cost to privacy. 
    more » « less
  4. Many differentially private algorithms for answering database queries involve a step that reconstructs a discrete data distribution from noisy measurements. This provides consistent query answers and reduces error, but often requires space that grows exponentially with dimension. Private-PGM is a recent approach that uses graphical models to represent the data distribution, with complexity proportional to that of exact marginal inference in a graphical model with structure determined by the co-occurrence of variables in the noisy measurements. Private-PGM is highly scalable for sparse measurements, but may fail to run in high dimensions with dense measurements. We overcome the main scalability limitation of Private-PGM through a principled approach that relaxes consistency constraints in the estimation objective. Our new approach works with many existing private query answering algorithms and improves scalability or accuracy with no privacy cost. 
    more » « less
  5. When analyzing confidential data through a privacy filter, a data scientist often needs to decide which queries will best support their intended analysis. For example, an analyst may wish to study noisy two-way marginals in a dataset produced by a mechanism M 1 . But, if the data are relatively sparse, the analyst may choose to examine noisy one-way marginals, produced by a mechanism M 2 , instead. Since the choice of whether to use M 1 or M 2 is data-dependent, a typical differentially private workflow is to first split the privacy loss budget ρ into two parts: ρ 1 and ρ 2 , then use the first part ρ 1 to determine which mechanism to use, and the remainder ρ 2 to obtain noisy answers from the chosen mechanism. In a sense, the first step seems wasteful because it takes away part of the privacy loss budget that could have been used to make the query answers more accurate. In this paper, we consider the question of whether the choice between M 1 and M 2 can be performed without wasting any privacy loss budget. For linear queries, we propose a method for decomposing M 1 and M 2 into three parts: (1) a mechanism M * that captures their shared information, (2) a mechanism M′1 that captures information that is specific to M 1 , (3) a mechanism M′2 that captures information that is specific to M 2 . Running M * and M′ 1 together is completely equivalent to running M 1 (both in terms of query answer accuracy and total privacy cost ρ ). Similarly, running M * and M′ 2 together is completely equivalent to running M 2 . Since M * will be used no matter what, the analyst can use its output to decide whether to subsequently run M ′ 1 (thus recreating the analysis supported by M 1 )or M′ 2 (recreating the analysis supported by M 2 ), without wasting privacy loss budget. 
    more » « less