Title: AIM: an adaptive and iterative mechanism for differentially private synthetic data
We propose AIM, a new algorithm for differentially private synthetic data generation. AIM is a workload-adaptive algorithm within the paradigm of algorithms that first select a set of queries, then privately measure those queries, and finally generate synthetic data from the noisy measurements. It uses a set of innovative features to iteratively select the most useful measurements, reflecting both their relevance to the workload and their value in approximating the input data. We also provide analytic expressions that bound per-query error with high probability, which can be used to construct confidence intervals and inform users about the accuracy of the generated data. We show empirically that AIM consistently outperforms a wide variety of existing mechanisms across diverse experimental settings.
Award ID(s): 1954814
PAR ID: 10359064
Author(s) / Creator(s): Ryan McKenna; Brett Mullins; Daniel Sheldon; Gerome Miklau
Date Published:
Journal Name: Proceedings of the VLDB Endowment
Volume: 15
Issue: 11
ISSN: 2150-8097
Page Range / eLocation ID: 2599 to 2612
Format(s): Medium: X
Sponsoring Org: National Science Foundation
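The abstract's select/measure/generate loop can be made concrete with a small sketch. The code below is a MWEM-style toy on a one-dimensional histogram, not AIM itself: in particular, the selection step here reads the true data directly, whereas AIM privatizes the selection step too and uses a far more sophisticated data model; every name and parameter is illustrative.

```python
# MWEM-style toy of the select/measure/generate paradigm (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def run_loop(true_hist, workload, rounds=5, sigma=0.05):
    """true_hist: normalized histogram over a small discrete domain.
    workload: list of 0/1 mask vectors (linear counting queries)."""
    model = np.full_like(true_hist, 1.0 / len(true_hist))  # uniform start
    for _ in range(rounds):
        # Select: the query where the model currently errs most. (AIM makes
        # this choice privately and also weighs workload relevance.)
        errors = [abs(q @ true_hist - q @ model) for q in workload]
        q = workload[int(np.argmax(errors))]
        # Measure: answer the selected query with Gaussian noise.
        noisy = q @ true_hist + rng.normal(0.0, sigma)
        # Generate: multiplicative-weights update toward the noisy answer.
        model = model * np.exp(q * (noisy - q @ model) / 2.0)
        model /= model.sum()
    return model  # a distribution from which synthetic data can be sampled

# Made-up demo: domain of size 8, workload of prefix-range queries.
hist = np.array([0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05])
prefixes = [np.concatenate([np.ones(k), np.zeros(8 - k)]) for k in range(1, 8)]
print(run_loop(hist, prefixes).round(3))
```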
More Like this
  1.
    We present three new algorithms for constructing differentially private synthetic data, a sanitized version of a sensitive dataset that approximately preserves the answers to a large collection of statistical queries. All three algorithms are oracle-efficient in the sense that they are computationally efficient when given access to an optimization oracle. Such an oracle can be implemented using many existing (non-private) optimization tools such as sophisticated integer program solvers. While the accuracy of the synthetic data is contingent on the oracle's optimization performance, the algorithms satisfy differential privacy even in the worst case. For all three algorithms, we provide theoretical guarantees for both accuracy and privacy. Through empirical evaluation, we demonstrate that our methods scale well with both the dimensionality of the data and the number of queries. Compared to the state-of-the-art method High-Dimensional Matrix Mechanism (McKenna et al. VLDB 2018), our algorithms provide better accuracy in the large workload and high privacy regime (corresponding to low privacy loss epsilon).
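    As a loose illustration of the oracle-efficient pattern, the sketch below delegates its only expensive step, finding a record that maximizes a weighted score, to an optimization oracle; here a brute-force stand-in plays the oracle's role, where the paper envisions tools such as integer program solvers. This is not one of the paper's three algorithms: the no-regret-style reweighting and all names are placeholders, and no noise addition (hence no privacy) is shown.

```python
# Generic oracle-based loop (illustrative; not the paper's algorithms).
import itertools
import numpy as np

def brute_force_oracle(score, dims):
    """Stand-in optimization oracle: exhaustively find the record that
    maximizes score(x). In practice this would be, e.g., an IP solver."""
    return max(itertools.product(*(range(d) for d in dims)), key=score)

def build_synthetic(queries, target_answers, dims, rounds=20, lr=0.5):
    """queries: functions record -> {0, 1}; target_answers: their (noisy)
    answers on the private data. Returns a list of synthetic records."""
    synthetic, weights = [], np.zeros(len(queries))
    for _ in range(rounds):
        # Oracle call: fetch the record the current query weights favor most.
        x = brute_force_oracle(
            lambda r: sum(w * q(r) for w, q in zip(weights, queries)), dims)
        synthetic.append(x)
        # Reweight each query by its remaining error on the synthetic data.
        syn = np.array([np.mean([q(r) for r in synthetic]) for q in queries])
        weights += lr * (np.array(target_answers) - syn)
    return synthetic

# Made-up demo: two threshold queries over a 3x4 discrete domain.
qs = [lambda r: int(r[0] >= 1), lambda r: int(r[1] >= 2)]
print(build_synthetic(qs, [0.4, 0.7], dims=(3, 4), rounds=10)[:3])
```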
  2. Given a self-join-free conjunctive query Q and a set of tuples S, a synthetic witness D is a database instance such that the result of Q on D is S. In this work, we are interested in two problems. First, the existence problem ESW decides whether any synthetic witness D exists. Second, given that a synthetic witness exists, the minimization problem SSW computes a synthetic witness of minimal size. The SSW problem is related to the smallest witness problem recently studied by Hu and Sintos [22]; however, the objective and the results are inherently different. More specifically, we show that SSW is poly-time solvable for a wider range of queries. Interestingly, in some cases, SSW is related to optimization problems in other domains, such as the role mining problem in data mining and the edge concentration problem in graph drawing. Solutions to ESW and SSW are of practical interest, e.g., for test database generation for applications accessing a database and for data compression by encoding a dataset S as a pair of a query Q and database D. We prove that ESW is in P, presenting a simple algorithm that, given any S, decides whether a synthetic witness exists in polynomial time in the size of S. Next, we focus on the SSW problem. We show an algorithm that computes a minimal synthetic witness in polynomial time with respect to the size of S for any query Q that has the head-domination property. If Q does not have such a property, then SSW is generally hard. More specifically, we show that for the class of path queries (of any constant length), SSW cannot be solved in polynomial time unless P = NP. We then extend this hardness result to the class of Berge-acyclic queries that do not have the head-domination property, obtaining a full dichotomy of SSW for Berge-acyclic queries. Finally, we investigate the hardness of SSW beyond Berge-acyclic queries by showing that SSW cannot be solved in polynomial time for some cyclic queries unless P = NP.
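    To make the definitions concrete, the toy check below verifies the synthetic-witness condition for a self-join-free conjunctive query Q(x, y) :- R(x, z), T(z, y); the query and all tuples are made up for illustration.

```python
# A database instance D = (D_R, D_T) is a synthetic witness for (Q, S)
# exactly when evaluating Q on D returns S. All data here is invented.
def eval_Q(R, T):
    # Q(x, y) :- R(x, z), T(z, y): join R and T on the shared variable z.
    return {(x, y) for (x, z1) in R for (z2, y) in T if z1 == z2}

S = {("a", "b"), ("a", "c")}        # the required query result
D_R = {("a", 1)}                    # candidate witness relation R
D_T = {(1, "b"), (1, "c")}          # candidate witness relation T
assert eval_Q(D_R, D_T) == S        # D is a synthetic witness of size 3
```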
  3. We present a novel method for symbolic regression (SR), the task of searching for compact programmatic hypotheses that best explain a dataset. The problem is commonly solved using genetic algorithms; we show that we can enhance such methods by inducing a library of abstract textual concepts. Our algorithm, called LaSR, uses zero-shot queries to a large language model (LLM) to discover and evolve concepts occurring in known high-performing hypotheses. We discover new hypotheses using a mix of standard evolutionary steps and LLM-guided steps (obtained through zero-shot LLM queries) conditioned on discovered concepts. Once discovered, hypotheses are used in a new round of concept abstraction and evolution. We validate LaSR on the Feynman equations, a popular SR benchmark, as well as a set of synthetic tasks. On these benchmarks, LaSR substantially outperforms a variety of state-of-the-art SR approaches based on deep learning and evolutionary algorithms. Moreover, we show that LaSR can be used to discover a new and powerful scaling law for LLMs.
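    The skeleton below is only a schematic of the concept-guided loop described above, not LaSR itself: call_llm stands in for a zero-shot LLM query, and the mutation, edit, and fitness functions are toy placeholders.

```python
import random
random.seed(0)

def call_llm(prompt):
    """Stand-in for a zero-shot LLM call; a real system would hit an API."""
    return "placeholder concept or edit suggestion"

def mutate(expr):
    return expr + "'"                 # placeholder syntactic mutation

def apply_hint(expr, hint):
    return expr + "*"                 # placeholder concept-guided edit

def evolve(population, fitness, concepts, generations=10):
    for _ in range(generations):
        population.sort(key=fitness)                  # lower is better
        elite = population[: max(1, len(population) // 2)]
        children = []
        for parent in elite:
            if random.random() < 0.5:                 # standard evolutionary step
                children.append(mutate(parent))
            else:                                     # LLM-guided step conditioned on a concept
                hint = call_llm(f"Improve {parent} using: {random.choice(concepts)}")
                children.append(apply_hint(parent, hint))
        # Abstraction phase: distill a new concept from the best hypotheses.
        concepts.append(call_llm(f"Name a concept shared by {elite[:3]}"))
        population = elite + children
    return min(population, key=fitness)

# Toy demo: "fitness" is just expression length.
print(evolve(["x+1", "x*x", "sin(x)"], fitness=len, concepts=["periodicity"]))
```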
  4.
    Modern video data management systems store videos as a single encoded file, which significantly limits possible storage level optimizations. We design, implement, and evaluate TASM, a new tile-based storage manager for video data. TASM uses a feature in modern video codecs called "tiles" that enables spatial random access into encoded videos. TASM physically tunes stored videos by optimizing their tile layouts given the video content and a query workload. Additionally, TASM dynamically tunes that layout in response to changes in the query workload or if the query workload and video contents are incrementally discovered. Finally, TASM also produces efficient initial tile layouts for newly ingested videos. We demonstrate that TASM can speed up subframe selection queries by an average of over 50% and up to 94%. TASM can also improve the throughput of the full scan phase of object detection queries by up to 2×. 
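    The toy function below shows the geometric core of why tiling helps: with spatial random access, a subframe query needs to decode only the tiles its region overlaps. The frame size and 4x4 grid are made-up numbers, and this is not TASM's layout-tuning algorithm.

```python
# Return the (col, row) tiles a query rectangle overlaps; only those
# tiles must be decoded when tiles support spatial random access.
def tiles_for_region(frame_w, frame_h, cols, rows, x0, y0, x1, y1):
    tile_w, tile_h = frame_w / cols, frame_h / rows
    return [(c, r)
            for r in range(rows) for c in range(cols)
            if not (c * tile_w >= x1 or (c + 1) * tile_w <= x0 or
                    r * tile_h >= y1 or (r + 1) * tile_h <= y0)]

# A 300x200 region in the corner of a 1920x1080 frame with a 4x4 grid
# touches 1 of 16 tiles, so ~94% of the frame is never decoded.
print(tiles_for_region(1920, 1080, 4, 4, 0, 0, 300, 200))  # [(0, 0)]
```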
  5. Cormode, Graham; Shekelyan, Michael (Ed.)
    We are given a set 𝒵 = {(R_1,s_1), …, (R_n,s_n)}, where each R_i is a range in ℝ^d, such as a rectangle or a ball, and s_i ∈ [0,1] denotes its selectivity. The goal is to compute a small-size discrete data distribution 𝒟 = {(q₁,w₁), …, (q_m,w_m)}, where q_j ∈ ℝ^d and w_j ∈ [0,1] for each 1 ≤ j ≤ m, and ∑_{1≤j≤m} w_j = 1, such that 𝒟 is the most consistent with 𝒵, i.e., err_p(𝒟,𝒵) = 1/n ∑_{i=1}^n |s_i - ∑_{j=1}^m w_j⋅1(q_j ∈ R_i)|^p is minimized. In a database setting, 𝒵 corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and 𝒟 can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is NP-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time O((n+δ^{-d}) δ^{-2} polylog n), a discrete distribution 𝒟̃ of size O(δ^{-2}), such that err_p(𝒟̃,𝒵) ≤ min_𝒟 err_p(𝒟,𝒵) + δ (for p = 1,2,∞), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate the infeasibility of relative approximations as well as the removal of the exponential dependency on the dimension for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
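    As a concreteness check, the snippet below transcribes the err_p objective from the abstract for axis-aligned boxes; the points, weights, ranges, and selectivities are made-up sample data.

```python
import numpy as np

def err_p(points, weights, ranges, selectivities, p=1):
    """points: (m, d) array of q_j; weights: w_j summing to 1; ranges:
    (lo, hi) box pairs R_i; selectivities: observed s_i in [0, 1]."""
    total = 0.0
    for (lo, hi), s in zip(ranges, selectivities):
        inside = np.all((points >= lo) & (points <= hi), axis=1)
        total += abs(s - weights[inside].sum()) ** p   # |s_i - sum_j w_j 1(q_j in R_i)|^p
    return total / len(ranges)                         # average over the n queries

pts = np.array([[0.2, 0.3], [0.7, 0.8]])               # q_1, q_2
w = np.array([0.6, 0.4])                               # w_1, w_2
boxes = [(np.zeros(2), np.full(2, 0.5)), (np.full(2, 0.5), np.ones(2))]
print(err_p(pts, w, boxes, [0.55, 0.45], p=1))         # 0.05
```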