Title: Covariance’s Loss is Privacy’s Gain: Computationally Efficient, Private and Accurate Synthetic Data
Abstract: The protection of private information is of vital importance in data-driven research, business, and government. The conflict between privacy and utility has triggered intensive research in the computer science and statistics communities, which have developed a variety of methods for privacy-preserving data release. Among the main concepts that have emerged are anonymity and differential privacy. Today, another solution is gaining traction: synthetic data. However, the road to privacy is paved with NP-hard problems. In this paper, we focus on the NP-hard challenge of developing a synthetic data generation method that is computationally efficient, comes with provable privacy guarantees, and rigorously quantifies data utility. We solve a relaxed version of this problem by studying a fundamental, but at first glance completely unrelated, problem in probability concerning the concept of covariance loss. Namely, we find a nearly optimal and constructive answer to the question of how much information is lost when we take a conditional expectation. Surprisingly, this excursion into theoretical probability produces mathematical techniques that allow us to derive constructive, approximately optimal solutions to difficult applied problems concerning microaggregation, privacy, and synthetic data.
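For context, the covariance loss at the heart of the abstract can be read off from the law of total covariance (a standard identity; the framing below is ours, not text from the paper):

```latex
% Law of total covariance: conditioning on a sigma-algebra F splits Cov(X)
% into a between-cell part and a within-cell part.
\[
  \operatorname{Cov}(X)
    = \operatorname{Cov}\!\bigl(\mathbb{E}[X \mid \mathcal{F}]\bigr)
    + \underbrace{\mathbb{E}\bigl[\operatorname{Cov}(X \mid \mathcal{F})\bigr]}_{\text{covariance loss}}
\]
% Replacing X by E[X | F] therefore forfeits exactly E[Cov(X | F)], a positive
% semidefinite matrix; the question is how small this loss can be made when F
% is generated by a coarse partition, as in microaggregation.
```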
Award ID(s):
2140592 1954233 2027299 1934568 2031883
PAR ID:
10356613
Author(s) / Creator(s):
Boedihardjo, March; Strohmer, Thomas; Vershynin, Roman
Date Published:
Journal Name:
Foundations of Computational Mathematics
ISSN:
1615-3375
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Miller, Tim; Hoffman, Robert; Amir, Ofra; Holzinger, Andreas (Ed.)
    There is a growing interest within the AI research community in developing autonomous systems capable of explaining their behavior to users. However, the problem of computing explanations for users with different levels of expertise has received little research attention. We propose an approach for addressing this problem by representing the user's understanding of the task as an abstraction of the domain model that the planner uses. We present algorithms for generating minimal explanations in cases where this abstract human model is not known. We reduce the problem of generating an explanation to a search over the space of abstract models and show that while the complete problem is NP-hard, a greedy algorithm can provide good approximations of the optimal solution. We empirically show that our approach can efficiently compute explanations for a variety of problems, and we also perform user studies to test the utility of state abstractions in explanations.
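    To illustrate the greedy idea in this abstract, here is a minimal sketch (the toy domain, the explicability test, and every name are our assumptions, not the authors' algorithm):

```python
# Hypothetical sketch of the greedy search over abstract models: keep adding
# domain features to the human's (initially empty) abstraction until every
# step of the plan checks out. The toy domain and all names are assumptions.
from collections import namedtuple

Step = namedtuple("Step", ["name", "preconditions"])  # preconditions: frozenset

def steps_valid(plan, features):
    """Number of plan steps whose preconditions are visible in the abstraction."""
    return sum(1 for s in plan if s.preconditions <= features)

def greedy_explanation(plan, all_features):
    """Greedily pick features to reveal; approximates the minimal explanation."""
    revealed = set()
    while steps_valid(plan, revealed) < len(plan):
        # Greedy choice: the feature whose addition validates the most steps.
        # (Raises ValueError if no explanation exists over these features.)
        best = max(all_features - revealed,
                   key=lambda f: steps_valid(plan, revealed | {f}))
        revealed.add(best)
    return revealed

if __name__ == "__main__":
    plan = [Step("unlock", frozenset({"has_key"})),
            Step("enter", frozenset({"has_key", "door_open"}))]
    print(greedy_explanation(plan, {"has_key", "door_open", "weather"}))
    # -> {'has_key', 'door_open'}; the irrelevant feature is never revealed
```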
  2.
    Uncertainty is an omnipresent issue in real-world optimization problems. This paper studies a fundamental problem concerning uncertainty, known as the β-robust scheduling problem. Given a set of identical machines and a set of jobs whose processing times follow a normal distribution, the goal is to assign jobs to machines such that the probability that all the jobs are completed by a given common due date is maximized. We give the first systematic study of the complexity of and algorithms for this problem. A strong negative result is shown by ruling out the existence of any polynomial-time algorithm with a constant approximation ratio for the general problem unless P=NP. On the positive side, we provide the first FPT-AS (fixed-parameter tractable approximation scheme) parameterized by the number of different kinds of jobs, which is a common parameter in scheduling problems. It returns a solution arbitrarily close to the optimal solution, provided that the job processing times follow a few different types of distributions. We further complement the theoretical results by implementing our algorithm. The experiments demonstrate that by choosing an appropriate approximation ratio, the algorithm can efficiently compute a near-optimal solution.
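    To make the stated objective concrete, here is a minimal sketch under the paper's normality assumption (the function names, the independence assumptions, and the toy numbers are ours):

```python
# Sketch of the β-robust objective for a *fixed* assignment: the jobs on one
# machine finish by the due date d with probability Phi((d - mu) / sigma),
# since a sum of independent normal processing times is again normal.
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_all_done(assignment, d):
    """assignment: list of machines, each a list of (mu, sigma) job pairs.
    Returns the probability that every machine finishes by the due date d,
    assuming jobs and machines are independent."""
    p = 1.0
    for jobs in assignment:
        mu = sum(m for m, _ in jobs)
        var = sum(s * s for _, s in jobs)
        p *= phi((d - mu) / math.sqrt(var)) if var > 0 else float(mu <= d)
    return p

# Example: two machines, due date 10.
print(prob_all_done([[(4, 1), (3, 1)], [(6, 2)]], d=10))  # ~0.96
```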
  3. We consider the problem of timely exchange of updates between a central station and a set of ground terminals V, via a mobile agent that traverses across the ground terminals along a mobility graph G = (V, E). We design the trajectory of the mobile agent to minimize peak and average age of information (AoI), two newly proposed metrics for measuring timeliness of information. We consider randomized trajectories, in which the mobile agent travels from terminal i to terminal j with probability P_{i,j}. For the information gathering problem, we show that a randomized trajectory is peak age optimal and factor-8H average age optimal, where H is the mixing time of the randomized trajectory on the mobility graph G. We also show that the average age minimization problem is NP-hard. For the information dissemination problem, we prove that the same randomized trajectory is factor-O(H) peak and average age optimal. Moreover, we propose an age-based trajectory, which utilizes information about the current age at terminals, and show that it is factor-2 average age optimal in a symmetric setting.
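    As a rough reading of the randomized-trajectory model, the following simulation sketch tracks peak and average age under a transition matrix P (the unit-time step, the age-accounting convention, and all names are our assumptions, not the paper's analysis):

```python
# Hypothetical simulation: the agent hops i -> j with probability P[i][j],
# and visiting a terminal resets its age to zero. The matrix is made up.
import random

def simulate_age(P, steps=100_000, seed=0):
    rng = random.Random(seed)
    n = len(P)
    age = [0] * n          # current age of each terminal's information
    agent = 0
    total_age, peak = 0, 0
    for _ in range(steps):
        # One time step: all ages grow, then the visited terminal refreshes.
        age = [a + 1 for a in age]
        peak = max(peak, age[agent])   # age found on arrival = a peak
        age[agent] = 0
        total_age += sum(age)
        # Move to the next terminal according to the transition row.
        agent = rng.choices(range(n), weights=P[agent])[0]
    return total_age / (steps * n), peak

P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
avg, peak = simulate_age(P)
print(f"average age ~ {avg:.2f}, peak age = {peak}")
```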
  4.
    Consider the following two fundamental open problems in complexity theory: • Does a hard-on-average language in NP imply the existence of one-way functions? • Does a hard-on-average language in NP imply a hard-on-average problem in TFNP (i.e., the class of total NP search problems)? Our main result is that the answer to (at least) one of these questions is yes. Both one-way functions and problems in TFNP can be interpreted as promise-true distributional NP search problems, namely, distributional search problems where the sampler only samples true statements. As a direct corollary of the above result, we thus get that the existence of a hard-on-average distributional NP search problem implies a hard-on-average promise-true distributional NP search problem. In other words, it is no easier to find witnesses (a.k.a. proofs) for efficiently-sampled statements (theorems) that are guaranteed to be true. This result follows from a more general study of interactive puzzles, a generalization of average-case hardness in NP, and in particular, a novel round-collapse theorem for computationally sound protocols, analogous to Babai-Moran's celebrated round-collapse theorem for information-theoretically sound protocols. As another consequence of this treatment, we show that the existence of O(1)-round public-coin non-trivial arguments (i.e., argument systems that are not proofs) implies the existence of a hard-on-average problem in NP/poly.
  5. Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce OTClean, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints. 
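    For flavor, a generic Sinkhorn matrix-scaling iteration for entropy-regularized optimal transport looks roughly like this (a textbook-style sketch of the technique the abstract's iterative algorithm is inspired by, not the authors' OTClean code; the cost matrix and marginals are toy data):

```python
# Generic Sinkhorn iteration: alternately rescale rows and columns of the
# Gibbs kernel K = exp(-C/eps) so the transport plan matches the prescribed
# marginals r and c. Toy data only, not OTClean.
import numpy as np

def sinkhorn(C, r, c, eps=0.1, iters=500):
    K = np.exp(-C / eps)             # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(iters):
        v = c / (K.T @ u)            # match column marginals
        u = r / (K @ v)              # match row marginals
    P = u[:, None] * K * v[None, :]  # transport plan
    return P, float((P * C).sum())   # plan and its transport cost

rng = np.random.default_rng(0)
C = rng.random((4, 4))                     # toy cost matrix
r = np.full(4, 0.25); c = np.full(4, 0.25) # uniform marginals
P, cost = sinkhorn(C, r, c)
print(np.round(P, 3), cost)
```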