Title: The Fienberg Problem: How to Allow Human Interactive Data Analysis in the Age of Differential Privacy
Differential privacy (DP) is a popular technology for privacy-preserving analysis of large datasets. DP is powerful, but it requires that the analyst interact with data only through a special interface; in particular, the analyst does not see raw data, an uncomfortable situation for anyone trained in classical statistical data analysis. In this note we discuss the (overly) simple problem of allowing a trusted analyst to choose an "interesting" statistic for popular release (the actual computation of the chosen statistic will be carried out in a differentially private way).
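The workflow the abstract describes, where a trusted analyst picks a statistic and only a noisy version is published, can be sketched with the classical Laplace mechanism. A minimal illustration, assuming a counting query of sensitivity 1; the function name and the toy data are hypothetical, not from the paper:

```python
import numpy as np

def laplace_release(value, sensitivity, epsilon, rng=None):
    """Release `value` with Laplace noise of scale sensitivity/epsilon,
    the standard calibration for epsilon-differential privacy."""
    rng = np.random.default_rng() if rng is None else rng
    return value + rng.laplace(0.0, sensitivity / epsilon)

# The trusted analyst inspects the data and picks a count statistic;
# only its noisy value leaves the privacy barrier.
data = np.array([1, 0, 1, 1, 0, 1])   # hypothetical binary attribute
chosen_stat = data.sum()              # a count has sensitivity 1
noisy = laplace_release(chosen_stat, sensitivity=1.0, epsilon=1.0)
```

The choice of statistic itself also leaks information about the data, which is precisely the subtlety the note examines.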
Award ID(s):
1763665
PAR ID:
10217353
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of Privacy and Confidentiality
Volume:
8
Issue:
1
ISSN:
2575-8527
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. When analyzing confidential data through a privacy filter, a data scientist often needs to decide which queries will best support their intended analysis. For example, an analyst may wish to study noisy two-way marginals in a dataset produced by a mechanism M1. But, if the data are relatively sparse, the analyst may choose to examine noisy one-way marginals, produced by a mechanism M2, instead. Since the choice of whether to use M1 or M2 is data-dependent, a typical differentially private workflow is to first split the privacy loss budget ρ into two parts, ρ1 and ρ2, then use the first part ρ1 to determine which mechanism to use, and the remainder ρ2 to obtain noisy answers from the chosen mechanism. In a sense, the first step seems wasteful because it takes away part of the privacy loss budget that could have been used to make the query answers more accurate. In this paper, we consider the question of whether the choice between M1 and M2 can be performed without wasting any privacy loss budget. For linear queries, we propose a method for decomposing M1 and M2 into three parts: (1) a mechanism M* that captures their shared information, (2) a mechanism M′1 that captures information specific to M1, and (3) a mechanism M′2 that captures information specific to M2. Running M* and M′1 together is completely equivalent to running M1 (both in terms of query answer accuracy and total privacy cost ρ). Similarly, running M* and M′2 together is completely equivalent to running M2. Since M* will be used no matter what, the analyst can use its output to decide whether to subsequently run M′1 (thus recreating the analysis supported by M1) or M′2 (recreating the analysis supported by M2), without wasting privacy loss budget.
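The two-step workflow this abstract calls wasteful, spending ρ1 to pick a mechanism and ρ2 to answer with it, can be sketched under zero-concentrated DP (zCDP), where the Gaussian mechanism uses noise scale σ = Δ/√(2ρ). A minimal sketch of the baseline the paper improves upon; the 0.2/0.8 split, the decision rule, and the toy marginal are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gaussian_mechanism(vec, sensitivity, rho, rng):
    """zCDP Gaussian mechanism: sigma = sensitivity / sqrt(2 * rho)."""
    sigma = sensitivity / np.sqrt(2 * rho)
    return vec + rng.normal(0.0, sigma, size=len(vec))

rng = np.random.default_rng(0)
rho_total = 1.0
rho1, rho2 = 0.2 * rho_total, 0.8 * rho_total  # illustrative split

one_way = np.array([40.0, 60.0])  # hypothetical one-way marginal counts
# Step 1: spend rho1 on a noisy peek to decide which mechanism to run.
peek = gaussian_mechanism(one_way, sensitivity=1.0, rho=rho1, rng=rng)
sparse = peek.min() < 5.0  # toy data-dependent decision rule
# Step 2: spend the remaining rho2 on the chosen mechanism's answers;
# the two steps together consume rho1 + rho2 = rho_total.
answer = gaussian_mechanism(one_way, sensitivity=1.0, rho=rho2, rng=rng)
```

The paper's decomposition into M*, M′1, and M′2 lets the peek in step 1 double as part of the final answer, so none of ρ is spent purely on the decision.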
  2. Hypothesis testing is one of the most common types of data analysis and forms the backbone of scientific research in many disciplines. Analysis of variance (ANOVA) in particular is used to detect dependence between a categorical and a numerical variable. Here we show how one can carry out this hypothesis test under the restrictions of differential privacy. We show that the F-statistic, the optimal test statistic in the public setting, is no longer optimal in the private setting, and we develop a new test statistic F1 with much higher statistical power. We show how to rigorously compute a reference distribution for the F1 statistic and give an algorithm that outputs accurate p-values. We implement our test and experimentally optimize several parameters. We then compare our test to the only previous work on private ANOVA testing, using the same effect size as that work. We see an order of magnitude improvement, with our test requiring only 7% as much data to detect the effect.
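To make the setting concrete: a naive private ANOVA perturbs the between-group and within-group sums of squares and forms the usual F ratio. This is a sketch of that baseline only; it is not the paper's F1 statistic, and the noise scales below are rough placeholders rather than calibrated sensitivities:

```python
import numpy as np

def noisy_anova_F(groups, epsilon, rng, lo=0.0, hi=1.0):
    """Illustrative private ANOVA: perturb the between-group (SSA) and
    within-group (SSE) sums of squares with Laplace noise, then form
    the classical F ratio.  Assumes data are clipped to [lo, hi]."""
    all_vals = np.concatenate(groups)
    grand = all_vals.mean()
    k, n = len(groups), len(all_vals)
    ssa = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
    width = hi - lo
    ssa += rng.laplace(0.0, 2 * width ** 2 / epsilon)  # half the budget
    sse += rng.laplace(0.0, 2 * width ** 2 / epsilon)  # other half
    return (ssa / (k - 1)) / (sse / (n - k))
```

Because the noisy F no longer follows the classical F distribution, p-values must come from a recomputed reference distribution, which is exactly the machinery the abstract describes for F1.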
  3. Early detection of where and when fatal infectious disease outbreaks occur is of critical importance to public health. To effectively detect, analyze, and intervene in the spread of diseases, people's health status and location information should be collected in a timely manner. However, conventional practice relies on surveys or field health workers, which are highly costly and pose serious privacy threats to participants. In this paper, we propose, for the first time, to exploit ubiquitous cloud services to collect users' multi-dimensional data in a secure and privacy-preserving manner and to enable the analysis of infectious disease. Specifically, we target spatial clustering analysis using the Kulldorff scan statistic and propose a key-oblivious inner product encryption (KOIPE) mechanism to ensure that the untrusted entity obtains only the statistic instead of individuals' data. Furthermore, we design an anonymous and sybil-resilient approach that protects the data collection process from double-registration attacks while preserving participants' privacy against untrusted cloud servers. A rigorous and comprehensive security analysis validates our design, and we conduct extensive simulations based on real-life datasets to demonstrate the performance of our scheme in terms of communication and computation overhead.
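The goal in this abstract is that the untrusted server learns only the scan statistic. For a Poisson model, Kulldorff's scan statistic for a single candidate zone reduces to a log-likelihood ratio; a minimal sketch of that plaintext computation (the KOIPE encryption layer is omitted, and the inputs are hypothetical):

```python
import math

def kulldorff_llr(c, e, total):
    """Poisson log-likelihood ratio for one candidate spatial cluster:
    c = observed cases inside the zone, e = expected cases inside,
    total = total observed cases.  Zones that are not elevated
    (c <= e) score zero; assumes 0 < e < total."""
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    outside = 0.0
    if total > c:
        outside = (total - c) * math.log((total - c) / (total - e))
    return inside + outside
```

The scan maximizes this ratio over candidate zones; in the paper's scheme, the server would evaluate it over encrypted per-user contributions without ever seeing an individual record.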
  4. In many applications of two‐component mixture models such as the popular zero‐inflated model for discrete‐valued data, it is customary for the data analyst to evaluate the inherent heterogeneity in view of observed data. To this end, the score test, acclaimed for its simplicity, is routinely performed. It has long been recognised that this test may behave erratically under model misspecification, but the implications of this behaviour remain poorly understood for popular two‐component mixture models. For the special case of zero‐inflated count models, we use data simulations and theoretical arguments to evaluate this behaviour and discuss its implications in settings where the working model is restrictive with regard to the true data‐generating mechanism. We enrich this discussion with an analysis of count data in HIV research, where a one‐component model is shown to fit the data reasonably well despite apparent extra zeros. These results suggest that a rejection of homogeneity does not imply that the underlying mixture model is appropriate. Rather, such a rejection simply implies that the mixture model should be carefully interpreted in the light of potential model misspecifications, and further evaluated against other competing models.
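For context on the test being studied: one common textbook form of the score statistic for zero inflation, in an intercept-only Poisson working model, compares the observed number of zeros with the number the fitted Poisson predicts. This sketch shows that form only; it is not the authors' simulation setup, and the exact statistic used in the paper may differ:

```python
import numpy as np

def zip_score_test(y):
    """Score test for zero inflation in an intercept-only Poisson model
    (one textbook form): S = (n0/p0 - n)^2 / (n*(1 - p0)/p0 - n*ybar)
    with p0 = exp(-ybar), compared against a chi-square(1) reference."""
    y = np.asarray(y, dtype=float)
    n, ybar = len(y), y.mean()
    p0 = np.exp(-ybar)          # Poisson-predicted probability of a zero
    n0 = np.sum(y == 0)         # observed number of zeros
    num = (n0 / p0 - n) ** 2
    den = n * (1 - p0) / p0 - n * ybar
    return num / den
```

The abstract's warning applies directly here: a large value of S rejects homogeneity, but under misspecification it need not validate the zero-inflated mixture itself.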
  5. Although outsourcing data to cloud storage has become popular, increasing concerns about data security and privacy in the cloud limit broader cloud adoption. Ensuring data security and privacy, therefore, is crucial for better and broader adoption of the cloud. This tutorial provides a comprehensive analysis of the state of the art in data security and privacy for outsourced data. We aim to cover common security and privacy threats to outsourced data, along with relevant novel schemes and techniques and their design choices regarding security, privacy, functionality, and performance. Our explicit focus is on recent schemes from both the database and the cryptography and security communities that enable query processing over encrypted data and access-oblivious cloud storage systems.