
Title: Differentially private data release via statistical election to partition sequentially
Differential Privacy (DP) formalizes privacy in mathematical terms and provides a robust concept for privacy protection. DIfferentially Private Data Synthesis (DIPS) techniques produce and release synthetic individual-level data in the DP framework. One key challenge in developing DIPS methods is preserving the statistical utility of the synthetic data, especially in high-dimensional settings. We propose a new DIPS approach, STatistical Election to Partition Sequentially (STEPS), that partitions data by attributes in the order of their importance ranks, as determined by either a practical or a statistical importance measure. STEPS aims to better preserve the original information for the attributes with higher importance ranks and thus to produce more useful synthetic data overall. We present an algorithm to implement the STEPS procedure and use the composability of the privacy budget to ensure that the overall privacy cost is controlled at the pre-specified value. We apply the STEPS procedure to both simulated data and the 2000–2012 Current Population Survey youth voter data. The results suggest that STEPS can better preserve the population-level information and the original information for some analyses than PrivBayes, a modified Uniform histogram approach, and the flat Laplace sanitizer.
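The abstract names the flat Laplace sanitizer as a baseline but does not describe its configuration. As a rough illustration only, the sketch below shows one generic way such a baseline is often built: add Laplace noise to every cell of the full cross-tabulation, clip negative counts to zero, and sample synthetic records in proportion to the noisy counts. The function names, the toy data, and the sensitivity of 1 (neighboring datasets differing by one added or removed record) are assumptions made for this example, not details taken from the paper.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def flat_laplace_synthesize(records, domains, epsilon, n_synth, rng):
    """Hypothetical helper: sanitize the full histogram, then sample from it."""
    cells = list(product(*domains))  # every cell of the full cross-tabulation
    counts = Counter(records)        # missing cells count as 0
    # Assumption: neighbors differ by one added/removed record, so the
    # L1 sensitivity of the histogram is 1 and the Laplace scale is 1/epsilon.
    scale = 1.0 / epsilon
    # Clipping at zero is post-processing and does not consume privacy budget.
    noisy = [max(counts[c] + laplace_noise(scale, rng), 0.0) for c in cells]
    if sum(noisy) == 0.0:
        noisy = [1.0] * len(cells)   # fall back to uniform sampling
    return rng.choices(cells, weights=noisy, k=n_synth)


rng = random.Random(0)
records = [(0, "a")] * 60 + [(1, "b")] * 40   # toy data, two attributes
domains = [[0, 1], ["a", "b"]]
synthetic = flat_laplace_synthesize(records, domains, epsilon=1.0,
                                    n_synth=100, rng=rng)
```

Because the number of cells grows multiplicatively with the number of attributes, the noise in a flat histogram is spread over many sparse cells, which is one reason utility preservation becomes hard in high-dimensional settings.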
Award ID(s):
1546373 1717417
Sponsoring Org:
National Science Foundation
More Like this
  1. Many social networks contain sensitive relational information. One approach to protect the sensitive relational information while offering flexibility for social network research and analysis is to release synthetic social networks at a pre-specified privacy risk level, given the original observed network. We propose the DP-ERGM procedure, which synthesizes networks that satisfy differential privacy (DP) via the exponential random graph model (ERGM). We apply DP-ERGM to a college student friendship network and compare its original network information preservation in the generated private networks with two other approaches: differentially private DyadWise Randomized Response (DWRR) and Sanitization of the Conditional probability of Edge given Attribute classes (SCEA). The results suggest that DP-ERGM preserves the original information significantly better than DWRR and SCEA in both network statistics and inferences from ERGMs and latent space models. In addition, DP-ERGM satisfies node DP, a stronger notion of privacy than the edge DP that DWRR and SCEA satisfy.
  2. Organizations often collect private data and release aggregate statistics for the public’s benefit. If no steps toward preserving privacy are taken, adversaries may use released statistics to deduce unauthorized information about the individuals described in the private dataset. Differentially private algorithms address this challenge by slightly perturbing underlying statistics with noise, thereby mathematically limiting the amount of information that may be deduced from each data release. Properly calibrating these algorithms, and in turn the disclosure risk for people described in the dataset, requires a data curator to choose a value for a privacy budget parameter, ɛ. However, there is little formal guidance for choosing ɛ, a task that requires reasoning about the probabilistic privacy–utility tradeoff. Furthermore, choosing ɛ in the context of statistical inference requires reasoning about accuracy trade-offs in the presence of both measurement error and differential privacy (DP) noise. We present Visualizing Privacy (ViP), an interactive interface that visualizes relationships between ɛ, accuracy, and disclosure risk to support setting and splitting ɛ among queries. As a user adjusts ɛ, ViP dynamically updates visualizations depicting expected accuracy and risk. ViP also has an inference setting, allowing a user to reason about the impact of DP noise on statistical inferences. Finally, we present results of a study where 16 research practitioners with little to no DP background completed a set of tasks related to setting ɛ using both ViP and a control. We find that ViP helps participants more correctly answer questions related to judging the probability of where a DP-noised release is likely to fall and comparing between DP-noised and non-private confidence intervals.
  3. We report on the first molecular estimates of phylogenetic relationships of Brachymeles dalawangdaliri (Scincidae) and Pseudogekko isapa (Gekkonidae), and present new data on phenotypic variation in these two poorly known taxa, endemic to the Romblon Island Group of the central Philippines. Because both species were recently described on the basis of few, relatively older, museum specimens collected in the early 1970s (when preservation of genetic material was not yet standard practice in biodiversity field inventories), neither taxon has ever been included in modern molecular phylogenetic analyses. Likewise, because the original type series for each species consisted of only a few specimens, biologists have been unable to assess standard morphological variation in either taxon, or statistically assess the importance of characters contributing to their diagnoses and identification. Here we ameliorate both historical shortfalls. First, our new genetic data allowed us to perform novel molecular phylogenetic analyses aimed at elucidating the evolutionary relationships of these lineages; secondly, with population level phenotypic data, from the first statistical sample collected for either species, and including adults of both sexes. We reaffirm the distinctiveness of both named taxa as valid species, amend their diagnoses to facilitate the recognition of both, distinguish them from congeners, and consider the biogeographic affinities of both lineages. Our contribution emphasizes the conservation significance of Sibuyan Island’s Mt. Guiting-Guiting Natural Park, the diverse, idiosyncratic biogeographic histories of its variably-assembled, highly endemic reptile fauna, and the critical importance of multiple, repeated, survey–resurvey studies for understanding forest community species composition and the evolutionary history of Philippine biodiversity.
  4. Bailey, Michael; Greenstadt, Rachel (Ed.)
    In differential privacy (DP), a challenging problem is to generate synthetic datasets that efficiently capture the useful information in the private data. The synthetic dataset enables any task to be done without privacy concern and modification to existing algorithms. In this paper, we present PrivSyn, the first automatic synthetic data generation method that can handle general tabular datasets (with 100 attributes and domain size > 2500). PrivSyn is composed of a new method to automatically and privately identify correlations in the data, and a novel method to generate sample data from a dense graphic model. We extensively evaluate different methods on multiple datasets to demonstrate the performance of our method.
  5. Protection of individual privacy is a common concern when releasing and sharing data and information. Differential privacy (DP) formalizes privacy in probabilistic terms without making assumptions about the background knowledge of data intruders, and thus provides a robust concept for privacy protection. Practical applications of DP involve development of differentially private mechanisms to generate sanitized results at a pre-specified privacy budget. For the sanitization of statistics with publicly known bounds such as proportions and correlation coefficients, the bounding constraints will need to be incorporated in the differentially private mechanisms. There has been little work on examining the consequences of the bounding constraints on the accuracy of sanitized results and the statistical inferences of the population parameters based on the sanitized results. In this paper, we formalize the differentially private truncated and boundary inflated truncated (BIT) procedures for releasing statistics with publicly known bounding constraints. The impacts of the truncated and BIT Laplace procedures on the statistical accuracy and validity of sanitized statistics are evaluated both theoretically and empirically via simulation studies.
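
The last item above contrasts truncated and boundary inflated truncated (BIT) Laplace procedures for statistics with publicly known bounds. The abstract does not give the formal definitions, so the following is only a minimal sketch under a common reading of those terms: the truncated variant redraws the sanitized value until it falls inside the bounds, while the BIT variant sets an out-of-bound value to the nearest boundary. The function names, parameters, and example values are hypothetical, and the sketch does not address whether the noise scale needs adjusting to retain the privacy guarantee under truncation.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def truncated_laplace(value, lower, upper, sensitivity, epsilon, rng):
    """Assumed reading of 'truncated': redraw until the result is in bounds."""
    if not lower <= value <= upper:
        raise ValueError("value must lie within [lower, upper]")
    scale = sensitivity / epsilon
    while True:
        out = value + laplace_noise(scale, rng)
        if lower <= out <= upper:
            return out


def bit_laplace(value, lower, upper, sensitivity, epsilon, rng):
    """Assumed reading of 'BIT': move out-of-bound results to the boundary."""
    if not lower <= value <= upper:
        raise ValueError("value must lie within [lower, upper]")
    scale = sensitivity / epsilon
    out = value + laplace_noise(scale, rng)
    return min(max(out, lower), upper)


rng = random.Random(0)
# Toy example: a proportion of 0.9 from n = 100 records (sensitivity 1/n assumed).
trunc_draws = [truncated_laplace(0.9, 0.0, 1.0, 1 / 100, 0.5, rng) for _ in range(1000)]
bit_draws = [bit_laplace(0.9, 0.0, 1.0, 1 / 100, 0.5, rng) for _ in range(1000)]
```

Under this reading, BIT places point masses at the two boundaries, whereas the truncated variant renormalizes the density inside the bounds; that difference is what drives the differing effects on accuracy and inference the abstract refers to.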