


Title: The promise and limitations of formal privacy
Abstract

Differential privacy (DP) is in our smartphones, web browsers, social media, and the federal statistics used to allocate billions of dollars. Despite the mathematical concept being only 17 years old, differential privacy has amassed a rapidly growing list of real-world applications, such as data releases by Meta and the US Census Bureau. Why is DP so pervasive? DP is currently the only mathematical framework that provides a finite and quantifiable bound on disclosure risk when releasing information from confidential data. Previous concepts of data privacy and confidentiality required various assumptions about how a bad actor might attack sensitive data. DP is often called formally private because statisticians can mathematically prove the worst-case privacy loss that could result from releasing information based on the confidential data. Although DP ushered in a new era of data privacy and confidentiality methodologies, many researchers and data practitioners criticize differentially private frameworks. In this paper, we provide readers with a critical overview of the current state-of-the-art research on formal privacy methodologies and various relevant perspectives, challenges, and opportunities.
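To make the guarantee concrete, the sketch below shows the canonical Laplace mechanism for releasing a single count with a provable worst-case privacy loss of epsilon. It is a generic illustration rather than anything specific to this article; the function name, the example count, and the choice of epsilon = 0.5 are assumptions made for the example.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with epsilon-DP by adding Laplace(sensitivity / epsilon) noise.

    For a counting query, adding or removing one person changes the count by at most 1,
    so sensitivity = 1 and the worst-case privacy loss is bounded by epsilon.
    """
    rng = np.random.default_rng() if rng is None else rng
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Illustrative example: release a confidential count of 1234 at epsilon = 0.5.
noisy_count = laplace_mechanism(true_value=1234, sensitivity=1.0, epsilon=0.5)
print(round(noisy_count))
```

Because the noise scale grows as sensitivity / epsilon, smaller privacy-loss budgets produce noisier releases, which is the accuracy trade-off behind many of the criticisms the abstract alludes to.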

This article is categorized under:

Applications of Computational Statistics > Defense and National Security

 
NSF-PAR ID: 10419159
Author(s) / Creator(s): ;
Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name: WIREs Computational Statistics
Volume: 15
Issue: 6
ISSN: 1939-5108
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like This
  1. Data sets and statistics about groups of individuals are increasingly collected and released, feeding many optimization and learning algorithms. In many cases, the released data contain sensitive information whose privacy is strictly regulated. For example, in the U.S., census data are regulated under Title 13, which requires that no individual be identified from any data released by the Census Bureau. In Europe, data release is regulated according to the General Data Protection Regulation, which addresses the control and transfer of personal data. Differential privacy has emerged as the de facto standard to protect data privacy. In a nutshell, differentially private algorithms protect an individual’s data by injecting random noise into the output of a computation that involves such data. While this process ensures privacy, it also impacts the quality of data analysis, and, when private data sets are used as inputs to complex machine learning or optimization tasks, they may produce results that are fundamentally different from those obtained on the original data and even raise unintended bias and fairness concerns. In this talk, I will first focus on the challenge of releasing privacy-preserving data sets for complex data analysis tasks. I will introduce the notion of Constrained-based Differential Privacy (C-DP), which allows casting the data release problem as an optimization problem whose goal is to preserve the salient features of the original data. I will review several applications of C-DP in the context of very large hierarchical census data, data streams, and energy systems, and in the design of federated data-sharing protocols. Next, I will discuss how errors induced by differential privacy algorithms may propagate within a decision problem, causing biases and fairness issues. This is particularly important as privacy-preserving data is often used for critical decision processes, including the allocation of funds and benefits to states and jurisdictions, which ideally should be fair and unbiased. Finally, I will conclude with a roadmap to future work and some open questions.
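As a toy illustration of the bias concern described above (not the C-DP approach from the talk), the sketch below shows how Laplace noise added to counts can interact with a threshold-based allocation rule: a jurisdiction whose true count sits just above the eligibility cutoff loses its benefit in a large fraction of noisy releases, while a large jurisdiction is essentially unaffected. The threshold, population sizes, and epsilon are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

def allocate(count, threshold=100):
    """Toy policy: a jurisdiction receives funds only if its reported count meets the threshold."""
    return count >= threshold

# Two jurisdictions whose true counts both qualify under the threshold rule.
true_counts = {"small_town": 102, "large_city": 50_000}
epsilon, sensitivity = 0.1, 1.0
scale = sensitivity / epsilon

# Estimate how often each jurisdiction loses eligibility purely because of DP noise.
trials = 10_000
for name, count in true_counts.items():
    noisy = count + rng.laplace(0.0, scale, size=trials)
    lost = np.mean([allocate(count) and not allocate(n) for n in noisy])
    print(f"{name}: loses eligibility in {lost:.1%} of noisy releases")
```

The resulting disparity comes from the noise rather than the policy, which is exactly the kind of downstream fairness effect the talk examines.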
  2. Protection of individual privacy is a common concern when releasing and sharing data and information. Differential privacy (DP) formalizes privacy in probabilistic terms without making assumptions about the background knowledge of data intruders, and thus provides a robust concept for privacy protection. Practical applications of DP involve development of differentially private mechanisms to generate sanitized results at a pre-specified privacy budget. For the sanitization of statistics with publicly known bounds such as proportions and correlation coefficients, the bounding constraints will need to be incorporated in the differentially private mechanisms. There has been little work on examining the consequences of the bounding constraints on the accuracy of sanitized results and the statistical inferences of the population parameters based on the sanitized results. In this paper, we formalize the differentially private truncated and boundary inflated truncated (BIT) procedures for releasing statistics with publicly known bounding constraints. The impacts of the truncated and BIT Laplace procedures on the statistical accuracy and validity of sanitized statistics are evaluated both theoretically and empirically via simulation studies. 
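The two constructions named above can be sketched as simple variants of a Laplace release for a statistic with publicly known bounds. The snippet below is only meant to show the mechanics; the function names, the resampling implementation of truncation, and the example proportion are assumptions, and the privacy and accuracy consequences of each choice are exactly what the paper analyzes.

```python
import numpy as np

rng = np.random.default_rng(1)

def truncated_laplace(stat, lower, upper, epsilon, sensitivity):
    """Truncated procedure: resample Laplace noise until the sanitized value lies in [lower, upper]."""
    scale = sensitivity / epsilon
    while True:
        value = stat + rng.laplace(0.0, scale)
        if lower <= value <= upper:
            return value

def bit_laplace(stat, lower, upper, epsilon, sensitivity):
    """Boundary inflated truncated (BIT) procedure: out-of-range draws are pushed onto the
    boundaries, i.e., the sanitized value is clipped to lower or upper."""
    scale = sensitivity / epsilon
    return float(np.clip(stat + rng.laplace(0.0, scale), lower, upper))

# Illustrative example: sanitize a proportion (bounded in [0, 1]) computed from n = 200 records.
p_hat, n, eps = 0.93, 200, 0.5
print(truncated_laplace(p_hat, 0.0, 1.0, eps, sensitivity=1.0 / n))
print(bit_laplace(p_hat, 0.0, 1.0, eps, sensitivity=1.0 / n))
```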
  3. Ruiz, Francisco (Ed.)
    Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$ which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$. In the above, $S$, $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves privacy for free asymptotically as $T\rightarrow \infty$. Our techniques — which could be of independent interest — include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case. 
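A much simpler toy can help picture the ingredients named above: add Laplace noise to the per-(state, action) visitation counts and compute optimism bonuses from the noisy counts. This is not the paper's algorithm, which uses Bernstein-type bonuses and careful joint-DP accounting; it is only a sketch of what releasing visitation statistics privately might look like, with every constant below chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(2)

def private_visit_counts(counts, epsilon):
    """Release per-(state, action) visitation counts with Laplace noise (toy illustration only;
    the paper's mechanism and its privacy accounting are more involved)."""
    noisy = counts + rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    return np.maximum(noisy, 1.0)  # keep counts positive so the bonuses stay finite

def exploration_bonus(noisy_counts, horizon):
    """Optimism bonus that shrinks as the (noisy) visitation counts grow."""
    return horizon * np.sqrt(1.0 / noisy_counts)

counts = np.array([[5.0, 40.0], [0.0, 12.0]])  # toy problem with S = 2 states and A = 2 actions
noisy = private_visit_counts(counts, epsilon=1.0)
print(exploration_bonus(noisy, horizon=10))
```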
  4. A reconstruction attack on a private dataset D takes as input some publicly accessible information about the dataset and produces a list of candidate elements of D. We introduce a class of data reconstruction attacks based on randomized methods for nonconvex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of D from aggregate query statistics Q(D) ∈ ℝ^m but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset D was sampled, demonstrating that they are exploiting information in the aggregate statistics Q(D) and not simply the overall structure of the distribution. In other words, the queries Q(D) are permitting reconstruction of elements of this dataset, not the distribution from which D was drawn. These findings are established both on 2010 US decennial Census data and queries and on Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset and provide further motivation for the careful application of provably private techniques such as differential privacy.
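A stripped-down version of such an attack can be sketched as randomized local search over candidate datasets: start from random rows, keep bit flips that bring the candidate's aggregate statistics closer to the published Q(D), and rank reconstructed rows by how often they recur across restarts. The dataset, query set, and search parameters below are invented for illustration and are far simpler than the Census-scale setting studied in the paper.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)

# Toy "private" dataset D: 50 rows with three binary attributes.
D = rng.integers(0, 2, size=(50, 3))

def queries(data):
    """Aggregate statistics Q(data): one-way marginal counts plus pairwise AND counts."""
    marginals = data.sum(axis=0)
    pairs = np.array([(data[:, i] & data[:, j]).sum()
                      for i in range(3) for j in range(i + 1, 3)])
    return np.concatenate([marginals, pairs])

def reconstruction_attack(q_target, n_rows, n_cols=3, restarts=40, steps=800):
    """Randomized local search: keep single-bit flips that reduce the L1 distance to q_target,
    then rank rows by how often they appear across restarts (the 'confidence' signature)."""
    tally = Counter()
    for _ in range(restarts):
        cand = rng.integers(0, 2, size=(n_rows, n_cols))
        err = np.abs(queries(cand) - q_target).sum()
        for _ in range(steps):
            i, j = rng.integers(n_rows), rng.integers(n_cols)
            cand[i, j] ^= 1  # propose a single-bit change
            new_err = np.abs(queries(cand) - q_target).sum()
            if new_err <= err:
                err = new_err      # accept the improvement
            else:
                cand[i, j] ^= 1    # revert the flip
        tally.update(map(tuple, cand))
    return tally.most_common()

ranked = reconstruction_attack(queries(D), n_rows=D.shape[0])
print(ranked[:5])  # most frequently reconstructed candidate rows first
```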