Title: Differential privacy in the 2020 US census: what will it do? Quantifying the accuracy/privacy tradeoff
Background: The 2020 US Census will use a novel approach to disclosure avoidance to protect respondents' data, called TopDown. This algorithm was applied to the 2018 end-to-end (E2E) test of the decennial census, and the Census Bureau has recently released the computer code used for this test, along with accompanying exposition.

Methods: We used the available code and data to better understand the error introduced by the E2E disclosure avoidance system when the Census Bureau applied it to 1940 census data, and we developed an empirical measure of privacy loss to compare the error and privacy of the new approach to that of a (non-differentially-private) simple-random-sampling approach to protecting privacy.

Results: We found that the empirical privacy loss of TopDown is substantially smaller than the theoretical guarantee for all privacy loss budgets we examined. When run on the 1940 census data, TopDown with a privacy budget of 1.0 was similar in error and privacy loss to a simple random sample of 50% of the US population; with a privacy budget of 4.0, it was similar to a 90% sample.

Conclusions: This work contributes to the beginning of a discussion on how best to balance privacy and accuracy in decennial census data collection, and continued discussion is needed.
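The headline comparison in the Results can be illustrated with a toy simulation. The sketch below is not the Bureau's TopDown algorithm: it applies a basic ε-differentially-private Laplace mechanism to made-up block counts and compares its error to that of a scaled-up simple random sample. All counts, parameters, and the error metric are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical block-level population counts (illustrative, not census data).
true_counts = rng.integers(50, 500, size=1000)

def laplace_release(counts, epsilon):
    """epsilon-DP release of a unit histogram: add Laplace(1/epsilon) noise
    to each count (adding or removing one person changes a count by 1)."""
    return counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)

def srs_estimate(counts, frac):
    """Estimate counts from a simple random sample of individuals,
    scaled back up by the sampling fraction (no formal DP guarantee)."""
    return rng.binomial(counts, frac) / frac

def mae(est, truth):
    """Mean absolute error across all block counts."""
    return float(np.abs(est - truth).mean())

print("MAE, Laplace eps=1.0 :", mae(laplace_release(true_counts, 1.0), true_counts))
print("MAE, 50% random sample:", mae(srs_estimate(true_counts, 0.5), true_counts))
```

In this toy setup the raw Laplace mechanism at ε = 1.0 has much lower error than a 50% sample; the paper's point is different and subtler, namely that the full TopDown pipeline's *empirical privacy loss* at a given budget resembles that of a sampling scheme.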
Award ID(s): 1839116
NSF-PAR ID: 10192012
Journal Name: Gates Open Research
Volume: 3
ISSN: 2572-4754
Page Range / eLocation ID: 1722
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Ligett, Katrina; Gupta, Swati (Eds.)
    The 2020 Decennial Census will be released with a new disclosure avoidance system in place, putting differential privacy in the spotlight for a wide range of data users. We consider several key applications of Census data in redistricting, developing tools and demonstrations for practitioners who are concerned about the impacts of this new noising algorithm, called TopDown. Based on a close look at reconstructed Texas data, we find reassuring evidence that TopDown will not threaten the ability to produce districts with tolerable population balance or to detect signals of racial polarization for Voting Rights Act enforcement.
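    The population-balance point can be sanity-checked with a toy calculation. This is an illustrative sketch, not the reconstructed Texas data or the actual TopDown mechanism: district totals aggregate many census blocks, so plausible per-district noise (sketched here as Laplace noise with a made-up scale) barely moves the maximal population deviation used in balance tests.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical district populations for a 10-district plan (illustrative only).
districts = rng.integers(760_000, 770_000, size=10).astype(float)

def max_deviation(pops):
    """Largest relative deviation from the ideal (equal) district size."""
    ideal = pops.mean()
    return float(np.abs(pops - ideal).max() / ideal)

# Per-district noise left over after aggregating block-level noise is tiny
# relative to a ~765,000-person district (scale chosen for illustration).
noised = districts + rng.laplace(scale=50.0, size=districts.shape)

print(f"max deviation before noise: {max_deviation(districts):.4%}")
print(f"max deviation after noise:  {max_deviation(noised):.4%}")
```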
  2. Abstract

    Differential privacy (DP) is in our smart phones, web browsers, social media, and the federal statistics used to allocate billions of dollars. Despite the mathematical concept being only 17 years old, differential privacy has amassed a rapidly growing list of real-world applications, such as Meta and US Census Bureau data. Why is DP so pervasive? DP is currently the only mathematical framework that provides a finite and quantifiable bound on disclosure risk when releasing information from confidential data. Previous concepts of data privacy and confidentiality required various assumptions about how a bad actor might attack sensitive data. DP is often called formally private because statisticians can mathematically prove the worst-case privacy loss that could result from releasing information based on the confidential data. Although DP ushered in a new era of data privacy and confidentiality methodologies, many researchers and data practitioners criticize differentially private frameworks. In this paper, we provide readers with a critical overview of the current state-of-the-art research on formal privacy methodologies and various relevant perspectives, challenges, and opportunities.

    This article is categorized under:

    Applications of Computational Statistics > Defense and National Security
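    The "finite and quantifiable bound" described above can be verified numerically for the simplest formally private building block, the Laplace mechanism (an illustrative sketch, not any specific production system): for two neighboring datasets whose true counts differ by one person, no released value is ever more than exp(ε) times likelier under one dataset than the other, no matter what the output is.

```python
import numpy as np

# Laplace mechanism for a count query with sensitivity 1: the released
# value is true_count + Laplace(1/epsilon) noise.
epsilon = 0.5
b = 1.0 / epsilon  # Laplace scale

def laplace_pdf(x, mu):
    """Density of the mechanism's output when the true count is mu."""
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

# Neighboring datasets: true counts 0 and 1 (they differ by one person).
xs = np.linspace(-20.0, 20.0, 10001)
log_ratio = np.log(laplace_pdf(xs, 0.0) / laplace_pdf(xs, 1.0))

# Worst-case privacy loss over every possible output is exactly epsilon.
print("max |log density ratio| =", float(np.abs(log_ratio).max()))
```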

     
  3. Abstract

    The dissemination of synthetic data can be an effective means of making information from sensitive data publicly available with a reduced risk of disclosure. While mechanisms exist for synthesizing data that satisfy formal privacy guarantees, these mechanisms do not typically resemble the models an end-user might use to analyse the data. More recently, the use of methods from the disease mapping literature has been proposed to generate spatially referenced synthetic data with high utility but without formal privacy guarantees. The objective of this paper is to help bridge the gap between the disease mapping and differential privacy literatures. In particular, we generalize an approach for generating differentially private synthetic data currently used by the US Census Bureau to the case of Poisson-distributed count data in a way that accommodates heterogeneity in population sizes and allows for the infusion of prior information regarding the underlying event rates. Following a pair of small simulation studies, we illustrate the utility of the synthetic data produced by this approach using publicly available, county-level heart-disease-related death counts. This study demonstrates the benefits of the proposed approach's flexibility with respect to heterogeneity in population sizes and event rates while motivating further research to improve its utility.
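    A conjugate-Bayes sketch of the Poisson synthetic-data idea is below. It is illustrative only: it omits the paper's formal privacy mechanism, and all populations, rates, and hyperparameters are made up. With a Gamma prior on each county's event rate, the posterior is also Gamma, and synthetic counts are posterior-predictive draws that automatically accommodate heterogeneous population sizes and prior information on rates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical county populations and true per-person event rates.
pop = rng.integers(5_000, 500_000, size=100)
true_rate = rng.gamma(2.0, 1e-4, size=100)   # mean rate 2e-4
observed = rng.poisson(pop * true_rate)

# Gamma(a0, b0) prior on rates encodes prior information (here centered on
# the same 2e-4 mean); the conjugate posterior for county i with y_i events
# among n_i people is Gamma(a0 + y_i, b0 + n_i), in shape/rate form.
a0, b0 = 2.0, 1e4
post_shape = a0 + observed
post_rate = b0 + pop

# Synthetic data: draw a rate per county, then fresh Poisson counts.
# (numpy's gamma takes a scale, so pass 1/rate.)
synthetic = rng.poisson(pop * rng.gamma(post_shape, 1.0 / post_rate))

print("observed mean :", observed.mean())
print("synthetic mean:", synthetic.mean())
```

Small counties (small n_i relative to b0) are shrunk toward the prior, while large counties are dominated by their own data, which is the heterogeneity the abstract refers to.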

     
  4. This study examines issues of Small Area Estimation that are raised by reliance on the American Community Survey (ACS), which reports tract‐level data based on much smaller samples than the decennial census long‐form that it replaced. We demonstrate the problem using a 100% transcription of microdata from the 1940 census. By drawing many samples from two major cities, we confirm a known pattern: random samples yield unbiased point estimates of means or proportions, but estimates based on smaller samples have larger average errors in measurement and greater risk of large error. Sampling variability also inflates estimates of measures of variation across areas (reflecting segregation or spatial inequality). This variation is at the heart of much contemporary spatial analysis. We then evaluate possible solutions. For point estimates, we examine three Bayesian models, all of which reduce sampling variation, and we encourage use of such models to correct ACS small area estimates. However, the corrected estimates cannot be used to calculate estimates of variation, because smoothing toward local or grand means artificially reduces variation. We note that there are potential Bayesian approaches to this problem, and we demonstrate an efficacious alternative that uses the original sample data.
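    The sampling pattern the study confirms is easy to reproduce in a toy simulation (all tract sizes, proportions, and sampling rates below are made-up stand-ins for the long-form and ACS designs): point estimates stay roughly unbiased while the across-tract variance, the kind of quantity behind segregation and spatial-inequality measures, inflates as the sample shrinks.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical tract-level proportions (illustrative, not ACS or 1940 data).
true_p = rng.beta(2, 5, size=200)
tract_n = 400  # people per tract

def sampled_props(frac):
    """Estimate each tract's proportion from a simple random sample."""
    n = int(tract_n * frac)
    hits = rng.binomial(n, true_p)
    return hits / n

# Point estimates are unbiased at either rate, but the across-tract
# variance is true variance plus sampling variance, which grows as
# the per-tract sample shrinks.
for frac in (0.17, 0.05):  # long-form-like vs ACS-like sampling rate
    est = sampled_props(frac)
    print(f"frac={frac:.2f}  mean={est.mean():.3f}  "
          f"across-tract var={est.var():.4f}  (true var {true_p.var():.4f})")
```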

     
  5. When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody, even as public information is supplied. To date, economists have not focused on the privacy loss inherent in data publication; these issues have instead been advanced almost exclusively by computer scientists, who are primarily interested in the technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality: a social choice problem. Furthermore, economists must ensure that new privacy models preserve the validity of public data for economic research.