Title: Implications of Data Anonymization on the Statistical Evidence of Disparity
Research and practical development of data-anonymization techniques have proliferated in recent years. Yet, limited attention has been paid to examining the potentially disparate impact of privacy protection on underprivileged subpopulations. This study is one of the first attempts to examine the extent to which data anonymization could mask the gross statistical disparities between subpopulations in the data. We first describe two common mechanisms of data anonymization and two prevalent types of statistical evidence for disparity. Then, we develop a conceptual foundation and mathematical formalism demonstrating that the two data-anonymization mechanisms have distinctive impacts on the identifiability of disparity, which also varies based on its statistical operationalization. After validating our findings with empirical evidence, we discuss the business and policy implications, highlighting the need for firms and policy makers to balance the protection of privacy against the recognition/rectification of disparate impact. This paper was accepted by Chris Forman, information systems.
Award ID(s):
1851637
PAR ID:
10431405
Journal Name:
Management Science
Volume:
68
Issue:
4
ISSN:
0025-1909
Page Range / eLocation ID:
2600 to 2618
Sponsoring Org:
National Science Foundation
More Like this
  1. Aggregated community-scale data could be harnessed to provide insights into the disparate impacts of managed power outages, burst pipes, and food inaccessibility during extreme weather events. During the winter storm that brought historically low temperatures, snow, and ice to the entire state of Texas in February 2021, Texas power-generating plant operators resorted to rolling blackouts to prevent collapse of the power grid when power demand overwhelmed supply. To reveal the disparate impact of managed power outages on vulnerable subpopulations in Harris County, Texas, which encompasses the city of Houston, we collected and analyzed community-scale big data using statistical and trend classification analyses. The results highlight the spatial and temporal patterns of impacts on vulnerable subpopulations in Harris County. The findings show a significant disparity in the extent and duration of power outages experienced by low-income and minority groups, suggesting the existence of inequality in the management and implementation of the power outage. Also, the extent of burst pipes and disrupted food access, as a proxy for storm impact, were more severe for low-income and minority groups. Insights provided by the results could form a basis from which infrastructure operators might enhance social equality during managed service disruptions in such events. The results and findings demonstrate the value of community-scale big data sources for rapid impact assessment in the aftermath of extreme weather events.
  2. Differentially private (DP) mechanisms have been deployed in a variety of high-impact social settings (perhaps most notably by the U.S. Census). Since all DP mechanisms involve adding noise to results of statistical queries, they are expected to impact our ability to accurately analyze and learn from data, in effect trading off privacy with utility. Alarmingly, the impact of DP on utility can vary significantly among different sub-populations. A simple way to reduce this disparity is with stratification. First compute an independent private estimate for each group in the data set (which may be the intersection of several protected classes), then, to compute estimates of global statistics, appropriately recombine these group estimates. Our main observation is that naive stratification often yields high-accuracy estimates of population-level statistics, without the need for additional privacy budget. We support this observation theoretically and empirically. Our theoretical results center on the private mean estimation problem, while our empirical results center on extensive experiments on private data synthesis to demonstrate the effectiveness of stratification on a variety of private mechanisms. Overall, we argue that this straightforward approach provides a strong baseline against which future work on reducing utility disparities of DP mechanisms should be compared. 
  3. The increasing impact of algorithmic decisions on people’s lives compels us to scrutinize their fairness and, in particular, the disparate impacts that ostensibly color-blind algorithms can have on different groups. Examples include credit decisioning, hiring, advertising, criminal justice, personalized medicine, and targeted policy making, where in some cases legislative or regulatory frameworks for fairness exist and define specific protected classes. In this paper we study a fundamental challenge to assessing disparate impacts in practice: protected class membership is often not observed in the data. This is particularly a problem in lending and healthcare. We consider the use of an auxiliary data set, such as the U.S. census, to construct models that predict the protected class from proxy variables, such as surname and geolocation. We show that even with such data, a variety of common disparity measures are generally unidentifiable, providing a new perspective on the documented biases of popular proxy-based methods. We provide exact characterizations of the tightest possible set of all possible true disparities that are consistent with the data (and possibly additional assumptions). We further provide optimization-based algorithms for computing and visualizing these sets and statistical tools to assess sampling uncertainty. Together, these enable reliable and robust assessments of disparities—an important tool when disparity assessment can have far-reaching policy implications. We demonstrate this in two case studies with real data: mortgage lending and personalized medicine dosing. This paper was accepted by Hamid Nazerzadeh, Guest Editor for the Special Issue on Data-Driven Prescriptive Analytics.
  4. In the era of digital communities, a massive volume of data is created from people's online activities on a daily basis. Such data is sometimes shared with third parties for commercial benefit, which has raised concerns about privacy disclosure. Privacy-preserving technologies have been developed to protect people's sensitive information in data publishing. However, due to the availability of data from other sources, e.g., blogging, it is still possible to de-anonymize users even from anonymized data sets. This paper presents the design and implementation of an Interactive De-Anonymization Learning system—IDEAL. The system can help students learn about de-anonymization through engaging hands-on activities, such as tuning different parameters to evaluate their impact on the accuracy of de-anonymization, and observing the effect of data anonymization on de-anonymization. A pilot lab session to evaluate the system was conducted among thirty-five students at Prairie View A&M University and the feedback was very positive.
  5. Most organizations rely on relational databases for their day-to-day business functions. Data management policies fall under the umbrella of IT Operations, dictated by a combination of internal organizational policies and government regulations. Many privacy laws (such as Europe’s General Data Protection Regulation and California’s Consumer Privacy Act) establish policy requirements for organizations, requiring the preservation or purging of certain customer data across their systems. Organizational disaster recovery policies also mandate backup policies to prevent data loss. Thus, the data in these databases are subject to a range of policies, including data retention and data purging rules, which may come into conflict with the need for regular backups. In this paper, we discuss the trade-offs between different compliance mechanisms to maintain IT Operational policies. We consider the practical availability of data in an active relational database and in a backup, including: 1) supporting data privacy rules with respect to preserving or purging customer data, and 2) the application performance impact caused by the database policy implementation. We first discuss the state of data privacy compliance in database systems. We then look at enforcement of common IT operational policies with regard to database backups. We consider different implementations used to enforce privacy rule compliance, combined with a detailed discussion of how these approaches impact the performance of a database at different phases. We demonstrate that naive compliance implementations will incur a prohibitively high cost and impose onerous restrictions on the backup and restore process, but will not affect daily user query transaction cost. However, we also show that other solutions can achieve far lower backup and restore costs at the price of a small (<5%) overhead to non-SELECT queries.