Abstract
Objective: Supporting public health research and the public’s situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 and recent state-level regulations, permits sharing deidentified person-level data; however, current deidentification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt deidentification for near-real-time sharing of person-level surveillance data.
Materials and Methods: The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the reidentification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework’s effectiveness in maintaining the PK11 threshold of 0.01.
Results: When sharing COVID-19 county-level case data across all US counties, the framework’s approach meets the threshold for 96.2% of daily data releases, while a policy based on current deidentification techniques meets the threshold for 32.3%.
Conclusion: Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.
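The PK11 measure named in the abstract (the proportion of records that fall in a group of fewer than 11) can be illustrated with a minimal sketch. The record layout, the field values, and the function name `pk` are assumptions made for illustration only; they are not taken from the paper, and the sketch does not reproduce the framework's simulation or forecasting step.

```python
from collections import Counter


def pk(records, k=11):
    """Return the share of records whose group has fewer than k members.

    Each record is a tuple of quasi-identifier values; records with the
    same tuple form one group. With k=11 this is the PK11 measure.
    """
    if not records:
        return 0.0
    group_sizes = Counter(records)
    # Count every record that sits in a group smaller than k.
    at_risk = sum(size for size in group_sizes.values() if size < k)
    return at_risk / len(records)


# Hypothetical case records: (county, age_group, sex).
records = (
    [("A", "50-59", "M")] * 12
    + [("A", "60-69", "F")] * 3
    + [("B", "50-59", "M")] * 5
)

# A coarser, hypothetical generalization policy: drop the county field.
generalized = [(age, sex) for _county, age, sex in records]

print(f"PK11 with county:    {pk(records):.2f}")
print(f"PK11 without county: {pk(generalized):.2f}")
```

In this toy data, generalizing away the county merges two of the three groups, so fewer records sit in small groups and PK11 falls. Choosing among such policies each week is the step the abstract attributes to the framework.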
How Adversarial Assumptions Influence Re-identification Risk Measures: A COVID-19 Case Study
The COVID-19 pandemic highlights the need for broad dissemination of case surveillance data. Local and global public health agencies have initiated efforts to do so, but the data available remain limited, due in part to concerns over privacy. As a result, current COVID-19 case surveillance data sharing policies are based on strong adversarial assumptions, such as the expectation that an attacker can readily re-identify individuals based on their distinguishability in a dataset. There are various re-identification risk measures to account for adversarial capabilities; however, the current array insufficiently accounts for real-world data challenges, particularly missing records in the resources of identifiable records that adversaries may rely upon to execute attacks (e.g., ten 50-year-old males in the de-identified dataset vs. five 50-year-old males in the identified dataset). In this paper, we introduce several approaches to amend such risk measures and assess re-identification risk in light of how an attacker's capabilities relate to missing records. We demonstrate the potential for these measures through a record linkage attack using COVID-19 case surveillance data and voter registration records in the state of Florida. Our findings demonstrate that adversarial assumptions, as realized in a risk measure, can dramatically affect re-identification risk estimation. Notably, we show that the re-identification risk is likely to be substantially smaller than the typical risk thresholds, which suggests that more detailed data could be shared publicly than is currently the case.
- Award ID(s): 2029661
- PAR ID: 10362773
- Date Published:
- Journal Name: International Conference on Privacy in Statistical Databases
- Page Range / eLocation ID: 361–374
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract
Background: Securing adequate data privacy is critical for the productive utilization of data. De-identification, involving masking or replacing specific values in a dataset, could damage the dataset’s utility. However, finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset’s utility with a clinical analytic use case and assess the feasibility of finding a workable tradeoff between data privacy and utility.
Methods: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two.
Results: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores.
Conclusions: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data’s intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.
-
The US and the rest of the world have suffered from the COVID-19 pandemic for over a year. The high transmissibility and severity of this virus have provoked governments to adopt a variety of mitigation strategies. Some of these previous measures, such as social distancing and mask mandates, were effective in reducing the case growth rate yet became economically and administratively difficult to enforce as the pandemic continued. In late December 2020, COVID-19 vaccines were first approved in the US and states began a phased implementation of COVID-19 vaccination. However, there is limited quantitative evidence regarding the effectiveness of the phased COVID-19 vaccination. This study aims to provide a rapid assessment of the adoption, reach, and effectiveness of the phased implementation of COVID-19 vaccination. We utilize an event-study analysis to evaluate the effect of vaccination on the state-level daily COVID-19 case growth rate. Through this analysis, we assert that vaccination was effective in reducing the spread of COVID-19 shortly after the first shots were given. Specifically, the case growth rate declined by 0.124, 0.347, 0.345, 0.464, 0.490, and 0.756 percentage points corresponding to the 1–5, 6–10, 11–15, 16–20, 21–25, and 26 or more day periods after the initial shots. The findings could be insightful for policymakers as they work to optimize vaccine distribution in later phases, and also for the public as the COVID-19 related health risk is a contentious issue.
-
Background: Significant uncertainty has existed about the safety of reopening college and university campuses before the COVID-19 pandemic is better controlled. Moreover, little is known about the effects that on-campus students may have on local higher-risk communities.
Objective: We aimed to estimate the range of potential community and campus COVID-19 exposures, infections, and mortality under various university reopening plans and uncertainties.
Methods: We developed campus-only, community-only, and campus × community epidemic differential equations and agent-based models, with inputs estimated via published and grey literature, expert opinion, and parameter search algorithms. Campus opening plans (spanning fully open, hybrid, and fully virtual approaches) were identified from websites and publications. Additional student and community exposures, infections, and mortality over 16-week semesters were estimated under each scenario, with 10% trimmed medians, standard deviations, and probability intervals computed to omit extreme outliers. Sensitivity analyses were conducted to inform potential effective interventions.
Results: Predicted 16-week campus and additional community exposures, infections, and mortality for the base case with no precautions (or negligible compliance) varied significantly from their medians (4- to 10-fold). Over 5% of on-campus students were infected after a mean of 76 (SD 17) days, with the greatest increase (first inflection point) occurring on average on day 84 (SD 10.2 days) of the semester and with total additional community exposures, infections, and mortality ranging from 1-187, 13-820, and 1-21 per 10,000 residents, respectively. Reopening precautions reduced infections by 24%-26% and mortality by 36%-50% in both populations. Beyond campus and community reproductive numbers, sensitivity analysis indicated no dominant factors that interventions could primarily target to reduce the magnitude and variability in outcomes, suggesting the importance of comprehensive public health measures and surveillance.
Conclusions: Community and campus COVID-19 exposures, infections, and mortality resulting from reopening campuses are highly unpredictable regardless of precautions. Public health implications include the need for effective surveillance and flexible campus operations.
-
Reopening of colleges and universities for the Fall semester of 2020 across the United States has caused significant COVID-19 case spikes, requiring reactive responses such as temporary closures and switching to online learning. Until sufficient levels of immunity are reached through vaccination, Institutions of Higher Education will need to balance academic operations with COVID-19 spread risk within and outside the student community. In this work, we study the impact of proximity statistics obtained from high resolution mobility traces in predicting case rate surges in university counties. We focus on 50 land-grant university counties (LGUCs) across the country and show high correlation (PCC > 0.6) between proximity statistics and COVID-19 case rates for several LGUCs during the period around Fall 2020 reopenings. These observations provide a lead time of up to 3 weeks in preparing resources and planning containment efforts. We also show how features such as total population, population affiliated with university, median income and case rate intensity could explain some of the observed high correlation. We believe these easily explainable mobility metrics along with other disease surveillance indicators can help universities be better prepared for the Spring 2021 semester.
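The lead-time idea in the abstract above (a proximity statistic that correlates with case rates some weeks later) can be sketched as a lagged Pearson correlation. The series below are synthetic, the time unit is arbitrary, and the helper names `pearson` and `lagged_pearson` are assumptions for illustration; this is not the authors' data or method.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def lagged_pearson(signal, outcome, lag):
    """Correlate signal[t] with outcome[t + lag] for a non-negative lag."""
    if lag == 0:
        return pearson(signal, outcome)
    return pearson(signal[:-lag], outcome[lag:])


# Synthetic proximity statistic, one value per time step.
proximity = [1, 2, 4, 7, 11, 9, 6, 4, 3, 2, 1, 1]
# Synthetic case rate that follows the proximity signal two steps later.
case_rate = [8, 8] + [3 * p + 5 for p in proximity[:-2]]

# Pick the lag (0 to 3 steps) with the strongest correlation.
best_lag = max(range(4), key=lambda lag: lagged_pearson(proximity, case_rate, lag))
print("best lag:", best_lag)
```

Scanning a small range of lags and keeping the one with the highest coefficient gives a rough estimate of how far ahead the signal runs; real mobility and case series would be noisier and would need more care than this toy example.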