
Title: How Adversarial Assumptions Influence Re-identification Risk Measures: A COVID-19 Case Study
The COVID-19 pandemic highlights the need for broad dissemination of case surveillance data. Local and global public health agencies have initiated efforts to do so, but the data available remain limited, due in part to concerns over privacy. As a result, current COVID-19 case surveillance data sharing policies are based on strong adversarial assumptions, such as the expectation that an attacker can readily re-identify individuals based on their distinguishability in a dataset. There are various re-identification risk measures to account for adversarial capabilities; however, current measures do not sufficiently account for real-world data challenges, particularly records missing from the identified resources that adversaries may rely upon to execute attacks (e.g., ten 50-year-old males in the de-identified dataset vs. five 50-year-old males in the identified dataset). In this paper, we introduce several approaches to amend such risk measures and assess re-identification risk in light of how an attacker's capabilities relate to missing records. We demonstrate the potential for these measures through a record linkage attack using COVID-19 case surveillance data and voter registration records in the state of Florida. Our findings demonstrate that adversarial assumptions, as realized in a risk measure, can dramatically affect re-identification risk estimation. Notably, we show that the re-identification risk is likely to be substantially smaller than the typical risk thresholds, which suggests that more detailed data could be shared publicly than is currently the case.
Journal Name: International Conference on Privacy in Statistical Databases
Sponsoring Org: National Science Foundation
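The missing-records example in the abstract (ten 50-year-old males in the de-identified dataset vs. five in the identified dataset) can be made concrete with a minimal Python sketch. This is not one of the paper's measures, which the abstract does not specify. It is a hypothetical illustration under two stated assumptions: the adversary picks a candidate uniformly at random within a matching group, and when the identified resource holds fewer records than the de-identified group, at most that many de-identified records have a true counterpart. The function names `group_sizes` and `per_record_risk` are invented for this example.

```python
from collections import Counter


def group_sizes(records, keys):
    """Count records per combination of quasi-identifier values."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def per_record_risk(deid_count, ident_count, account_for_missing):
    """Chance that one de-identified record in a group is linked correctly.

    Illustrative assumptions (not taken from the paper): uniform random
    choice among candidates, and at most ident_count of the de-identified
    records have a true counterpart in the identified resource.
    """
    if ident_count == 0:
        return 0.0  # no candidate to link to
    if not account_for_missing:
        # Treats the identified resource as complete: 1 / candidates.
        return 1.0 / ident_count
    # At most ident_count / deid_count of the records have a counterpart,
    # and each of those is picked with probability 1 / ident_count.
    return 1.0 / max(deid_count, ident_count)


keys = ("age", "sex")
deid = [{"age": 50, "sex": "M"}] * 10   # de-identified dataset
ident = [{"age": 50, "sex": "M"}] * 5   # identified resource, records missing

d = group_sizes(deid, keys)
v = group_sizes(ident, keys)
group = (50, "M")

naive = per_record_risk(d[group], v[group], account_for_missing=False)
adjusted = per_record_risk(d[group], v[group], account_for_missing=True)
print(naive, adjusted)  # prints: 0.2 0.1
```

Under these assumptions, ignoring the missing records doubles the estimated per-record risk for this group, which is the kind of sensitivity to adversarial assumptions the abstract describes.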
More Like this
  1. Objective: Supporting public health research and the public’s situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 and recent state-level regulations, permits sharing deidentified person-level data; however, current deidentification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt deidentification for near-real time sharing of person-level surveillance data. Materials and Methods: The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the reidentification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework’s effectiveness in maintaining the PK11 threshold of 0.01. Results: When sharing COVID-19 county-level case data across all US counties, the framework’s approach meets the threshold for 96.2% of daily data releases, while a policy based on current deidentification techniques meets the threshold for 32.3%. Conclusion: Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.
  2. Background: Significant uncertainty has existed about the safety of reopening college and university campuses before the COVID-19 pandemic is better controlled. Moreover, little is known about the effects that on-campus students may have on local higher-risk communities. Objective: We aimed to estimate the range of potential community and campus COVID-19 exposures, infections, and mortality under various university reopening plans and uncertainties. Methods: We developed campus-only, community-only, and campus × community epidemic differential equations and agent-based models, with inputs estimated via published and grey literature, expert opinion, and parameter search algorithms. Campus opening plans (spanning fully open, hybrid, and fully virtual approaches) were identified from websites and publications. Additional student and community exposures, infections, and mortality over 16-week semesters were estimated under each scenario, with 10% trimmed medians, standard deviations, and probability intervals computed to omit extreme outliers. Sensitivity analyses were conducted to inform potential effective interventions. Results: Predicted 16-week campus and additional community exposures, infections, and mortality for the base case with no precautions (or negligible compliance) varied significantly from their medians (4- to 10-fold). Over 5% of on-campus students were infected after a mean of 76 (SD 17) days, with the greatest increase (first inflection point) occurring on average on day 84 (SD 10.2 days) of the semester and with total additional community exposures, infections, and mortality ranging from 1-187, 13-820, and 1-21 per 10,000 residents, respectively. Reopening precautions reduced infections by 24%-26% and mortality by 36%-50% in both populations. Beyond campus and community reproductive numbers, sensitivity analysis indicated no dominant factors that interventions could primarily target to reduce the magnitude and variability in outcomes, suggesting the importance of comprehensive public health measures and surveillance. Conclusions: Community and campus COVID-19 exposures, infections, and mortality resulting from reopening campuses are highly unpredictable regardless of precautions. Public health implications include the need for effective surveillance and flexible campus operations.
  3. Background: Conventional diagnosis of COVID-19 with reverse transcription polymerase chain reaction (RT-PCR) testing (hereafter, PCR) is associated with prolonged time to diagnosis and significant costs to run the test. The SARS-CoV-2 virus might lead to characteristic patterns in the results of widely available, routine blood tests that could be identified with machine learning methodologies. Machine learning modalities integrating findings from these common laboratory test results might accelerate ruling out COVID-19 in emergency department patients. Objective: We sought to develop (ie, train and internally validate with cross-validation techniques) and externally validate a machine learning model to rule out COVID-19 using only routine blood tests among adults in emergency departments. Methods: Using clinical data from emergency departments (EDs) from 66 US hospitals before the pandemic (before the end of December 2019) or during the pandemic (March-July 2020), we included patients aged ≥20 years in the study time frame. We excluded those with missing laboratory results. Model training used 2183 PCR-confirmed cases from 43 hospitals during the pandemic; negative controls were 10,000 prepandemic patients from the same hospitals. External validation used 23 hospitals with 1020 PCR-confirmed cases and 171,734 prepandemic negative controls. The main outcome was COVID-19 status predicted using same-day routine laboratory results. Model performance was assessed with area under the receiver operating characteristic (AUROC) curve as well as sensitivity, specificity, and negative predictive value (NPV). Results: Of 192,779 patients included in the training, external validation, and sensitivity data sets (median age decile 50 [IQR 30-60] years, 40.5% male [78,249/192,779]), AUROC for training and external validation was 0.91 (95% CI 0.90-0.92). Using a risk score cutoff of 1.0 (out of 100) in the external validation data set, the model achieved sensitivity of 95.9% and specificity of 41.7%; with a cutoff of 2.0, sensitivity was 92.6% and specificity was 59.9%. At the cutoff of 2.0, the NPVs at a prevalence of 1%, 10%, and 20% were 99.9%, 98.6%, and 97%, respectively. Conclusions: A machine learning model developed with multicenter clinical data integrating commonly collected ED laboratory data demonstrated high rule-out accuracy for COVID-19 status, and might inform selective use of PCR-based testing.
  4. Background: No versatile web app exists that allows epidemiologists and managers around the world to comprehensively analyze the impacts of COVID-19 mitigation. The app presented here fills this gap.


    Our web app uses a model that explicitly identifies susceptible, contact, latent, asymptomatic, symptomatic and recovered classes of individuals, and a parallel set of response classes, subject to lower pathogen-contact rates. The user inputs a CSV file of incidence and, if of interest, mortality rate data. A default set of parameters is available that can be overwritten through input or online entry, and a user-selected subset of these can be fitted to the model using maximum-likelihood estimation (MLE). Model fitting and forecasting intervals are specifiable and changes to parameters allow counterfactual and forecasting scenarios. Confidence or credible intervals can be generated using stochastic simulations, based on MLE values, or on an inputted CSV file containing Markov chain Monte Carlo (MCMC) estimates of one or more parameters.


    We illustrate the use of our web app in extracting social distancing, social relaxation, surveillance or virulence switching functions (i.e., time-varying drivers) from the incidence and mortality rates of COVID-19 epidemics in Israel, South Africa, and England. The Israeli outbreak exhibits four distinct phases: initial outbreak, social distancing, social relaxation, and a second wave mitigation phase. An MCMC projection of this latter phase suggests the Israeli epidemic will continue to produce into late November an average of around 1500 new cases per day, unless the population practices social-relaxation measures at least 5-fold below the level in August, which itself is 4-fold below the level at the start of July. Our analysis of the relatively late South African outbreak that became the world’s fifth largest COVID-19 epidemic in July revealed that the decline through late July and early August was characterised by a social distancing driver operating at more than twice the per-capita applicable-disease-class (pc-adc) rate of the social relaxation driver. Our analysis of the relatively early English outbreak identified a more than 2-fold improvement in surveillance over the course of the epidemic. It also identified a pc-adc social distancing rate in early August that, though nearly four times the pc-adc social relaxation rate, appeared to barely contain a second wave that would break out if social distancing were further relaxed.


    Our web app provides policy makers and health officers who have no epidemiological modelling or computer coding expertise with an invaluable tool for assessing the impacts of different outbreak mitigation policies and measures. This includes an ability to generate an epidemic-suppression or curve-flattening index that measures the intensity with which behavioural responses suppress or flatten the epidemic curve in the region under consideration.

  5. The contributions of asymptomatic infections to herd immunity and community transmission are key to the resurgence and control of COVID-19, but are difficult to estimate using current models that ignore changes in testing capacity. Using a model that incorporates daily testing information fit to the case and serology data from New York City, we show that the proportion of symptomatic cases is low, ranging from 13 to 18%, and that the reproductive number may be larger than often assumed. Asymptomatic infections contribute substantially to herd immunity, and to community transmission together with presymptomatic ones. If asymptomatic infections transmit at similar rates as symptomatic ones, the overall reproductive number across all classes is larger than often assumed, with estimates ranging from 3.2 to 4.4. If they transmit poorly, then symptomatic cases have a larger reproductive number ranging from 3.9 to 8.1. Even in this regime, presymptomatic and asymptomatic cases together comprise at least 50% of the force of infection at the outbreak peak. We find no regimes in which all infection subpopulations have reproductive numbers lower than three. These findings elucidate the uncertainty that current case and serology data cannot resolve, despite consideration of different model structures. They also emphasize how temporal data on testing can reduce and better define this uncertainty, as we move forward through longer surveillance and second epidemic waves. Complementary information is required to determine the transmissibility of asymptomatic cases, which we discuss. Regardless, current assumptions about the basic reproductive number of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) should be reconsidered.
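The negative predictive values quoted in the third related record above (cutoff 2.0: sensitivity 92.6%, specificity 59.9%) follow from the standard Bayes relation between sensitivity, specificity, and prevalence. The short Python sketch below recomputes them as a worked check. It assumes only that textbook formula; the function name is chosen for this example and nothing here is taken from that study's code.

```python
def negative_predictive_value(sensitivity, specificity, prevalence):
    """NPV = P(no disease | negative test), from Bayes' rule."""
    true_negatives = specificity * (1.0 - prevalence)
    false_negatives = (1.0 - sensitivity) * prevalence
    return true_negatives / (true_negatives + false_negatives)


# Sensitivity and specificity at the risk score cutoff of 2.0.
for prevalence in (0.01, 0.10, 0.20):
    npv = negative_predictive_value(0.926, 0.599, prevalence)
    print(f"prevalence {prevalence:.0%}: NPV {npv:.1%}")
# prevalence 1%: NPV 99.9%
# prevalence 10%: NPV 98.6%
# prevalence 20%: NPV 97.0%
```

The recomputed values agree with the reported 99.9%, 98.6%, and 97%, and show how quickly NPV falls as prevalence rises even with sensitivity held fixed.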