
Title: Dynamically adjusting case reporting policy to maximize privacy and public health utility in the face of a pandemic
Abstract

Objective: Supporting public health research and the public’s situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 and recent state-level regulations, permits sharing deidentified person-level data; however, current deidentification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt deidentification for near-real-time sharing of person-level surveillance data.

Materials and Methods: The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the reidentification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework’s effectiveness in maintaining the PK11 threshold of 0.1.

Results: When sharing COVID-19 county-level case data across all US counties, the framework’s approach meets the threshold for 96.2% of daily data releases, while a policy based on current deidentification techniques meets the threshold for 32.3%.

Conclusion: Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.
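The PK11 measure above can be computed directly from a candidate release: generalize each record's quasi-identifiers under the policy, count the size of each resulting equivalence class, and report the fraction of records in classes smaller than 11. A minimal sketch, with hypothetical quasi-identifiers and generalization functions (the paper's actual attributes and policies are not reproduced here):

```python
from collections import Counter

def pk11(records, quasi_identifiers, policy):
    """Proportion of records whose equivalence class (records sharing the
    same generalized quasi-identifier values) has size less than 11."""
    keys = [tuple(policy[q](rec[q]) for q in quasi_identifiers) for rec in records]
    sizes = Counter(keys)
    return sum(1 for k in keys if sizes[k] < 11) / len(keys)

# Hypothetical generalization policy: 10-year age bins; county and
# week of diagnosis reported exactly.
policy = {"age": lambda a: a // 10, "county": lambda c: c, "week": lambda w: w}

# Toy data: 12 thirty-something cases plus one 81-year-old,
# all in the same county and week.
records = [{"age": 34, "county": "Davidson", "week": 31}] * 6 \
        + [{"age": 37, "county": "Davidson", "week": 31}] * 6 \
        + [{"age": 81, "county": "Davidson", "week": 31}]

risk = pk11(records, ["age", "county", "week"], policy)  # 1 of 13 records is in a small class
```

A policy satisfies the release criterion when the returned proportion is at or below 0.1; the weekly selection step would then choose the least-generalized candidate policy that the simulation forecasts will stay under that bound.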
Journal Name:
Journal of the American Medical Informatics Association
Page Range / eLocation ID:
853 to 863
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract This project is funded by the US National Science Foundation (NSF) through their NSF RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health and health systems and on providers of essential community services. Modeling the Covid-19 pandemic spread is challenging, but there are data that can be used to project resource demands. Estimates of the reproductive number (R) of SARS-CoV-2 show that at the beginning of the epidemic, each infected person spreads the virus to at least two others, on average (Emanuel et al. in N Engl J Med. 2020, Livingston and Bucher in JAMA 323(14):1335, 2020). A conservatively low estimate is that 5 % of the population could become infected within 3 months. Preliminary data from China and Italy regarding the distribution of case severity and fatality vary widely (Wu and McGoogan in JAMA 323(13):1239–42, 2020). A recent large-scale analysis from China suggests that 80 % of those infected either are asymptomatic or have mild symptoms, a finding that implies that demand for advanced medical services might apply to only 20 % of the total infected.
Of patients infected with Covid-19, about 15 % have severe illness and 5 % have critical illness (Emanuel et al. in N Engl J Med. 2020). Overall, mortality ranges from 0.25 % to as high as 3.0 % (Emanuel et al. in N Engl J Med. 2020, Wilson et al. in Emerg Infect Dis 26(6):1339, 2020). Case fatality rates are much higher for vulnerable populations, such as persons over the age of 80 years (> 14 %) and those with coexisting conditions (10 % for those with cardiovascular disease and 7 % for those with diabetes) (Emanuel et al. in N Engl J Med. 2020). Overall, Covid-19 is substantially deadlier than seasonal influenza, which has a mortality of roughly 0.1 %. Public health efforts depend heavily on predicting how diseases such as those caused by Covid-19 spread across the globe. During the early days of a new outbreak, when reliable data are still scarce, researchers turn to mathematical models that can predict where people who could be infected are going and how likely they are to bring the disease with them. These computational methods use known statistical equations that calculate the probability of individuals transmitting the illness. Modern computational power allows these models to quickly incorporate multiple inputs, such as a given disease’s ability to pass from person to person and the movement patterns of potentially infected people traveling by air and land. This process sometimes involves making assumptions about unknown factors, such as an individual’s exact travel pattern. By plugging in different possible versions of each input, however, researchers can update the models as new information becomes available and compare their results to observed patterns for the illness. In this paper we describe the development of a model of Corona spread using innovative big data analytics techniques and tools. We leveraged our experience from research in modeling Ebola spread (Shaw et al. Modeling Ebola Spread and Using HPCC/KEL System. In: Big Data Technologies and Applications 2016 (pp. 347-385). Springer, Cham) to model Corona spread, obtain new results, and help reduce the number of Corona patients. We closely collaborated with LexisNexis, a leading US data analytics company and a member of our NSF I/UCRC for Advanced Knowledge Enablement. The lack of a comprehensive view and informative analysis of the status of the pandemic can also cause panic and instability within society. Our work proposes the HPCC Systems Covid-19 tracker, which provides a multi-level view of the pandemic with informative virus-spreading indicators in a timely manner. The system embeds a classical epidemiological model known as SIR and spreading indicators based on a causal model. The data solution of the tracker is built on top of the Big Data processing platform HPCC Systems, from ingesting and tracking various data sources to fast delivery of the data to the public. The HPCC Systems Covid-19 tracker presents the Covid-19 data on a daily, weekly, and cumulative basis, from the global level down to the county level. It also provides statistical analysis for each level, such as new cases per 100,000 population. Primary analyses, such as Contagion Risk and Infection State, are based on a causal model with a seven-day sliding window. Our work has been released as a publicly available website and has attracted a great volume of traffic. The project is open-sourced and available on GitHub. The system was developed on the LexisNexis HPCC Systems platform, which is briefly described in the paper.
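The classical SIR model embedded in the tracker partitions a population into susceptible, infected, and recovered fractions. A minimal forward-Euler sketch, with illustrative assumed parameters rather than the tracker's fitted values:

```python
def sir_step(s, i, r, beta, gamma, dt=1.0):
    """One forward-Euler step of the classical SIR compartment model.
    s, i, r are population fractions (s + i + r == 1); beta is the
    transmission rate, gamma the recovery rate."""
    new_inf = beta * s * i * dt   # newly infected this step
    new_rec = gamma * i * dt      # newly recovered this step
    return s - new_inf, i + new_inf - new_rec, r + new_rec

# Assumed illustrative parameters: R0 = beta/gamma = 2, mean infectious
# period 1/gamma = 5 days, one initial case per 10,000 people.
beta, gamma = 0.4, 0.2
s, i, r = 1 - 1e-4, 1e-4, 0.0
for _ in range(200):  # simulate 200 days
    s, i, r = sir_step(s, i, r, beta, gamma)
```

With R0 = 2, the simulated epidemic burns out after infecting roughly 80% of the population, consistent with the classical final-size relation.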
  2. The COVID-19 pandemic highlights the need for broad dissemination of case surveillance data. Local and global public health agencies have initiated efforts to do so, but there remains limited data available, due in part to concerns over privacy. As a result, current COVID-19 case surveillance data sharing policies are based on strong adversarial assumptions, such as the expectation that an attacker can readily re-identify individuals based on their distinguishability in a dataset. There are various re-identification risk measures to account for adversarial capabilities; however, the current array insufficiently accounts for real-world data challenges, particularly missing records in the resources of identifiable records that adversaries may rely upon to execute attacks (e.g., 10 50-year-old males in the de-identified dataset vs 5 in the identified dataset). In this paper, we introduce several approaches to amend such risk measures and assess re-identification risk in light of how an attacker's capabilities relate to missing records. We demonstrate the potential of these measures through a record linkage attack using COVID-19 case surveillance data and voter registration records in the state of Florida. Our findings demonstrate that adversarial assumptions, as realized in a risk measure, can dramatically affect re-identification risk estimation. Notably, we show that the re-identification risk is likely to be substantially smaller than typical risk thresholds, which suggests that more detailed data could be shared publicly than is currently the case.
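The effect of missing records on a risk estimate can be illustrated with a simple coverage adjustment. This is a hypothetical sketch, not the paper's amended measures: `coverage` is an assumed fraction of the true population group that appears in the identification resource.

```python
def naive_group_risk(n_identified):
    """Classical assumption: the identification resource (e.g., a voter
    roll) covers the entire population group, so a uniformly chosen
    match is correct with probability 1/n_identified."""
    return 1.0 / n_identified

def coverage_adjusted_risk(n_identified, coverage):
    """Hypothetical adjustment: if the resource holds only a fraction
    `coverage` of the true population group, that group actually has
    about n_identified/coverage members, and a uniformly chosen match
    is correct with probability coverage/n_identified."""
    population_group = n_identified / coverage
    return 1.0 / population_group

# Echoing the group sizes in the abstract: 5 matching individuals appear
# in the identified resource, but only half the true group is covered.
naive = naive_group_risk(5)                 # 0.2 under the complete-resource assumption
adjusted = coverage_adjusted_risk(5, 0.5)   # 0.1 once missing records are accounted for
```

As the sketch shows, ignoring missing records inflates the estimated risk (here by a factor of two), which is the direction of the effect the paper reports.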
  3. Abstract

    Most current public health surveillance methods used in epidemiological studies to identify hotspots of disease assume that regional disease case counts are independently distributed and lack the ability to adjust for confounding covariates. This article proposes a new approach that uses a simultaneous autoregressive (SAR) model, a popular spatial regression approach, within the classical space-time cumulative sum (CUSUM) framework for detecting changes in the spatial distribution of count data while accounting for risk factors and spatial correlation. We develop expressions for the likelihood ratio test monitoring statistics based on a SAR model with covariates, leading to the proposed space-time CUSUM test statistic. The effectiveness of the proposed monitoring approach in detecting and identifying step shifts is studied by simulating various shift scenarios in regional counts. A case study monitoring regional COVID-19 infection counts while adjusting for social vulnerability, which is often correlated with a community's susceptibility to infection, is presented to illustrate the application of the proposed methodology in public health surveillance.
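The monitoring side of such an approach can be sketched as a standard one-sided CUSUM applied to standardized model residuals. Fitting the SAR model with covariates, which would produce those residuals, is omitted here, and the reference value `k` and decision threshold `h` are conventional illustrative choices rather than the paper's likelihood-ratio-based statistic:

```python
def cusum_upper(residuals, k=0.5, h=5.0):
    """One-sided upper CUSUM on standardized residuals: accumulate
    deviations above the reference value k and signal when the statistic
    crosses the decision threshold h. Returns the statistic path and the
    index of the first signal (or None if no signal occurs)."""
    c, path, signal = 0.0, [], None
    for t, z in enumerate(residuals):
        c = max(0.0, c + z - k)  # statistic resets at zero from below
        path.append(c)
        if signal is None and c > h:
            signal = t
    return path, signal

# Synthetic in-control residuals followed by an upward step shift of
# 1.5 standard deviations starting at time 20.
residuals = [0.0] * 20 + [1.5] * 10
path, signal = cusum_upper(residuals)
```

In this synthetic run the statistic stays at zero while the process is in control and accumulates by 1.0 per period after the shift, signaling a few periods after the change point.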

  4. Sharing real-time data originating from connected devices is crucial to real-world Internet of Things (IoT) applications, especially those using artificial intelligence/machine learning (AI/ML). Such IoT data are typically shared with multiple parties for different purposes based on data contracts. However, supporting these contracts under dynamic changes in IoT data variety and velocity faces many challenges when such parties (aka tenants) want to obtain data based on its value to their specific contextual purposes. This work proposes a novel dynamic context-based policy enforcement framework to support IoT data sharing based on dynamic contracts. Our enforcement framework allows IoT Data Hub owners to define extensible rules and metrics to govern the tenants in accessing the shared data on the Edge based on policies defined in static and dynamic contexts. For example, given a change of situation, we can define and enforce a policy that allows pushing data to some tenants via a third-party means, while typically these tenants must obtain and process the data via a pre-defined means. We have developed a proof-of-concept prototype for sharing sensitive data, such as surveillance camera videos, to illustrate our proposed framework. Our experimental results demonstrated that our framework can soundly enforce context-based policies at runtime in a timely manner, with moderate overhead. Moreover, context and policy changes are correctly reflected in the system in near real time.
  5. The livestock industry produces large amounts of multi-scale data (pathogen-, animal-, site-, system-, and regional-level) daily from different sources, such as diagnostic laboratories, trade and production records, and management and environmental monitoring systems; however, all these data are still presented and used separately and are largely underutilized for timely (i.e., near-real-time) livestock health decisions. Recent advances in the automation of data capture, standardization, multi-scale integration, and sharing/communication (i.e., the Internet of Things), as well as in the development of novel data mining, analytical, and visualization capabilities specifically adapted to the livestock industry, are dramatically changing this paradigm. As a result, we expect vertical advances in the way we prevent and manage livestock diseases both locally and globally. Our team at the Center for Animal Disease Modeling and Surveillance (CADMS), in collaboration with researchers at Iowa State University and industry leaders at Boehringer Ingelheim and GlobalVetLINK, has been working in an exceptional research-industry partnership to develop key data connections and novel Big Data capabilities within the Disease BioPortal. This web-based platform includes automation of diagnostic interpretations and facilitates the combined analysis of health, production, and trade data using novel space-time-genomic visualization and data mining tools. Access to confidential databases is individually granted, with different levels of secure access, visualization, and editing capabilities for participating producers, labs, veterinarians, and other stakeholders. Each user can create and share customized dashboards and reports to inform risk-based, more cost-effective decisions at the site, system, or regional level. Here we provide practical examples of applications in the swine, poultry, and aquaculture industries.
We hope to contribute to the more coordinated and effective prevention and control of infectious diseases locally and globally. 