Title: Estimating spread of contact-based contagions in a population through sub-sampling
Various phenomena such as viruses, gossip, and physical objects (e.g., packages and marketing pamphlets) can spread through physical contact. The spread depends on how people move, i.e., their mobility patterns. In practice, the mobility patterns of an entire population are never available, and we usually have access to location data for only a subset of individuals. In this paper, we formalize and study the problem of estimating the spread of a phenomenon in a population, given access only to sub-samples of the location visits of some individuals in the population. We show that simple solutions that estimate the spread in the sub-sample and scale it to the population, and more sophisticated solutions that rely on modeling the location visits of individuals, do not perform well in practice. Instead, we directly model the co-locations between individuals. We introduce PollSpreader and PollSusceptible, two novel approaches that model the co-locations between individuals using a contact network and infer the properties of the contact network from the sub-sample to estimate the spread of the phenomenon in the entire population. We analytically show that our estimates provide an upper bound and a lower bound on the spread of the disease in expectation. Finally, using a large high-resolution real-world mobility dataset, we experimentally show that our estimates are accurate in practice, while other methods that do not correctly account for co-locations between individuals result in entirely wrong observations (e.g., premature prediction of herd immunity).
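The following is a minimal illustrative sketch, not the paper's PollSpreader or PollSusceptible method, of one reason that scaling sub-sample observations linearly to the population can mislead. It assumes a hypothetical setting in which contacts are uniformly random pairs and each individual is included in the sub-sample independently with probability p. A contact is visible only when both individuals are sampled, so roughly a p² fraction of contacts is observed rather than a p fraction. All function names and parameters are made up for illustration.

```python
import random


def random_contacts(n, m, rng):
    """Return m distinct undirected contact pairs among n individuals."""
    pairs = set()
    while len(pairs) < m:
        a, b = rng.sample(range(n), 2)
        pairs.add((min(a, b), max(a, b)))
    return pairs


def observed_contacts(pairs, sampled):
    """A contact is visible only if both of its endpoints are in the sub-sample."""
    return [(a, b) for a, b in pairs if a in sampled and b in sampled]


def demo(n=2000, m=10000, p=0.2, seed=0):
    """Compare linear and pairwise scale-up of the observed contact count."""
    rng = random.Random(seed)
    pairs = random_contacts(n, m, rng)
    # Assumption: each individual is sampled independently with probability p.
    sampled = {i for i in range(n) if rng.random() < p}
    seen = len(observed_contacts(pairs, sampled))
    q = len(sampled) / n  # realized sampling fraction
    return {
        "true": m,
        "seen": seen,
        "linear_scale": seen / q,       # scales by 1/q: underestimates
        "pair_scale": seen / (q * q),   # scales by 1/q^2: close to the truth
    }


if __name__ == "__main__":
    print(demo())
```

Under these assumptions the linearly scaled count falls far below the true number of contacts, while scaling by the squared sampling fraction lands near it. Real mobility data is not uniformly random, which is part of why the abstract argues for modeling the contact network directly.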
Award ID(s):
2125530 2027794
NSF-PAR ID:
10332821
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
14
Issue:
9
ISSN:
2150-8097
Page Range / eLocation ID:
1557 to 1569
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    Controlling the spread of infectious diseases―even when safe, transmission-blocking vaccines are available―may require the effective use of non-pharmaceutical interventions (NPIs), e.g., mask wearing, testing, limits on group sizes, venue closure. During the SARS-CoV-2 pandemic, many countries implemented NPIs inconsistently in space and time. This inconsistency was especially pronounced for policies in the United States of America (US) related to venue closure.

    Methods

    Here, we investigate the impact of inconsistent policies associated with venue closure using mathematical modeling and high-resolution human mobility, Google search, and county-level SARS-CoV-2 incidence data from the USA. Specifically, we look at high-resolution location data and perform a US-county-level analysis of nearly 8 million SARS-CoV-2 cases and 150 million location visits, including 120 million church visitors across 184,677 churches, 14 million grocery visitors across 7662 grocery stores, and 13.5 million gym visitors across 5483 gyms.

    Results

    Analyzing the interaction between venue closure and changing mobility using a mathematical model shows that, across a broad range of model parameters, inconsistent or partial closure can be worse in terms of disease transmission than scenarios with no closures at all. Importantly, changes in mobility patterns due to epidemic control measures can lead to an increase in the future number of cases. In the most severe cases, individuals traveling to neighboring jurisdictions with different closure policies can result in an outbreak that would otherwise have been contained. To motivate our mathematical models, we turn to mobility data and find that while stay-at-home orders and closures decreased contacts in most areas of the USA, some specific activities and venues saw an increase in attendance and an increase in the distance visitors traveled to attend. We support this finding using search query data, which clearly shows a shift in information seeking behavior concurrent with the changing mobility patterns.

    Conclusions

    While coarse-grained observations are not sufficient to validate our models, taken together, they highlight the potential unintended consequences of inconsistent epidemic control policies related to venue closure and stress the importance of balancing the societal needs of a population with the risk of an outbreak growing into a large epidemic.

     
  2. Abstract COVID-19, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a communicable disease spread through close contact. It is known to disproportionately impact certain communities due to both biological susceptibility and inequitable exposure. In this study, we investigate the most important health, social, and environmental factors impacting the early phases (before July 2020) of per capita COVID-19 transmission and per capita all-cause mortality in US counties. We aggregate county-level physical and mental health, environmental pollution, access to health care, demographic characteristics, vulnerable population scores, and other epidemiological data to create a large feature set to analyze per capita COVID-19 outcomes. Because of the high dimensionality, multicollinearity, and unknown interactions of the data, we use ensemble machine learning and marginal prediction methods to identify the most salient factors associated with several COVID-19 outbreak measures. Our variable importance results show that measures of ethnicity, public transportation, and preventable diseases are the strongest predictors for both per capita COVID-19 incidence and mortality. Specifically, the CDC measures for minority populations, CDC measures for limited English, and proportion of Black- and/or African-American individuals in a county were the most important features for per capita COVID-19 cases within a month after the pandemic started in a county and also at the latest date examined. For per capita all-cause mortality at day 100 and total to date, we find that public transportation use and proportion of Black- and/or African-American individuals in a county are the strongest predictors.
The methods predict that, with all other factors held fixed at their observed values, a 10% increase in public transportation use is associated with an increase in mortality at day 100 of 2012 individuals (95% CI [1972, 2356]), and likewise a 10% increase in the proportion of Black- and/or African-American individuals in a county is associated with an increase in total deaths at the end of the study of 2067 (95% CI [1189, 2654]). Using data until the end of the study, the same metric suggests ethnicity has double the association of the next most important factors, which are location, disease prevalence, and transit factors. Our findings shed light on societal patterns that have been reported and experienced in the U.S. by using robust methods to understand the features most responsible for transmission and the sectors of society most vulnerable to infection and mortality. In particular, our results provide evidence of the disproportionate impact of the COVID-19 pandemic on minority populations. Our results suggest that mitigation measures, including vaccine distribution, could have the greatest impact if they prioritize the highest-risk communities.
  3. Abstract The objective of this study is to examine the transmission risk of COVID-19 based on cross-county population co-location data from Facebook. The rapid spread of COVID-19 in the United States has posed a major threat to public health, the real economy, and human well-being. In the absence of effective vaccines, the preventive actions of social distancing, travel reduction, and stay-at-home orders are recognized as essential non-pharmacologic approaches to control the infection and spatial spread of COVID-19. Prior studies demonstrated that human movement and mobility drove the spatiotemporal distribution of COVID-19 in China. Little is known, however, about the patterns and effects of co-location reduction on cross-county transmission risk of COVID-19. This study utilizes Facebook co-location data for all counties in the United States from March to early May 2020 to conduct a spatial network analysis in which nodes represent counties and edge weights are associated with the co-location probability of the populations of the counties. The analysis examines the synchronicity and time lag between travel reduction and pandemic growth trajectory to evaluate the efficacy of social distancing in reducing population co-location probabilities and, subsequently, the growth in weekly new cases across counties. The results show that the mitigation effects of co-location reduction appear in the growth of weekly new confirmed cases with a one-week delay. The analysis categorizes counties based on the number of confirmed COVID-19 cases and examines co-location patterns within and across groups. Significant segregation is found among different county groups.
The results suggest that within-group co-location probabilities (e.g., co-location probabilities among counties with high numbers of cases) remain stable, and social distancing policies primarily resulted in reduced cross-group co-location probabilities (due to travel reduction from counties with large numbers of cases to counties with low numbers of cases). These findings could have important practical implications for local governments to inform their intervention measures for monitoring and reducing the spread of COVID-19, as well as for adoption in future pandemics. Public policy, economic forecasting, and epidemic modeling need to account for population co-location patterns in evaluating transmission risk of COVID-19 across counties.
  4. Abstract

    Lakes are conduits of greenhouse gases to the atmosphere; however, most efflux estimates for individual lakes are based on extrapolations from a limited number of locations. Within‐lake variability in carbon dioxide (CO2) and methane (CH4) arises from differences in water sources, mixing, atmospheric exchange, and biogeochemical transformations, all of which vary across multiple temporal and spatial scales. We asked, how variable are CO2 and CH4 across the surface of a single lake, how do spatial patterns change seasonally, and how well does the typical sampling location represent the entire lake surface? During the 2016 ice‐free period, we mapped surface water concentrations of CO2 and CH4 approximately weekly in Lake Mendota (USA) and modeled diffusive gas exchange. During stratification, CO2 was generally lower than atmospheric saturation (mean 19.81 μM) and relatively homogenous (mean coefficient of variation 0.12), whereas CH4 was routinely extremely supersaturated (mean 0.29 μM) with greater spatial heterogeneity (mean coefficient of variation 0.65). During fall mixis, concentrations of both gases increased and became more spatially variable, but their spatial arrangements differed. In this system, samples collected from the lake center reasonably well represented the spatially weighted mean CO2 concentration but overestimated annual CO2 efflux by 21%. For CH4, the lake center underestimated annual diffusive efflux by only 8.6% but poorly represented lakewide concentrations and fluxes on any given day. Upscaling from a single site to the whole lake requires consideration of spatial variation to assess lakewide carbon dynamics due to heterogeneity in within‐lake processing, transport to the lake surface, and exchange with the atmosphere.

     
  5. Abstract This project is funded by the US National Science Foundation (NSF) through their NSF RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health and health systems and on providers of essential community services. Modeling the Covid-19 pandemic spread is challenging. But there are data that can be used to project resource demands. Estimates of the reproductive number (R) of SARS-CoV-2 show that at the beginning of the epidemic, each infected person spreads the virus to at least two others, on average (Emanuel et al. in N Engl J Med. 2020, Livingston and Bucher in JAMA 323(14):1335, 2020). A conservatively low estimate is that 5 % of the population could become infected within 3 months. Preliminary data from China and Italy regarding the distribution of case severity and fatality vary widely (Wu and McGoogan in JAMA 323(13):1239–42, 2020). A recent large-scale analysis from China suggests that 80 % of those infected either are asymptomatic or have mild symptoms; a finding that implies that demand for advanced medical services might apply to only 20 % of the total infected. 
Of patients infected with Covid-19, about 15 % have severe illness and 5 % have critical illness (Emanuel et al. in N Engl J Med. 2020). Overall, mortality ranges from 0.25 % to as high as 3.0 % (Emanuel et al. in N Engl J Med. 2020, Wilson et al. in Emerg Infect Dis 26(6):1339, 2020). Case fatality rates are much higher for vulnerable populations, such as persons over the age of 80 years (> 14 %) and those with coexisting conditions (10 % for those with cardiovascular disease and 7 % for those with diabetes) (Emanuel et al. in N Engl J Med. 2020). Overall, Covid-19 is substantially deadlier than seasonal influenza, which has a mortality of roughly 0.1 %. Public health efforts depend heavily on predicting how diseases such as Covid-19 spread across the globe. During the early days of a new outbreak, when reliable data are still scarce, researchers turn to mathematical models that can predict where people who could be infected are going and how likely they are to bring the disease with them. These computational methods use known statistical equations that calculate the probability of individuals transmitting the illness. Modern computational power allows these models to quickly incorporate multiple inputs, such as a given disease’s ability to pass from person to person and the movement patterns of potentially infected people traveling by air and land. This process sometimes involves making assumptions about unknown factors, such as an individual’s exact travel pattern. By plugging in different possible versions of each input, however, researchers can update the models as new information becomes available and compare their results to observed patterns for the illness. In this paper we describe the development of a model of Corona spread using innovative big data analytics techniques and tools. We leveraged our experience from research in modeling Ebola spread (Shaw et al. Modeling Ebola Spread and Using HPCC/KEL System.
In: Big Data Technologies and Applications 2016 (pp. 347-385). Springer, Cham) to successfully model Corona spread, obtain new results, and help reduce the number of Corona patients. We closely collaborated with LexisNexis, which is a leading US data analytics company and a member of our NSF I/UCRC for Advanced Knowledge Enablement. The lack of a comprehensive view and informative analysis of the status of the pandemic can also cause panic and instability within society. Our work proposes the HPCC Systems Covid-19 tracker, which provides a timely, multi-level view of the pandemic with informative virus spreading indicators. The system embeds a classical epidemiological model known as SIR and spreading indicators based on a causal model. The data solution of the tracker is built on top of the Big Data processing platform HPCC Systems, from ingesting and tracking of various data sources to fast delivery of the data to the public. The HPCC Systems Covid-19 tracker presents the Covid-19 data on a daily, weekly, and cumulative basis, from the global level down to the county level. It also provides statistical analysis for each level, such as new cases per 100,000 population. The primary analysis, such as Contagion Risk and Infection State, is based on a causal model with a seven-day sliding window. Our work has been released as a publicly available website and has attracted a great volume of traffic. The project is open-sourced and available on GitHub. The system was developed on the LexisNexis HPCC Systems, which is briefly described in the paper.
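For readers unfamiliar with the SIR model mentioned above, the following is a minimal sketch of the classical susceptible-infected-recovered dynamics, integrated with a simple forward Euler step. It is not the HPCC Systems tracker's implementation; the function name, the parameters `beta` (transmission rate) and `gamma` (recovery rate), and the example values are assumptions chosen only for illustration.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, dt=0.1):
    """Integrate the classical SIR equations with forward Euler steps.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I

    Returns a list of (S, I, R) tuples, one per day, starting with day 0.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    steps_per_day = int(round(1 / dt))
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical parameters: basic reproduction number beta / gamma = 2.
    run = simulate_sir(population=100_000, initial_infected=10,
                       beta=0.4, gamma=0.2, days=200)
    peak_day = max(range(len(run)), key=lambda d: run[d][1])
    print("peak infected on day", peak_day)
```

With beta / gamma above 1 the infected count rises to a peak and then declines as the susceptible pool is depleted; with a ratio below 1 the outbreak dies out without a peak.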