

Title: Ensemble machine learning of factors influencing COVID-19 across US counties
Abstract Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causal agent of COVID-19, is spread through close contact. It is known to disproportionately impact certain communities due to both biological susceptibility and inequitable exposure. In this study, we investigate the most important health, social, and environmental factors impacting the early phases (before July 2020) of per capita COVID-19 transmission and per capita all-cause mortality in US counties. We aggregate county-level physical and mental health, environmental pollution, access to health care, demographic characteristics, vulnerable population scores, and other epidemiological data to create a large feature set to analyze per capita COVID-19 outcomes. Because of the high dimensionality, multicollinearity, and unknown interactions of the data, we use ensemble machine learning and marginal prediction methods to identify the most salient factors associated with several COVID-19 outbreak measures. Our variable importance results show that measures of ethnicity, public transportation, and preventable diseases are the strongest predictors for both per capita COVID-19 incidence and mortality. Specifically, the CDC measures for minority populations, CDC measures for limited English, and proportion of Black- and/or African-American individuals in a county were the most important features for per capita COVID-19 cases within a month after the pandemic started in a county and also at the latest date examined. For per capita all-cause mortality at day 100 and total to date, we find that public transportation use and proportion of Black- and/or African-American individuals in a county are the strongest predictors.
The methods predict that, with all other factors held fixed at their observed values, a 10% increase in public transportation use is associated with an increase in mortality at day 100 of 2012 individuals (95% CI [1972, 2356]); likewise, a 10% increase in the proportion of Black- and/or African-American individuals in a county is associated with an increase in total deaths at the end of the study of 2067 (95% CI [1189, 2654]). Using data until the end of the study, the same metric suggests ethnicity has twice the association of the next most important factors, which are location, disease prevalence, and transit factors. Our findings shed light on societal patterns that have been reported and experienced in the U.S. by using robust methods to understand the features most responsible for transmission and the sectors of society most vulnerable to infection and mortality. In particular, our results provide evidence of the disproportionate impact of the COVID-19 pandemic on minority populations. Our results suggest that mitigation measures, including how vaccines are distributed, could have the greatest impact if priority is given to the highest-risk communities.
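The abstract's two analysis ideas, variable importance and marginal prediction with all other factors held at their observed values, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `toy_model` is a hypothetical stand-in for a fitted ensemble, the feature names and county values are invented, the "10% increase" is treated as a relative change, and permutation importance is one common importance measure, not necessarily the one used in the study.

```python
import random


def toy_model(row):
    """Hypothetical stand-in for a fitted ensemble: predicted deaths for one county."""
    return 5.0 + 40.0 * row["transit_share"] + 10.0 * row["chronic_disease"]


def marginal_shift(predict, rows, feature, rel_increase=0.10):
    """Total change in predictions when `feature` is raised by `rel_increase`
    in every county while all other features stay at their observed values."""
    total = 0.0
    for row in rows:
        shifted = dict(row)
        shifted[feature] = row[feature] * (1.0 + rel_increase)
        total += predict(shifted) - predict(row)
    return total


def permutation_importance(predict, rows, outcomes, feature, seed=0):
    """Increase in mean squared error after shuffling one feature across counties."""
    def mse(data):
        return sum((predict(r) - y) ** 2 for r, y in zip(data, outcomes)) / len(data)

    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
    return mse(permuted) - mse(rows)


# Invented county records, for illustration only.
counties = [
    {"transit_share": 0.02, "chronic_disease": 0.30},
    {"transit_share": 0.10, "chronic_disease": 0.25},
    {"transit_share": 0.30, "chronic_disease": 0.20},
]
observed = [toy_model(c) for c in counties]

extra_deaths = marginal_shift(toy_model, counties, "transit_share")
importance = permutation_importance(toy_model, counties, observed, "transit_share")
```

In practice the confidence interval around such a marginal estimate would come from a resampling or influence-based procedure, which this sketch does not attempt.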
Award ID(s):
2032264
NSF-PAR ID:
10329390
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Scientific Reports
Volume:
11
Issue:
1
ISSN:
2045-2322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Importance

    Marked elevations in levels of depressive symptoms compared with historical norms have been described during the COVID-19 pandemic, and understanding the extent to which these are associated with diminished in-person social interaction could inform public health planning for future pandemics or other disasters.

    Objective

    To describe the association between living in a US county with diminished mobility during the COVID-19 pandemic and self-reported depressive symptoms, while accounting for potential local and state-level confounding factors.

    Design, Setting, and Participants

    This survey study used 18 waves of a nonprobability internet survey conducted in the United States between May 2020 and April 2022. Participants included respondents who were 18 years and older and lived in 1 of the 50 US states or Washington DC.

    Main Outcome and Measure

    Depressive symptoms measured by the Patient Health Questionnaire-9 (PHQ-9); county-level community mobility estimates from mobile apps; COVID-19 policies at the US state level from the Oxford stringency index.

    Results

    The 192 271 survey respondents had a mean (SD) age of 43.1 (16.5) years, and 768 (0.4%) were American Indian or Alaska Native individuals, 11 448 (6.0%) were Asian individuals, 20 277 (10.5%) were Black individuals, 15 036 (7.8%) were Hispanic individuals, 1975 (1.0%) were Pacific Islander individuals, 138 702 (72.1%) were White individuals, and 4065 (2.1%) were individuals of another race. Additionally, 126 381 respondents (65.7%) identified as female and 65 890 (34.3%) as male. Mean (SD) depression severity by PHQ-9 was 7.2 (6.8). In a mixed-effects linear regression model, the mean county-level proportion of individuals not leaving home was associated with a greater level of depression symptoms (β, 2.58; 95% CI, 1.57-3.58) after adjustment for individual sociodemographic features. Results were similar after the inclusion in regression models of local COVID-19 activity, weather, and county-level economic features, and persisted after widespread availability of COVID-19 vaccination. They were attenuated by the inclusion of state-level pandemic restrictions. Two restrictions, mandatory mask-wearing in public (β, 0.23; 95% CI, 0.15-0.30) and policies cancelling public events (β, 0.37; 95% CI, 0.22-0.51), demonstrated modest independent associations with depressive symptom severity.

    Conclusions and Relevance

    In this study, depressive symptoms were greater in locales and times with diminished community mobility. Strategies to understand the potential public health consequences of pandemic responses are needed.

     
  2. Turner, Richard (Ed.)
    Background With the availability of multiple Coronavirus Disease 2019 (COVID-19) vaccines and the predicted shortages in supply for the near future, it is necessary to allocate vaccines in a manner that minimizes severe outcomes, particularly deaths. To date, vaccination strategies in the United States have focused on individual characteristics such as age and occupation. Here, we assess the utility of population-level health and socioeconomic indicators as additional criteria for geographical allocation of vaccines. Methods and findings County-level estimates of 14 indicators associated with COVID-19 mortality were extracted from public data sources. Effect estimates of the individual indicators were calculated with univariate models. Presence of spatial autocorrelation was established using Moran’s I statistic. Spatial simultaneous autoregressive (SAR) models that account for spatial autocorrelation in response and predictors were used to assess (i) the proportion of variance in county-level COVID-19 mortality that can be explained by identified health/socioeconomic indicators (R²); and (ii) effect estimates of each predictor. Adjusting for case rates, the selected indicators individually explain 24%–29% of the variability in mortality. Prevalence of chronic kidney disease and proportion of population residing in nursing homes have the highest R². Mortality is estimated to increase by 43 per thousand residents (95% CI: 37–49; p < 0.001) with a 1% increase in the prevalence of chronic kidney disease and by 39 deaths per thousand (95% CI: 34–44; p < 0.001) with a 1% increase in population living in nursing homes. SAR models using multiple health/socioeconomic indicators explain 43% of the variability in COVID-19 mortality in US counties, adjusting for case rates. R² was found not to be sensitive to the choice of SAR model form.
Study limitations include the use of mortality rates that are not age standardized, a spatial adjacency matrix that does not capture human flows among counties, and insufficient accounting for interaction among predictors. Conclusions Significant spatial autocorrelation exists in COVID-19 mortality in the US, and population health/socioeconomic indicators account for considerable variability in county-level mortality. In the context of vaccine rollout in the US and globally, national and subnational estimates of burden of disease could inform optimal geographical allocation of vaccines.
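    Moran's I, the statistic this abstract uses to establish spatial autocorrelation, can be computed directly from a set of values and a spatial weight matrix. The sketch below is a plain-Python illustration of the standard formula, I = (n / W) · Σᵢⱼ wᵢⱼ (xᵢ − x̄)(xⱼ − x̄) / Σᵢ (xᵢ − x̄)², where W is the sum of all weights. The four-county chain and its values are invented, and the binary adjacency weights are an assumption; the abstract does not state which weighting scheme was used.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a square spatial weight matrix `weights`,
    where weights[i][j] > 0 marks counties i and j as neighbours."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]

    total_weight = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    sum_squares = sum(d * d for d in deviations)
    return (n / total_weight) * (cross_products / sum_squares)


# Four hypothetical counties arranged in a line; each neighbours the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1.0, 2.0, 3.0, 4.0], chain)     # similar values sit together
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # neighbours always differ
```

Positive values indicate that neighbouring counties tend to have similar mortality, negative values that they tend to differ. Assessing significance would additionally need a permutation or analytical test, which is omitted here.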
  3. Importance Prior research has established that Hispanic and non-Hispanic Black residents in the US experienced substantially higher COVID-19 mortality rates in 2020 than non-Hispanic White residents owing to structural racism. In 2021, these disparities decreased. Objective To assess to what extent national decreases in racial and ethnic disparities in COVID-19 mortality between the initial pandemic wave and subsequent Omicron wave reflect reductions in mortality vs other factors, such as the pandemic’s changing geography. Design, Setting, and Participants This cross-sectional study was conducted using data from the US Centers for Disease Control and Prevention for COVID-19 deaths from March 1, 2020, through February 28, 2022, among adults aged 25 years and older residing in the US. Deaths were examined by race and ethnicity across metropolitan and nonmetropolitan areas, and the national decrease in racial and ethnic disparities between initial and Omicron waves was decomposed. Data were analyzed from June 2021 through March 2023. Exposures Metropolitan vs nonmetropolitan areas and race and ethnicity. Main Outcomes and Measures Age-standardized death rates. Results There were death certificates for 977 018 US adults aged 25 years and older (mean [SD] age, 73.6 [14.6] years; 435 943 female [44.6%]; 156 948 Hispanic [16.1%], 140 513 non-Hispanic Black [14.4%], and 629 578 non-Hispanic White [64.4%]) that included a mention of COVID-19. The proportion of COVID-19 deaths among adults residing in nonmetropolitan areas increased from 5944 of 110 526 deaths (5.4%) during the initial wave to a peak of 40 360 of 172 515 deaths (23.4%) during the Delta wave; the proportion was 45 183 of 210 554 deaths (21.5%) during the Omicron wave. The national disparity in age-standardized COVID-19 death rates per 100 000 person-years for non-Hispanic Black compared with non-Hispanic White adults decreased from 339 to 45 deaths from the initial to Omicron wave, or by 293 deaths. 
After standardizing for age and racial and ethnic differences by metropolitan vs nonmetropolitan residence, increases in death rates among non-Hispanic White adults explained 120 deaths/100 000 person-years of the decrease (40.7%); 58 deaths/100 000 person-years in the decrease (19.6%) were explained by shifts in mortality to nonmetropolitan areas, where a disproportionate share of non-Hispanic White adults reside. The remaining 116 deaths/100 000 person-years in the decrease (39.6%) were explained by decreases in death rates in non-Hispanic Black adults. Conclusions and Relevance This study found that most of the national decrease in racial and ethnic disparities in COVID-19 mortality between the initial and Omicron waves was explained by increased mortality among non-Hispanic White adults and changes in the geographic spread of the pandemic. These findings suggest that despite media reports of a decline in disparities, there is a continued need to prioritize racial health equity in the pandemic response. 
  4. Background Digital surveillance tools and health informatics show promise in counteracting diseases but have limited uptake. A notable illustration of the limits of such tools is the general failure of digital contact tracing in the United States in response to COVID-19. Objective We investigated the associations between individual characteristics and the willingness to use app-based contact tracing in Detroit, a majority-minority city that experienced multiple waves of COVID-19 outbreaks and deaths since the start of the pandemic. The aim of this study was to examine variations among residents in the willingness to download a contact tracing app on their phones to provide public health officials with information about close COVID-19 contact during summer 2020. Methods To examine residents’ willingness to participate in digital contact tracing, we analyzed data from 2 waves of the Detroit Metro Area Communities Study, a population-based survey of Detroit, Michigan residents. The data captured 1873 responses from 991 Detroit residents collected in June and July 2020. We estimated a series of multilevel logit models to gain insights into differences in the willingness to participate in digital contact tracing across a variety of individual attributes, including race/ethnicity, degree of trust in the government, and level of education, as well as interactions among these variables. Results Our results reflected widespread reluctance to participate in digital contact tracing in response to COVID-19, as less than half (826/1873, 44.1%) of the respondents said they would be willing to participate in app-based contact tracing. Compared to White respondents, Black (odds ratio [OR] 0.45, 95% CI 0.23-0.86) and Latino (OR 0.32, 95% CI 0.11-0.99) respondents were significantly less willing to participate in digital contact tracing. 
Trust in the government was positively associated with the willingness to participate in digital contact tracing (OR 1.17, 95% CI 1.07-1.27), but this effect was the strongest for White residents (OR 2.14, 95% CI 1.55-2.93). We found similarly divergent patterns of the effects of education by race. While there were no significant differences among noncollege-educated residents, White college-educated residents showed greater willingness to use app-based contact tracing (OR 6.12, 95% CI 1.86-20.15) and Black college-educated residents showed less willingness (OR 0.46, 95% CI 0.26-0.81). Conclusions Trust in the government and education contribute to Detroit residents’ wariness of digital contact tracing, reflecting concerns about surveillance that cut across race but likely arise from different sources. These findings point to the importance of a culturally informed understanding of health hesitancy for future efforts hoping to leverage digital contact tracing. Though contact tracing technologies have the potential to advance public health, unequal uptake may exacerbate disparate impacts of health crises. 
  5. Abstract This project is funded by the US National Science Foundation (NSF) through their NSF RAPID program under the title “Modeling Corona Spread Using Big Data Analytics.” The project is a joint effort between the Department of Computer & Electrical Engineering and Computer Science at FAU and a research group from LexisNexis Risk Solutions. The novel coronavirus Covid-19 originated in China in early December 2019 and has rapidly spread to many countries around the globe, with the number of confirmed cases increasing every day. Covid-19 is officially a pandemic. It is a novel infection with serious clinical manifestations, including death, and it has reached at least 124 countries and territories. Although the ultimate course and impact of Covid-19 are uncertain, it is not merely possible but likely that the disease will produce enough severe illness to overwhelm the worldwide health care infrastructure. Emerging viral pandemics can place extraordinary and sustained demands on public health and health systems and on providers of essential community services. Modeling the Covid-19 pandemic spread is challenging. But there are data that can be used to project resource demands. Estimates of the reproductive number (R) of SARS-CoV-2 show that at the beginning of the epidemic, each infected person spreads the virus to at least two others, on average (Emanuel et al. in N Engl J Med. 2020, Livingston and Bucher in JAMA 323(14):1335, 2020). A conservatively low estimate is that 5 % of the population could become infected within 3 months. Preliminary data from China and Italy regarding the distribution of case severity and fatality vary widely (Wu and McGoogan in JAMA 323(13):1239–42, 2020). A recent large-scale analysis from China suggests that 80 % of those infected either are asymptomatic or have mild symptoms; a finding that implies that demand for advanced medical services might apply to only 20 % of the total infected. 
Of patients infected with Covid-19, about 15 % have severe illness and 5 % have critical illness (Emanuel et al. in N Engl J Med. 2020). Overall, mortality ranges from 0.25 % to as high as 3.0 % (Emanuel et al. in N Engl J Med. 2020, Wilson et al. in Emerg Infect Dis 26(6):1339, 2020). Case fatality rates are much higher for vulnerable populations, such as persons over the age of 80 years (> 14 %) and those with coexisting conditions (10 % for those with cardiovascular disease and 7 % for those with diabetes) (Emanuel et al. in N Engl J Med. 2020). Overall, Covid-19 is substantially deadlier than seasonal influenza, which has a mortality of roughly 0.1 %. Public health efforts depend heavily on predicting how diseases such as Covid-19 spread across the globe. During the early days of a new outbreak, when reliable data are still scarce, researchers turn to mathematical models that can predict where people who could be infected are going and how likely they are to bring the disease with them. These computational methods use known statistical equations that calculate the probability of individuals transmitting the illness. Modern computational power allows these models to quickly incorporate multiple inputs, such as a given disease’s ability to pass from person to person and the movement patterns of potentially infected people traveling by air and land. This process sometimes involves making assumptions about unknown factors, such as an individual’s exact travel pattern. By plugging in different possible versions of each input, however, researchers can update the models as new information becomes available and compare their results to observed patterns for the illness. In this paper we describe the development of a model of Corona spread using innovative big data analytics techniques and tools. We leveraged our experience from research in modeling Ebola spread (Shaw et al. Modeling Ebola Spread and Using HPCC/KEL System. 
In: Big Data Technologies and Applications 2016 (pp. 347-385). Springer, Cham) to model Corona spread, with the aim of obtaining new results and helping to reduce the number of Corona patients. We closely collaborated with LexisNexis, which is a leading US data analytics company and a member of our NSF I/UCRC for Advanced Knowledge Enablement. The lack of a comprehensive view and informative analysis of the status of the pandemic can also cause panic and instability within society. Our work proposes the HPCC Systems Covid-19 tracker, which provides a timely, multi-level view of the pandemic with informative virus spreading indicators. The system embeds a classical epidemiological model known as SIR and spreading indicators based on a causal model. The data solution of the tracker is built on top of the Big Data processing platform HPCC Systems, from ingesting and tracking of various data sources to fast delivery of the data to the public. The HPCC Systems Covid-19 tracker presents the Covid-19 data on a daily, weekly, and cumulative basis, from the global level down to the county level. It also provides statistical analysis for each level, such as new cases per 100,000 population. The primary analysis, such as Contagion Risk and Infection State, is based on a causal model with a seven-day sliding window. Our work has been released as a publicly available website and has attracted a great volume of traffic. The project is open-sourced and available on GitHub. The system was developed on the LexisNexis HPCC Systems, which is briefly described in the paper.
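    For readers unfamiliar with the SIR model mentioned above, the following Python sketch shows how the classical susceptible-infected-recovered equations can be stepped forward in time, together with a trailing seven-day count of new cases per 100,000 population. It is an illustrative sketch only and not the tracker's implementation: the function names, the forward-Euler scheme, the parameter values (chosen so that beta / gamma = 2, in line with the reproductive number of at least two quoted in the abstract), and the pairing of the seven-day window with the per-100,000 rate are all assumptions.

```python
def simulate_sir(population, initial_infected, beta, gamma, days, steps_per_day=10):
    """Forward-Euler integration of the classical SIR equations.
    Returns one (S, I, R) tuple per day, starting with day 0."""
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    dt = 1.0 / steps_per_day
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / population * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        history.append((s, i, r))
    return history


def new_cases_per_100k(daily_new_cases, population, window=7):
    """Trailing-window sum of new cases, scaled per 100,000 residents."""
    rates = []
    for day in range(len(daily_new_cases)):
        start = max(0, day - window + 1)
        window_total = sum(daily_new_cases[start:day + 1])
        rates.append(window_total * 100_000 / population)
    return rates


# Hypothetical county of 100,000 residents with 10 initial infections.
history = simulate_sir(population=100_000, initial_infected=10,
                       beta=0.4, gamma=0.2, days=30)
weekly_rate = new_cases_per_100k([10] * 7, population=100_000)
```

A smaller time step or a higher-order integrator would give a more accurate trajectory; forward Euler is used here only to keep the update rule visible.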