
Title: Hierarchical multivariate directed acyclic graph autoregressive models for spatial diseases mapping

Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit more similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop multivariate directed acyclic graphical autoregression models to accommodate spatial and inter‐disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter‐disease relationships, and computational ease, but depends upon the order in which the cancers are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results program.
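
As a rough illustration of averaging across orders, the sketch below turns log marginal likelihoods for each disease ordering into posterior order probabilities under a uniform prior, then averages an order-specific estimate. It assumes the log marginal likelihoods have already been estimated (for example, by bridge sampling). The function names, the disease labels, and every number are hypothetical and are not taken from the paper.

```python
import math
from itertools import permutations


def posterior_order_probs(log_marginals, log_priors=None):
    """Posterior probability of each ordering: p(M | y) is proportional to p(y | M) p(M)."""
    orders = list(log_marginals)
    if log_priors is None:
        # Assumption: uniform prior over orderings.
        log_priors = {o: 0.0 for o in orders}
    scores = [log_marginals[o] + log_priors[o] for o in orders]
    # Log-sum-exp keeps the normalization stable for large negative values.
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return {o: math.exp(s - log_norm) for o, s in zip(orders, scores)}


def model_average(estimates, probs):
    """Weight each order-specific estimate by its posterior order probability."""
    return sum(probs[o] * estimates[o] for o in probs)


# Hypothetical inputs: three diseases give 3! = 6 candidate orderings.
orders = list(permutations(("A", "B", "C")))
log_ml = {o: -1000.0 - 0.5 * i for i, o in enumerate(orders)}  # made-up values
estimates = {o: 0.1 * i for i, o in enumerate(orders)}         # made-up values

probs = posterior_order_probs(log_ml)
averaged = model_average(estimates, probs)
```

Working on the log scale matters here because marginal likelihoods for areal models are typically far too small to exponentiate directly.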

Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Statistics in Medicine
Medium: X Size: p. 3057-3075
Sponsoring Org:
National Science Foundation
More Like this
  1. Turner, Richard (Ed.)
    Background With the availability of multiple Coronavirus Disease 2019 (COVID-19) vaccines and the predicted shortages in supply for the near future, it is necessary to allocate vaccines in a manner that minimizes severe outcomes, particularly deaths. To date, vaccination strategies in the United States have focused on individual characteristics such as age and occupation. Here, we assess the utility of population-level health and socioeconomic indicators as additional criteria for geographical allocation of vaccines. Methods and findings County-level estimates of 14 indicators associated with COVID-19 mortality were extracted from public data sources. Effect estimates of the individual indicators were calculated with univariate models. Presence of spatial autocorrelation was established using Moran’s I statistic. Spatial simultaneous autoregressive (SAR) models that account for spatial autocorrelation in response and predictors were used to assess (i) the proportion of variance in county-level COVID-19 mortality that can be explained by identified health/socioeconomic indicators (R²); and (ii) effect estimates of each predictor. Adjusting for case rates, the selected indicators individually explain 24%–29% of the variability in mortality. Prevalence of chronic kidney disease and proportion of population residing in nursing homes have the highest R². Mortality is estimated to increase by 43 per thousand residents (95% CI: 37–49; p < 0.001) with a 1% increase in the prevalence of chronic kidney disease and by 39 deaths per thousand (95% CI: 34–44; p < 0.001) with a 1% increase in population living in nursing homes. SAR models using multiple health/socioeconomic indicators explain 43% of the variability in COVID-19 mortality in US counties, adjusting for case rates. R² was found to be insensitive to the choice of SAR model form. 
Study limitations include the use of mortality rates that are not age standardized, a spatial adjacency matrix that does not capture human flows among counties, and insufficient accounting for interaction among predictors. Conclusions Significant spatial autocorrelation exists in COVID-19 mortality in the US, and population health/socioeconomic indicators account for considerable variability in county-level mortality. In the context of vaccine rollout in the US and globally, national and subnational estimates of burden of disease could inform optimal geographical allocation of vaccines. 
  2. Summary

    Regional aggregates of health outcomes over delineated administrative units (e.g., states, counties, and zip codes), or areal units, are widely used by epidemiologists to map mortality or incidence rates and capture geographic variation. To capture health disparities over regions, we seek “difference boundaries” that separate neighboring regions with significantly different spatial effects. Matters are more challenging with multiple outcomes over each unit, where we capture dependence among diseases as well as across the areal units. Here, we address multivariate difference boundary detection for correlated diseases. We formulate the problem in terms of Bayesian pairwise multiple comparisons and seek the posterior probabilities of neighboring spatial effects being different. To achieve this, we endow the spatial random effects with a discrete probability law using a class of multivariate areally referenced Dirichlet process models that accommodate spatial and interdisease dependence. We evaluate our method through simulation studies and detect difference boundaries for multiple cancers using data from the Surveillance, Epidemiology, and End Results Program of the National Cancer Institute.

  3. Abstract

    Multivariate spatially oriented data sets are prevalent in the environmental and physical sciences. Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture any underlying spatial association for each variable and associations among the different dependent variables. Multivariate latent spatial process models have proved effective in driving statistical inference and rendering better predictive inference at arbitrary locations for the spatial process. High‐dimensional multivariate spatial data, which are the theme of this article, refer to data sets where the number of spatial locations and the number of spatially dependent variables are very large. The field has witnessed substantial developments in scalable models for univariate spatial processes, but such methods for multivariate spatial processes, especially when the number of outcomes is moderately large, are limited in comparison. Here, we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesian inference, which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the matrix‐normal distribution, which we use to construct scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high‐dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and an analysis of a massive vegetation index data set.

  4. Abstract

    Aim

    We may be able to buffer biodiversity against the effects of ongoing climate change by prioritizing the protection of habitat with diverse physical features (high geodiversity) associated with ecological and evolutionary mechanisms that maintain high biodiversity. Nonetheless, the relationships between biodiversity and habitat vary with spatial and biological context. In this study, we compare how well habitat geodiversity (spatial variation in abiotic processes and features) and climate explain biodiversity patterns of birds and trees. We also evaluate the consistency of biodiversity–geodiversity relationships across ecoregions.


    Location

    Contiguous USA.

    Time period


    Taxa studied

    Birds and trees.


    Methods

    We quantified geodiversity with remotely sensed data and generated biodiversity maps from the Forest Inventory and Analysis and Breeding Bird Survey datasets. We fitted multivariate regressions to alpha, beta and gamma diversity, accounting for spatial autocorrelation among Nature Conservancy ecoregions and relationships among taxonomic, phylogenetic and functional biodiversity. We fitted models including climate alone (temperature and precipitation), geodiversity alone (topography, soil and geology) and climate plus geodiversity.


    Results

    A combination of geodiversity and climate predictor variables fitted most forms of bird and tree biodiversity with < 10% relative error. Models using geodiversity and climate performed better for local (alpha) and regional (gamma) diversity than for turnover‐based (beta) diversity. Among geodiversity predictors, variability of elevation fitted biodiversity best; interestingly, topographically diverse places tended to have higher tree diversity but lower bird diversity.

    Main conclusions

    Although climatic predictors tended to have larger individual effects than geodiversity, adding geodiversity improved climate‐only models of biodiversity. Geodiversity was correlated with biodiversity more consistently than with climate across ecoregions, but models tended to have a poor fit in ecoregions held out of the training dataset. Patterns of geodiversity could help to prioritize conservation efforts within ecoregions. However, we need to understand the underlying mechanisms more fully before we can build models transferable across ecoregions.

  5. Abstract

    Occupancy modelling is a common approach to assess species distribution patterns, while explicitly accounting for false absences in detection–nondetection data. Numerous extensions of the basic single‐species occupancy model exist to model multiple species, to account for spatial autocorrelation and to integrate multiple data types. However, specialized and computationally efficient software that incorporates such extensions, especially for large datasets, is scarce or absent.

    We introduce the spOccupancy R package designed to fit single‐species and multi‐species spatially explicit occupancy models. We fit all models within a Bayesian framework using Pólya‐Gamma data augmentation, which results in fast and efficient inference. spOccupancy provides functionality for data integration of multiple single‐species detection–nondetection datasets via a joint likelihood framework. The package leverages Nearest Neighbour Gaussian Processes to account for spatial autocorrelation, which enables spatially explicit occupancy modelling for potentially massive datasets (e.g. 1,000s–100,000s of sites).

    spOccupancy provides user‐friendly functions for data simulation, model fitting, model validation (by posterior predictive checks), model comparison (using information criteria and k‐fold cross‐validation) and out‐of‐sample prediction. We illustrate the package's functionality via a vignette, simulated data analysis and two bird case studies.

    The spOccupancy package provides a user‐friendly platform to fit a variety of single and multi‐species occupancy models, making it straightforward to address detection biases and spatial autocorrelation in species distribution models even for large datasets.

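
Several of the records above depend on spatial autocorrelation among areal units, and the first related record establishes it with Moran's I. As a minimal sketch of that statistic, the code below computes I = (n / S0) (z' W z) / (z' z), where z holds the mean-centered values, W is a spatial weight matrix, and S0 is the sum of its entries. The four-region adjacency matrix and the values are toy assumptions for illustration only; they are not data from any of the listed studies.

```python
import numpy as np


def morans_i(values, weights):
    """Global Moran's I for areal values under a spatial weight matrix."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()          # deviations from the overall mean
    s0 = w.sum()              # total weight across all region pairs
    return (x.size / s0) * (z @ w @ z) / (z @ z)


# Toy example: four regions in a row, each adjacent to its immediate neighbors.
adjacency = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

smooth = morans_i([1, 2, 3, 4], adjacency)         # neighbors are similar
alternating = morans_i([1, -1, 1, -1], adjacency)  # neighbors are opposite
```

With this toy layout the smoothly increasing values give a positive I and the alternating values give a negative I, which matches the usual reading of the statistic as clustering versus dispersion.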