Title: An evaluation framework for predictive models of neighbourhood change with applications to predicting residential sales in Buffalo, NY

New data and technologies, in particular machine learning, may make it possible to forecast neighbourhood change. Doing so may help, for example, to prevent the negative impacts of gentrification on marginalised communities. However, predictive models of neighbourhood change face four challenges: accuracy (are they right?), granularity (are they right at spatial or temporal scales that actually matter for a policy response?), bias (are they equitable?) and expert validity (do models and their predictions make sense to domain experts?). The present work provides a framework to evaluate the performance of predictive models of neighbourhood change along these four dimensions. We illustrate the application of our evaluation framework via a case study of Buffalo, NY, where we consider the following prediction task: given historical data, can we predict the percentage of residential buildings that will be sold or foreclosed on in a given area over a fixed amount of time into the future?

 
Award ID(s):
1939579
PAR ID:
10448169
Publisher / Repository:
SAGE Publications
Journal Name:
Urban Studies
Volume:
61
Issue:
5
ISSN:
0042-0980
Format(s):
Medium: X
Size(s):
p. 838-858
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Aim

    Species distribution models (SDMs) are widely used to predict how species distributions may change in response to climatic change. To assess the reliability of those predictions, they need to be critically validated with respect to what they are used for. While ecologists are typically interested in how and where distributions will change, we argue that SDMs have seldom been evaluated in terms of their capacity to predict such change. Instead, typical retrospective validation methods estimate a model's ability to predict only one static time in the future. Here, we apply two validation methods, one that predicts and evaluates a static pattern and another that measures change, and compare their estimates of predictive performance.

    Location

    Fennoscandia.

    Methods

    We applied a joint SDM to model the distributions of 120 bird species in four model validation settings. We trained models with a dataset from 1975 to 1999 and predicted species' future occurrence and abundance in two ways: for one static time period (2013–2016, ‘static validation’) and for a change between two time periods (difference between 1996–1999 and 2013–2016, ‘change validation’). We then measured predictive performance using correlation between predicted and observed values. We also related predictive performance to species traits.

    Results

    Even though the static validation method rated predictive performance as good, the change validation method indicated very poor performance. Predictive performance was not strongly related to any trait.

    Main Conclusions

    The static validation method might overestimate predictive performance by not revealing the model's inability to predict change events. If species' distributions remain mostly stable, then even an unfit model can predict the near future well due to temporal autocorrelation. We urge caution when working with forecasts of changes in spatial patterns of species occupancy or abundance, even for SDMs that are based on time series datasets, unless they are critically validated for forecasting such change.

     
  2. Abstract Objective:

    Comprehensive studies examining longitudinal predictors of dietary change during the coronavirus disease 2019 pandemic are lacking. Based on an ecological framework, this study used longitudinal data to test whether individual, social and environmental factors predicted change in dietary intake during the peak of the coronavirus disease 2019 pandemic in Los Angeles County and examined interactions among the multilevel predictors.

    Design:

    We analysed two survey waves (i.e. baseline and follow-up) of the Understanding America Study, administered online to the same participants 3 months apart. The surveys assessed dietary intake and individual, social, and neighbourhood factors potentially associated with diet. Lagged multilevel regression models were used to predict change from baseline to follow-up in daily servings of fruits, vegetables and sugar-sweetened beverages.

    Setting:

    Data were collected in October 2020 and January 2021, during the peak of the coronavirus disease 2019 pandemic in Los Angeles County.

    Participants:

    903 adults representative of Los Angeles County households.

    Results:

    Individuals who had depression and less education or who identified as non-Hispanic Black or Hispanic reported unhealthy dietary changes over the study period. Individuals with smaller social networks, especially low-income individuals with smaller networks, also reported unhealthy dietary changes. After accounting for individual and social factors, neighbourhood factors were generally not associated with dietary change.

    Conclusions:

    Given that poor diets are a leading cause of death in the USA, addressing ecological risk factors that put some segments of the community at risk for unhealthy dietary changes during a crisis should be a priority for health interventions and policy.

     
  3. Many visual analytics systems allow users to interact with machine learning models towards the goals of data exploration and insight generation on a given dataset. However, in some situations, insights may be less important than the production of an accurate predictive model for future use. In that case, users are more interested in generating diverse and robust predictive models, verifying their performance on holdout data, and selecting the most suitable model for their usage scenario. In this paper, we consider the concept of Exploratory Model Analysis (EMA), which is defined as the process of discovering and selecting relevant models that can be used to make predictions on a data source. We delineate the differences between EMA and the well‐known term exploratory data analysis in terms of the desired outcome of the analytic process: insights into the data or a set of deployable models. The contributions of this work are a visual analytics system workflow for EMA, a user study, and two use cases validating the effectiveness of the workflow. We found that our system workflow enabled users to generate complex models, to assess them for various qualities, and to select the most relevant model for their task.
  4. In machine learning, predictors trained on a given data distribution are usually guaranteed to perform well on average for further examples from the same distribution. This may often involve disregarding or diminishing the predictive power on atypical examples; or, in more extreme cases, a data distribution may be composed of a mixture of individually “atypical” heterogeneous populations, and the kind of simple predictors we can train may find it difficult to fit all of these populations simultaneously. In such cases, we may wish to make predictions for an atypical point by selecting a suitable reference class for that point: a subset of the data that is “more similar” to the given query point in an appropriate sense. Closely related tasks also arise in applications such as diagnosis or explaining the output of classifiers. We present new algorithms for computing k-DNF reference classes and establish much stronger approximation guarantees for their error rates.
  5. Beiko, Robert G (Ed.)
    ABSTRACT

    Inflammatory bowel disease (IBD) is characterized by complex etiology and a disrupted colonic ecosystem. We provide a framework for the analysis of multi-omic data, which we apply to study the gut ecosystem in IBD. Specifically, we train and validate models using data on the metagenome, metatranscriptome, virome, and metabolome from the Human Microbiome Project 2 IBD multi-omic database, with 1,785 repeated samples from 130 individuals (103 cases and 27 controls). After splitting the participants into training and testing groups, we used mixed-effects least absolute shrinkage and selection operator regression to select features for each omic. These features, with demographic covariates, were used to generate separate single-omic prediction scores. All four single-omic scores were then combined into a final regression to assess the relative importance of the individual omics and the predictive benefits when considered together. We identified several species, pathways, and metabolites known to be associated with IBD risk, and we explored the connections between data sets. Individually, metabolomic and viromic scores were more predictive than metagenomics or metatranscriptomics, and when all four scores were combined, we predicted disease diagnosis with a Nagelkerke’s R² of 0.46 and an area under the curve of 0.80 (95% confidence interval: 0.63, 0.98). Our work supports that some single-omic models for complex traits are more predictive than others, that incorporating multiple omic data sets may improve prediction, and that each omic data type provides a combination of unique and redundant information. This modeling framework can be extended to other complex traits and multi-omic data sets.

    IMPORTANCE

    Complex traits are characterized by many biological and environmental factors, such that multi-omic data sets are well-positioned to help us understand their underlying etiologies. We applied a prediction framework across multiple omics (metagenomics, metatranscriptomics, metabolomics, and viromics) from the gut ecosystem to predict inflammatory bowel disease (IBD) diagnosis. The predicted scores from our models highlighted key features and allowed us to compare the relative utility of each omic data set in single-omic versus multi-omic models. Our results emphasized the importance of metabolomics and viromics over metagenomics and metatranscriptomics for predicting IBD status. The greater predictive capability of metabolomics and viromics is likely because these omics serve as markers of lifestyle factors such as diet. This study provides a modeling framework for multi-omic data, and our results show the utility of combining multiple omic data types to disentangle complex disease etiologies and biological signatures.

     