

Title: Spatial thinning and class balancing: Key choices lead to variation in the performance of species distribution models with citizen science data
Abstract

Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique.

To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics.
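The three data treatments compared above can be illustrated with a minimal sketch. It assumes thinning means keeping one randomly chosen record per square grid cell and balancing means downsampling each class to the minority count; the abstract does not state the exact procedures used, and the function names, the `(x, y, label)` record layout and the toy coordinates are all hypothetical.

```python
import random
from collections import defaultdict

def thin_by_grid(records, cell_size, rng):
    """Spatial thinning (assumed form): keep one random record per grid cell."""
    cells = defaultdict(list)
    for rec in records:
        x, y, _label = rec
        cells[(int(x // cell_size), int(y // cell_size))].append(rec)
    return [rng.choice(group) for group in cells.values()]

def majority_only_thin(records, cell_size, seed=0):
    """Thin only the majority class; retain every minority-class record."""
    rng = random.Random(seed)
    presences = [r for r in records if r[2] == 1]
    absences = [r for r in records if r[2] == 0]
    if len(presences) <= len(absences):
        minority, majority = presences, absences
    else:
        minority, majority = absences, presences
    return minority + thin_by_grid(majority, cell_size, rng)

def balance_classes(records, seed=0):
    """Class balancing (assumed form): downsample both classes to the minority count."""
    rng = random.Random(seed)
    presences = [r for r in records if r[2] == 1]
    absences = [r for r in records if r[2] == 0]
    n = min(len(presences), len(absences))
    return rng.sample(presences, n) + rng.sample(absences, n)

# Hypothetical toy data: (x, y, detection) with 2 detections and 5 non-detections.
records = [
    (0.1, 0.1, 0), (0.2, 0.3, 0), (0.4, 0.4, 1),
    (1.5, 0.2, 0), (1.6, 0.7, 0),
    (1.2, 1.3, 1), (1.4, 1.8, 0),
]
thinned_all = thin_by_grid(records, 1.0, random.Random(0))
thinned_majority = majority_only_thin(records, 1.0)
balanced = balance_classes(records)
```

In this toy case, thinning all the data can discard detections that share a cell with non-detections, whereas majority-only thinning always keeps both detections.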

We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence <0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration.
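A small worked example may help show why discrimination and calibration can diverge. The sketch below computes the true skill statistic (sensitivity + specificity − 1) and the Brier score (mean squared difference between predicted probability and outcome) for two hypothetical sets of predictions; the numbers are illustrative assumptions, not results from the study, and the 0.5 threshold is an arbitrary choice.

```python
def true_skill_statistic(y_true, y_prob, threshold=0.5):
    """TSS = sensitivity + specificity - 1; assumes both classes are present."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

def brier_score(y_true, y_prob):
    """Mean squared error of predicted probabilities; lower is better."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical rare species: one presence among ten sites (prevalence 0.1).
y_true = [1] + [0] * 9
probs_a = [0.6] + [0.1] * 9  # probabilities close to the observed prevalence
probs_b = [0.9] + [0.4] * 9  # same ranking, but inflated probabilities

tss_a = true_skill_statistic(y_true, probs_a)
tss_b = true_skill_statistic(y_true, probs_b)
brier_a = brier_score(y_true, probs_a)
brier_b = brier_score(y_true, probs_b)
```

Both prediction sets separate the presence from the absences equally well at this threshold, so their TSS values match, yet the inflated probabilities give a much worse Brier score.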

Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.

 
NSF-PAR ID:
10453579
Publisher / Repository:
Wiley-Blackwell
Journal Name:
Methods in Ecology and Evolution
Volume:
12
Issue:
2
ISSN:
2041-210X
Page Range / eLocation ID:
p. 216-226
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Species distribution models (SDMs) are frequently used to predict the effects of climate change on species of conservation concern. Biases inherent in the process of constructing SDMs and transferring them to new climate scenarios may result in undesirable conservation outcomes. We explore these issues and demonstrate new methods to estimate biases induced by the design of SDM studies. We present these methods in the context of estimating the effects of climate change on Australia's only endemic Pokémon.

    Using a citizen science dataset, we build species distribution models for Garura kangaskhani to predict the effects of climate change on the suitability of habitat for the species. We demonstrate a novel Monte Carlo procedure for estimating the biases implicit in a given study design, and compare the results seen for Pokémon to those seen from our Monte Carlo tests as well as previous studies in the same region using both simulated and real data.

    Our models suggest that climate change will impact the suitability of habitat for G. kangaskhani, which may compound the effects of threats such as habitat loss and their use in blood sport. However, we also find that using SDMs to estimate the effects of climate change can be accompanied by biases so strong that the data themselves have minimal impact on modelling outcomes.

    We show that the direction and magnitude of bias in estimates of climate change impacts are affected by every aspect of the modelling process, and suggest that bias estimates should be included in future studies of this type. Given the widespread use of SDMs, systemic biases could have substantial financial and opportunity costs. By demonstrating these biases and presenting a novel statistical tool to estimate them, we hope to provide a more secure future for G. kangaskhani and the rest of the world's biodiversity.

     
  2. Abstract Aim

    Species distribution models (SDMs) that integrate presence‐only and presence–absence data offer a promising avenue to improve information on species' geographic distributions. The use of such ‘integrated SDMs’ on a species range‐wide extent has been constrained by the often limited presence–absence data and by the heterogeneous sampling of the presence‐only data. Here, we evaluate integrated SDMs for studying species ranges with a novel expert range map‐based evaluation. We build new understanding about how integrated SDMs address issues of estimation accuracy and data deficiency and thereby offer advantages over traditional SDMs.

    Location

    South and Central America.

    Time Period

    1979–2017.

    Major Taxa Studied

    Hummingbirds.

    Methods

    We build integrated SDMs by linking two observation models – one for each data type – to the same underlying spatial process. We validate SDMs with two schemes: (i) cross‐validation with presence–absence data and (ii) comparison with respect to the species' whole range as defined with IUCN range maps. We also compare models relative to the estimated response curves and compute the association between the benefit of the data integration and the number of presence records in each data set.

    Results

    The integrated SDM accounting for the spatially varying sampling intensity of the presence‐only data was one of the top performing models in both model validation schemes. Presence‐only data alleviated overly large niche estimates, and data integration was beneficial compared to modelling solely presence‐only data for species which had few presence points when predicting the species' whole range. On the community level, integrated models improved the species richness prediction.

    Main Conclusions

    Integrated SDMs combining presence‐only and presence–absence data are successfully able to borrow strengths from both data types and offer improved predictions of species' ranges. Integrated SDMs can potentially alleviate the impacts of taxonomically and geographically uneven sampling and leverage the detailed sampling information in presence–absence data.

     
  3. Abstract

    Species' ranges are changing at accelerating rates. Species distribution models (SDMs) are powerful tools that help rangers and decision‐makers prepare for reintroductions, range shifts, reductions and/or expansions by predicting habitat suitability across landscapes. Yet, range‐expanding or ‐shifting species in particular face other challenges that traditional SDM procedures cannot quantify, due to large differences between a species' currently occupied range and potential future range. SDMs thus lose realism and become less useful for conservation management in practice. Here, we address these challenges with an extended assessment of habitat suitability through an integrated SDM database (iSDMdb).

    The iSDMdb is a spatial database of predicted sites in a species' prediction range, derived from SDM results, and is a single spatial feature that contains additional, user‐friendly data fields that synthesise and summarise SDM predictions and uncertainty, human impacts, restoration features, novel preferences in novel spaces and management priorities. To illustrate its utility, we used the endangered New Zealand sea lion Phocarctos hookeri. We consulted with wildlife rangers, decision‐makers and sea lion experts to supplement SDM predictions with additional, more realistic and applicable information for management.

    Almost half the data fields included in this database resulted from engaging with these end‐users during our study. The SDM found 395 predicted sites. However, the iSDMdb's additional assessments showed that the actual suitability of most sites (90%) was questionable due to human impacts. More than 50% of sites contained unnatural barriers (fences, grazing grasslands), and 75% of sites had roads located within the species' range of inland movement. Just 5% of the predicted sites were mostly (>80%) protected.

    Integrating SDM results with supplemental assessments provides a way to address SDM limitations, especially for range‐expanding or ‐shifting species. SDM products for conservation applications have been critiqued for lacking transparency and interpretation support, and ineffectively communicating uncertainty. The iSDMdb addresses these issues and enhances the practical relevance and utility of SDMs for stakeholders, rangers and decision‐makers. We exemplify how to build an iSDMdb using open‐source tools, and how to make diverse, complex assessments more accessible for end‐users.

     
  4. Abstract

    The area under the curve (AUC) of the receiver operating characteristic (or certain modifications of it) is almost universally used to assess the performance of species distribution models (SDMs), despite the well‐recognized problems encountered with this approach, which arise mainly when dealing with presence‐only data.

    We present a probabilistic treatment of the presence‐only problem and derive a method to assess the performance of SDMs based on the analysis of an area‐presence plot and the SDM outputs represented in both geographic and environmental spaces.

    We show how our method is useful to solve the two main tasks for which the AUC is used: assessing the performance of an SDM and comparing the performance of different SDMs. Our results build on previous work and constitute a rigorous method for assessing the performance of SDMs in relation to a random classifier.

    We establish comparisons with two of the most popular approaches used to assess the performance of an SDM, the AUC and the Boyce index, and identify cases in which our method has advantages over these two approaches.

    We suggest that the performance of an algorithm that classifies presence‐only data can be assessed by two factors: (a) the degree of non‐randomness of the classification at every step in the accumulation curve of presences, and (b) the amount of uninformative niche space used for the classification. The method we developed can be applied to any SDM output by using the R functions available at: https://github.com/LauraJim/SDM‐hyperTest.

     
  5. Abstract Aim

    Species distribution models (SDMs) are ubiquitous in ecology to predict species occurrence throughout their range. Typically, SDMs are created using presence‐only or presence–absence data. We hypothesize that the continuous metric of temporal occupancy, the proportion of time a species is observed at a given site, provides more detail about species occurrence than binary presence‐based SDMs.

    Location

    North America.

    Methods

    We compared SDMs for 189 focal species using four modelling methods to determine whether North American avian species distributions are better predicted using temporal occupancy over presence–absence. We used the North American Breeding Bird Survey and built SDMs based on all sites sampled consecutively between 2001 and 2015, as well as on a subset of only five time points within the 15‐year sampling window. Each model used the same environmental inputs to predict species range. Each SDM was cross‐validated temporally and spatially.

    Results

    Species distributions were generally better predicted using temporal occupancy rather than presence–absence when using either a five‐year or fifteen‐year sampling window. Species that occurred in a smaller proportion of their predicted range were particularly better predicted with SDMs using temporal occupancy. Temporal occupancy SDMs had lower false discovery and false‐positive rates but higher false‐negative rates than presence–absence models.

    Main conclusions

    Temporal occupancy is a valuable metric that can improve predictions of species occurrence for birds and may improve conservation planning and design efforts.

     