
Title: Evaluation of calibration subsetting and new chemometric methods on the spectral prediction of key soil properties in a data‐limited environment
Summary Highlights

Explored new calibration subsetting methods and chemometric models in soil spectral modelling.

Compared the methods and models for 17 soil properties in an understudied area of India.

Random subsetting was not always optimal; subsetting matters and depends on data characteristics.

Sparse models from genomics performed better in 75% of cases than a standard method.

Journal Name:
European Journal of Soil Science
Page Range / eLocation ID:
p. 107-126
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique.

    To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics.

    We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence <0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration.

    Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.

  2. Abstract INTRODUCTION

    Identifying mild cognitive impairment (MCI) patients at risk for dementia could facilitate early interventions. Using electronic health records (EHRs), we developed a model to predict MCI to all‐cause dementia (ACD) conversion at 5 years.


    A Cox proportional hazards model was used to identify predictors of ACD conversion from EHR data in veterans with MCI. Model performance (area under the receiver operating characteristic curve [AUC] and Brier score) was evaluated on a held‐out data subset.
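
    As an illustration of the two metrics named above, the following minimal sketch computes a Brier score (mean squared difference between predicted probability and observed outcome) and a rank-based AUC (the fraction of positive/negative pairs in which the positive case receives the higher predicted risk, with ties counted as half). The function names and the four-sample toy data are assumptions made for this example; they are not taken from the study and do not reproduce its model or results.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0   # positive ranked above negative
            elif pp == pn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical held-out outcomes (1 = converted) and predicted risks.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(round(brier_score(y_true, y_prob), 4))  # 0.1925
print(roc_auc(y_true, y_prob))                # 0.75
```

    A lower Brier score indicates better calibrated probabilities, while an AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect ranking.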


    Of 59,782 MCI patients, 15,420 (25.8%) converted to ACD. The model had good discriminative performance (AUC 0.73 [95% confidence interval (CI) 0.72–0.74]), and calibration (Brier score 0.18 [95% CI 0.17–0.18]). Age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors, while body mass index, alcohol abuse, and sleep apnea were protective factors.


    The EHR‐based prediction model had good performance in identifying 5‐year MCI to ACD conversion and has potential to assist triaging of at‐risk patients.


    Of 59,782 veterans with mild cognitive impairment (MCI), 15,420 (25.8%) converted to all‐cause dementia within 5 years.

    Electronic health record prediction models demonstrated good performance (area under the receiver operating characteristic curve 0.73; Brier 0.18).

    Age and vascular‐related morbidities were predictors of dementia conversion.

    Synthetic data was comparable to real data in modeling MCI to dementia conversion.

    Key Points

    An electronic health record–based model using demographic and co‐morbidity data had good performance in identifying veterans who convert from mild cognitive impairment (MCI) to all‐cause dementia (ACD) within 5 years.

    Increased age, stroke, cerebrovascular disease, myocardial infarction, hypertension, and diabetes were risk factors for 5‐year conversion from MCI to ACD.

    High body mass index, alcohol abuse, and sleep apnea were protective factors for 5‐year conversion from MCI to ACD.

    Models using synthetic data, analogs of real patient data that retain the distribution, density, and covariance between variables of real patient data but are not attributable to any specific patient, performed just as well as models using real patient data. This could have significant implications in facilitating widely distributed computing of health‐care data with minimized patient privacy concern that could accelerate scientific discoveries.

  3. Abstract

    Understanding soil organic carbon (SOC) response to global change has been hindered by an inability to map SOC at horizon scales relevant to coupled hydrologic and biogeochemical processes. Standard SOC measurements rely on homogenized samples taken from distinct depth intervals. Such sampling prevents an examination of fine‐scale SOC distribution within a soil horizon. Visible near‐infrared hyperspectral imaging (HSI) has been applied to intact monoliths and split core surfaces to overcome this limitation. However, the roughness of these surfaces can influence HSI spectra by scattering reflected light in different directions, posing challenges to fine‐scale SOC mapping. Here, we examine the influence of prescribed surface orientation on reflected spectra, develop a method for correcting topographic effects, and calibrate a partial least squares regression (PLSR) model for SOC prediction. Two empirical models that account for surface slope, aspect, and wavelength and two theoretical models that account for the geometry of the spectrometer were compared using 681 homogenized soil samples from across the United States that were packed into sample wells and presented to the spectrometer at 91 orientations. The empirical approach outperformed the more complex geometric models in correcting spectra taken at non‐flat configurations. Topographically corrected spectra reduced bias and error in SOC predicted by PLSR, particularly at slope angles greater than 30°. Our approach clears the way for investigating the spatial distributions of multiple soil properties on rough intact soil samples.

  4. Summary

    Understanding the genetic and physiological basis of abiotic stress tolerance under field conditions is key to varietal crop improvement in the face of climate variability. Here, we investigate dynamic physiological responses to water stress in silico and their relationships to genotypic variation in hydraulic traits of cotton (Gossypium hirsutum), an economically important species for renewable textile fiber production.

    In conjunction with an ecophysiological process‐based model, heterogeneous data (plant hydraulic traits, spatially‐distributed soil texture, soil water content and canopy temperature) were used to examine hydraulic characteristics of cotton, evaluate their consequences on whole plant performance under drought, and explore potential genotype × environment effects.

    Cotton was found to have R‐shaped hydraulic vulnerability curves (VCs), which were consistent under drought stress initiated at flowering. Stem VCs, expressed as percent loss of conductivity, differed across genotypes, whereas root VCs did not. Simulation results demonstrated how plant physiological stress can depend on the interaction between soil properties and irrigation management, which in turn affect genotypic rankings of transpiration in a time‐dependent manner.

    Our study shows how a process‐based modeling framework can be used to link genotypic variation in hydraulic traits to differential acclimating behaviors under drought.

  5. Abstract

    In savannas, partitioning of below‐ground resources by depth could facilitate tree–grass coexistence and shape vegetation responses to changing rainfall patterns. However, most studies assessing tree versus grass root‐niche partitioning have focused on one or two sites, limiting generalization about how rainfall and soil conditions influence the degree of rooting overlap across environmental gradients.

    We used two complementary stable isotope techniques to quantify variation (a) in water uptake depths and (b) in fine‐root biomass distributions among dominant trees and grasses at eight semi‐arid savanna sites in Kruger National Park, South Africa. Sites were located on contrasting soil textures (clayey basaltic soils vs. sandy granitic soils) and paired along a gradient of mean annual rainfall.

    Soil texture predicted variation in mean water uptake depths and fine‐root allocation. While grasses maintained roots close to the surface and consistently used shallow water, trees on sandy soils distributed roots more evenly across soil depths and used deeper soil water, resulting in greater divergence between tree and grass rooting on sandy soils. Mean annual rainfall predicted some variation among sites in tree water uptake depth, but had a weaker influence on fine‐root allocation.

    Synthesis. Savanna trees overlapped more with shallow‐rooted grasses on clayey soils and were more distinct in their use of deeper soil layers on sandy soils, consistent with expected differences in infiltration and percolation. These differences, which could allow trees to escape grass competition more effectively on sandy soils, may explain observed differences in tree densities and rates of woody encroachment with soil texture. Differences in the degree of root‐niche separation could also drive heterogeneous responses of savanna vegetation to predicted shifts in the frequency and intensity of rainfall.
