Title: From calibration to parameter learning: Harnessing the scaling effects of big data in geoscientific modeling
Abstract

The behaviors and skills of models in many geosciences (e.g., hydrology and ecosystem sciences) strongly depend on spatially-varying parameters that need calibration. A well-calibrated model can reasonably propagate information from observations to unobserved variables via model physics, but traditional calibration is highly inefficient and results in non-unique solutions. Here we propose a novel differentiable parameter learning (dPL) framework that efficiently learns a global mapping between inputs (and optionally responses) and parameters. Crucially, dPL exhibits beneficial scaling curves not previously demonstrated to geoscientists: as training data increases, dPL achieves better performance, more physical coherence, and better generalizability (across space and uncalibrated variables), all with orders-of-magnitude lower computational cost. We demonstrate examples that learned from soil moisture and streamflow, where dPL drastically outperformed existing evolutionary and regionalization methods, or required only ~12.5% of the training data to achieve similar performance. The generic scheme promotes the integration of deep learning and process-based models, without mandating reimplementation.
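
The idea of learning one global mapping from site attributes to model parameters, rather than calibrating each site separately, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the process model is a toy linear reservoir, the mapping is a single logistic unit with weights `w` and `b`, the data are synthetic, and the gradients are derived by hand with forward sensitivities instead of an automatic differentiation library. All function and variable names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate(k, forcing, s0=1.0):
    """Toy linear reservoir (an assumption, not the paper's model).

    Returns the simulated flows and their derivatives with respect to k,
    propagated step by step alongside the state (forward sensitivities).
    """
    s, ds_dk = s0, 0.0
    flows, dflows = [], []
    for p in forcing:
        q = k * s                # outflow is a fixed fraction of storage
        dq_dk = s + k * ds_dk    # product rule on q = k * s
        flows.append(q)
        dflows.append(dq_dk)
        s = s + p - q            # water balance update
        ds_dk = ds_dk - dq_dk    # sensitivity of the updated storage
    return flows, dflows


def loss_and_grad(w, b, sites, forcing):
    """Mean squared error over all sites, with gradients for w and b."""
    loss, gw, gb, n = 0.0, 0.0, 0.0, 0
    for a, obs in sites:
        k = sigmoid(w * a + b)   # global mapping: attribute -> parameter
        flows, dflows = simulate(k, forcing)
        dk_dz = k * (1.0 - k)    # derivative of the sigmoid
        for q, dq, o in zip(flows, dflows, obs):
            err = q - o
            loss += err * err
            gw += 2.0 * err * dq * dk_dz * a
            gb += 2.0 * err * dq * dk_dz
            n += 1
    return loss / n, gw / n, gb / n


# Synthetic "observations" generated from a hidden mapping (hypothetical values).
forcing = [1.0, 0.0, 2.0, 0.5, 0.0, 1.5, 0.0, 0.0]
true_w, true_b = 1.5, -0.5
attributes = [-1.0, -0.5, 0.0, 0.5, 1.0]
sites = [(a, simulate(sigmoid(true_w * a + true_b), forcing)[0])
         for a in attributes]

# Train the shared mapping on all sites at once with plain gradient descent.
w, b = 0.0, 0.0
initial_loss, _, _ = loss_and_grad(w, b, sites, forcing)
for _ in range(2000):
    _, gw, gb = loss_and_grad(w, b, sites, forcing)
    w -= 0.2 * gw
    b -= 0.2 * gb
final_loss, _, _ = loss_and_grad(w, b, sites, forcing)

print("initial loss:", initial_loss)
print("final loss:", final_loss)
print("learned w, b:", w, b)
```

Because `w` and `b` are shared across sites, a parameter for a new site can be obtained by evaluating `sigmoid(w * a + b)` on its attribute, with no site-specific calibration run. In practice the hand-derived gradients would be replaced by an automatic differentiation framework and the single logistic unit by a larger network.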

 
Award ID(s):
1940190 1832294
NSF-PAR ID:
10305836
Author(s) / Creator(s):
; ; ; ; ; ; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Nature Communications
Volume:
12
Issue:
1
ISSN:
2041-1723
Sponsoring Org:
National Science Foundation
More Like this
  1. Thenkabail, Prasad S. (Ed.)

    Physically based hydrologic models require significant effort and extensive information for development, calibration, and validation. The study explored the use of the random forest regression (RFR), a supervised machine learning (ML) model, as an alternative to the physically based Soil and Water Assessment Tool (SWAT) for predicting streamflow in the Rio Grande Headwaters near Del Norte, a snowmelt-dominated mountainous watershed of the Upper Rio Grande Basin. Remotely sensed data were used for the random forest machine learning analysis (RFML) and RStudio for data processing and synthesizing. The RFML model outperformed the SWAT model in accuracy and demonstrated its capability in predicting streamflow in this region. We implemented a customized approach to the RFR model to assess the model’s performance for three training periods, across 1991–2010, 1996–2010, and 2001–2010; the results indicated that the model’s accuracy improved with longer training periods, implying that the model trained on a more extended period is better able to capture the parameters’ variability and reproduce streamflow data more accurately. The variable importance (i.e., IncNodePurity) measure of the RFML model revealed that the snow depth and the minimum temperature were consistently the top two predictors across all training periods. The paper also evaluated how well the SWAT model performs in reproducing streamflow data of the watershed with a conventional approach. The SWAT model needed more time and data to set up and calibrate, delivering acceptable performance in annual mean streamflow simulation, with satisfactory index of agreement (d), coefficient of determination (R2), and percent bias (PBIAS) values, but monthly simulation warrants further exploration and model adjustments. 
The study recommends exploring snowmelt runoff hydrologic processes, dust-driven sublimation effects, and more detailed topographic input parameters to update the SWAT snowmelt routine for better monthly flow estimation. The results provide a critical analysis for enhancing streamflow prediction, which is valuable for further research and water resource management, including snowmelt-driven semi-arid regions.

     
  2. Abstract

    Predicting infectious disease dynamics is a central challenge in disease ecology. Models that can assess which individuals are most at risk of being exposed to a pathogen not only provide valuable insights into disease transmission and dynamics but can also guide management interventions. Constructing such models for wild animal populations, however, is particularly challenging; often only serological data are available on a subset of individuals and nonlinear relationships between variables are common.

    Here we provide a guide to the latest advances in statistical machine learning to construct pathogen‐risk models that automatically incorporate complex nonlinear relationships with minimal statistical assumptions from ecological data with missing data. Our approach compares multiple machine learning algorithms in a unified environment to find the model with the best predictive performance and uses game theory to better interpret results. We apply this framework on two major pathogens that infect African lions: canine distemper virus (CDV) and feline parvovirus.

    Our modelling approach provided enhanced predictive performance compared to more traditional approaches, as well as new insights into disease risks in a wild population. We were able to efficiently capture and visualize strong nonlinear patterns, as well as model complex interactions between variables in shaping exposure risk from CDV and feline parvovirus. For example, we found that lions were more likely to be exposed to CDV at a young age but only in low rainfall years.

    When combined with our data calibration approach, our framework helped us to answer questions about risk of pathogen exposure that are difficult to address with previous methods. Our framework not only has the potential to aid in predicting disease risk in animal populations, but also can be used to build robust predictive models suitable for other ecological applications such as modelling species distribution or diversity patterns.

     
  3. Abstract

    Measuring soil health indicators (SHIs), particularly soil total nitrogen (TN), is an important and challenging task that affects farmers’ decisions on the timing, placement, and quantity of fertilizers applied on farms. Most existing methods to measure SHIs are in-lab wet chemistry or spectroscopy-based methods, which require significant human input and effort and are time-consuming, costly, and low-throughput in nature. To address this challenge, we develop an artificial intelligence (AI)-driven near real-time unmanned aerial vehicle (UAV)-based multispectral sensing solution (UMS) to estimate soil TN in an agricultural farm. TN is an important macro-nutrient or SHI that directly affects crop health. Accurate prediction of soil TN can significantly increase crop yield through informed decision making on the timing of seed planting, and fertilizer quantity and timing. The ground-truth data required to train the AI approaches is generated via laser-induced breakdown spectroscopy (LIBS), which can be readily used to characterize soil samples, providing rapid chemical analysis of the samples and their constituents (e.g., nitrogen, potassium, phosphorus, calcium). Although LIBS was previously applied for soil nutrient detection, there is no existing study on the integration of LIBS with UAV multispectral imaging and AI. We train two machine learning (ML) models, multi-layer perceptron regression and support vector regression, to predict the soil nitrogen using a suite of data classes including multispectral characteristics of the soil and crops in red (R), near-infrared, and green (G) spectral bands, computed vegetation indices (NDVI), and environmental variables including air temperature and relative humidity (RH).
To generate the ground-truth data or the training data for the machine learning models, we determine the N spectrum of the soil samples (collected from a farm) using LIBS and develop a calibration model using the correlation between actual TN of the soil samples and the maximum intensity of N spectrum. In addition, we extract the features from the multispectral images captured while the UAV follows an autonomous flight plan, at different growth stages of the crops. The ML model’s performance is tested on a fixed configuration space for the hyper-parameters using various hyper-parameter optimization techniques at three different wavelengths of the N spectrum. 
  4. Pollard, Tom J. (Ed.)
    Modern predictive models require large amounts of data for training and evaluation, the absence of which may result in models that are specific to certain locations, their populations, and clinical practices. Yet best practices for clinical risk prediction models have not yet considered such challenges to generalizability. Here we ask whether population- and group-level performance of mortality prediction models varies significantly when applied to hospitals or geographies different from the ones in which they are developed. Further, what characteristics of the datasets explain the performance variation? In this multi-center cross-sectional study, we analyzed electronic health records from 179 hospitals across the US with 70,126 hospitalizations from 2014 to 2015. The generalization gap, defined as the difference between model performance metrics across hospitals, is computed for area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by the race variable, we report differences in false negative rates across groups. Data were also analyzed using the causal discovery algorithm “Fast Causal Inference”, which infers paths of causal influence while identifying potential influences associated with unmeasured variables. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st-3rd quartile or IQR; median 0.801); calibration slope from 0.725 to 0.983 (IQR; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (IQR; median 0.092). The distributions of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. The race variable also mediated differences in the relationship between clinical variables and mortality, by hospital/region. In conclusion, group-level performance should be assessed during generalizability checks to identify potential harms to the groups.
Moreover, for developing methods to improve model performance in new environments, a better understanding and documentation of provenance of data and health processes are needed to identify and mitigate sources of variation. 
  5. Abstract

    Deep generative learning can be used not only for generating new data with statistical characteristics derived from input data but also for anomaly detection, by separating nominal and anomalous instances based on their reconstruction quality. In this paper, we explore the performance of three unsupervised deep generative models, namely variational autoencoders (VAEs) with Gaussian, Bernoulli, and Boltzmann priors, in detecting anomalies in multivariate time series of commercial-flight operations. We created two VAE models with discrete latent variables (DVAEs), one with a factorized Bernoulli prior and one with a restricted Boltzmann machine (RBM) with novel positive-phase architecture as prior, because of the demand for discrete-variable models in machine-learning applications and because the integration of quantum devices based on two-level quantum systems requires such models. To the best of our knowledge, our work is the first that applies DVAE models to anomaly-detection tasks in the aerospace field. The DVAE with RBM prior, using a relatively simple (and classically or quantum-mechanically enhanceable) sampling technique for the evolution of the RBM’s negative phase, performed better in detecting anomalies than the Bernoulli DVAE and on par with the Gaussian model, which has a continuous latent space. The transfer of a model to an unseen dataset with the same anomaly but without re-tuning of hyperparameters or re-training noticeably impaired anomaly-detection performance, but performance could be improved by post-training on the new dataset. The RBM model was robust to change of anomaly type and phase of flight during which the anomaly occurred. Our studies demonstrate the competitiveness of a discrete deep generative model with its Gaussian counterpart on anomaly-detection problems.
Moreover, the DVAE model with RBM prior can be easily integrated with quantum sampling by outsourcing its generative process to measurements of quantum states obtained from a quantum annealer or gate-model device.

     