

Title: Functional random forests for curve response
Abstract

The rapid growth of functional data in various application fields has increased the demand for advanced statistical approaches that can accommodate complex structures and nonlinear associations. In this article, we propose a novel functional random forests (FunFor) approach to model a functional response that is densely and regularly measured, extending the landmark work of Breiman, who introduced traditional random forests for a univariate response. The FunFor approach predicts curve responses for new observations and selects important variables from a large set of scalar predictors. It inherits the efficiency of the traditional random forest approach in detecting complex relationships, including nonlinear and high-order interactions. Additionally, it is a non-parametric approach that imposes no parametric or distributional assumptions. Eight simulation settings and one real-data analysis consistently demonstrate the excellent performance of the FunFor approach in various scenarios. In particular, FunFor successfully ranks the true predictors as the most important variables, while achieving the most robust variable selection and the smallest prediction errors compared with three other relevant approaches. Although motivated by a biological leaf shape data analysis, the proposed FunFor approach has great potential to be widely applied in various fields due to its minimal tuning-parameter requirements and its distribution-free and model-free nature. An R package named 'FunFor', implementing the FunFor approach, is available on GitHub.
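To make the idea of growing a tree on a curve response concrete, the following is a minimal sketch, not the FunFor algorithm itself. The abstract does not state the splitting rule, so this sketch assumes a simple one: each curve is sampled on a common grid, and a split on a scalar predictor is scored by the summed squared distance of the curves to their node's mean curve. All function names (`mean_curve`, `node_impurity`, `best_split`) and the toy data are hypothetical.

```python
from statistics import fmean


def mean_curve(curves):
    """Pointwise mean of curves sampled on the same grid (equal-length lists)."""
    return [fmean(values) for values in zip(*curves)]


def node_impurity(curves):
    """Sum of squared distances from each curve to the node's mean curve."""
    if not curves:
        return 0.0
    center = mean_curve(curves)
    return sum((y - m) ** 2 for curve in curves for y, m in zip(curve, center))


def best_split(x, curves):
    """Search one scalar predictor for the threshold with the lowest impurity.

    Returns (threshold, impurity); threshold is None if no split improves
    on the unsplit node.
    """
    best = (None, node_impurity(curves))
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between adjacent observed values
        left = [c for xi, c in zip(x, curves) if xi <= threshold]
        right = [c for xi, c in zip(x, curves) if xi > threshold]
        impurity = node_impurity(left) + node_impurity(right)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Toy data (assumed): four curves on a 3-point grid; the predictor
# separates two distinct curve shapes.
x = [0.1, 0.2, 0.8, 0.9]
curves = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
threshold, impurity = best_split(x, curves)
```

A forest would repeat this search over bootstrap samples and random predictor subsets, then average the leaf mean curves to predict a new curve. How FunFor scores splits in practice should be checked against the paper and the R package.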

 
NSF-PAR ID:
10383769
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Scientific Reports
Volume:
11
Issue:
1
ISSN:
2045-2322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyze a large survey that captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there is a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm impact migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-method approaches may help to shed light on factors contributing to migration and non-migration.

     
  2. Abstract

    Satellite precipitation products, like all quantitative estimates, come with some inherent degree of uncertainty. To associate a quantitative value of the uncertainty with each individual estimate, error modeling is necessary. Most of the error models proposed so far compute the uncertainty as a function of precipitation intensity only, and only at one specific spatiotemporal scale. We propose a spectral error model that incorporates the neighboring space–time dynamics of precipitation into the uncertainty quantification. Systematic distortions of the precipitation signal and random errors are characterized separately in every frequency–wavenumber band in the Fourier domain, to accurately characterize error across scales. The systematic distortions are represented as a deterministic space–time linear filtering term. The random errors are represented as a nonstationary additive noise. The spectral error model is applied to the IMERG multisatellite precipitation product, and its parameters are estimated empirically through a system identification approach using the GV-MRMS gauge–radar measurements as reference (“truth”) over the eastern United States. The filtering term is found to be essentially low-pass (attenuating the fine-scale variability). While traditional error models attribute most of the error variance to random errors, it is found here that the systematic filtering term explains 48% of the error variance at the native resolution of IMERG. This finding confirms that, at high resolution, filtering effects in satellite precipitation products cannot be ignored, and that the error cannot be represented as a purely random additive or multiplicative term. An important consequence is that precipitation estimates derived from different sources should not be expected to automatically have statistically independent errors.

    Significance Statement

    Satellite precipitation products are nowadays widely used for climate and environmental research, water management, risk analysis, and decision support at the local, regional, and global scales. For all these applications, knowledge about the accuracy of the products is critical for their usability. However, products are not systematically provided with a quantitative measure of the uncertainty associated with each individual estimate. Various parametric error models have been proposed for uncertainty quantification, mostly assuming that the uncertainty is only a function of the precipitation intensity at the pixel and time of interest. By projecting satellite precipitation fields and their retrieval errors into the Fourier frequency–wavenumber domain, we show that we can explicitly take into account the neighboring space–time multiscale dynamics of precipitation and compute a scale-dependent uncertainty.
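    The structure described in this abstract, a deterministic linear filter plus additive noise, can be written compactly. The notation below is an assumption made for illustration and is not taken from the paper: $P$ is the reference precipitation field, $\hat{P}$ the satellite estimate, $h$ the filter kernel, and $\varepsilon$ the random error.

```latex
% Space-time domain: the estimate is a filtered version of the truth plus noise
\hat{P}(\mathbf{x}, t) = (h * P)(\mathbf{x}, t) + \varepsilon(\mathbf{x}, t)

% Fourier domain (wavenumber k, frequency \omega): convolution becomes a product
\hat{P}(\mathbf{k}, \omega) = H(\mathbf{k}, \omega)\, P(\mathbf{k}, \omega) + E(\mathbf{k}, \omega)
```

    Under this reading, a low-pass filter corresponds to $|H|$ decreasing toward zero at high wavenumbers and frequencies, which matches the reported attenuation of fine-scale variability.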

     
  3. Abstract

    Traits differentially adapt plant species to particular conditions generating compositional shifts along environmental gradients. As a result, community‐scale trait values show concomitant shifts, termed trait‒environment relationships. Trait‒environment relationships are often assessed by evaluating community‐weighted mean (CWM) traits observed along environmental gradients. Regression‐based approaches (CWMr) assume that local communities exhibit traits centred at a single optimum value and that traits do not covary meaningfully. Evidence suggests that the shape of trait‒abundance relationships can vary widely along environmental gradients—reflecting complex interactions—and traits are usually interrelated. We used a model that accounts for these factors to explore trait‒environment relationships in herbaceous forest plant communities in Wisconsin (USA).

    We built a generalized linear mixed model (GLMM) to analyse how abundances of 185 species distributed among 189 forested sites vary in response to four functional traits (vegetative height—VH, leaf size—LS, leaf mass per area—LMA and leaf carbon content), six environmental variables describing overstorey, soil and climate conditions, and their interactions. The GLMM allowed us to assess the nature and relative strength of the resulting 24 trait‒environment relationships. We also compared results between GLMM and CWMr to explore how conclusions differ between approaches.

    The GLMM identified five significant trait‒environment relationships that together explain ~40% of variation in species abundances across sites. Temperature appeared as a key environmental driver, with warmer and more seasonal sites favouring taller plants. Soil texture and temperature seasonality affected LS and LMA; seasonality effects on LS and LMA were nonlinear, declining at more seasonal sites. Although often assumed for CWMr, only some traits under certain conditions had centred optimum trait‒abundance relationships. CWMr more liberally identified (13) trait‒environment relationships as significant but failed to detect the temperature seasonality‒LMA relationship identified by the GLMM.

    Synthesis. Although the GLMM is a more methodologically complex approach than CWMr, it identified a reduced set of trait‒environment relationships still capable of accounting for the responses of forest understorey herbs to environmental gradients. It also identified separate effects of mean and seasonal temperature on LMA that appear important in these forests, generating useful insights and supporting broader application of the GLMM approach to understand trait‒environment relationships.

     
  4. Background

    Although conventional prediction models for surgical patients often ignore intraoperative time-series data, deep learning approaches are well-suited to incorporate time-varying and non-linear data with complex interactions. Blood lactate concentration is one important clinical marker that can reflect the adequacy of systemic perfusion during cardiac surgery. During cardiac surgery and cardiopulmonary bypass, minute-level data is available on key parameters that affect perfusion. The goal of this study was to use machine learning and deep learning approaches to predict maximum blood lactate concentrations after cardiac surgery. We hypothesized that models using minute-level intraoperative data as inputs would have the best predictive performance.

    Methods

    Adults who underwent cardiac surgery with cardiopulmonary bypass were eligible. The primary outcome was maximum lactate concentration within 24 h postoperatively. We considered three classes of predictive models, using the performance metric of mean absolute error across testing folds: (1) static models using baseline preoperative variables, (2) augmentation of the static models with intraoperative statistics, and (3) a dynamic approach that integrates preoperative variables with intraoperative time series data.

    Results

    2,187 patients were included. For three models that only used baseline characteristics (linear regression, random forest, artificial neural network) to predict maximum postoperative lactate concentration, the prediction error ranged from a median of 2.52 mmol/L (IQR 2.46, 2.56) to 2.58 mmol/L (IQR 2.54, 2.60). The inclusion of intraoperative summary statistics (including intraoperative lactate concentration) improved model performance, with the prediction error ranging from a median of 2.09 mmol/L (IQR 2.04, 2.14) to 2.12 mmol/L (IQR 2.06, 2.16). For two modelling approaches (recurrent neural network, transformer) that can utilize intraoperative time-series data, the lowest prediction error was obtained with a range of median 1.96 mmol/L (IQR 1.87, 2.05) to 1.97 mmol/L (IQR 1.92, 2.05). Intraoperative lactate concentration was the most important predictive feature based on Shapley additive values. Anemia and weight were also important predictors, but there was heterogeneity in the importance of other features.

    Conclusion

    Postoperative lactate concentrations can be predicted using baseline and intraoperative data with moderate accuracy. These results reflect the value of intraoperative data in the prediction of clinically relevant outcomes to guide perioperative management.
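    The Methods above compare models by mean absolute error across testing folds, and the Results report a median and IQR. A minimal sketch of that bookkeeping follows. It assumes contiguous folds and a baseline that predicts the training mean, and it uses made-up numbers rather than study data; none of the function names come from the study.

```python
from statistics import fmean, median, quantiles


def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return fmean(abs(a - p) for a, p in zip(actual, predicted))


def kfold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validated_mae(y, k):
    """MAE on each test fold for a baseline that predicts the training mean."""
    errors = []
    for train, test in kfold_indices(len(y), k):
        prediction = fmean(y[i] for i in train)
        actual = [y[i] for i in test]
        errors.append(mean_absolute_error(actual, [prediction] * len(test)))
    return errors


# Made-up outcome values for illustration only; not data from the study.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_errors = cross_validated_mae(y, k=3)

# Summarize the per-fold errors as median and interquartile range.
q1, _, q3 = quantiles(fold_errors, n=4)
summary = (median(fold_errors), q1, q3)
```

    Replacing the training-mean baseline with a fitted model leaves the fold loop and the summary step unchanged, which is what makes the three model classes directly comparable.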

     
  5. Windecker, Saras (Ed.)
    1. The ecological and environmental science communities have embraced machine learning (ML) for empirical modelling and prediction. However, going beyond prediction to draw insights into underlying functional relationships between response variables and environmental ‘drivers’ is less straightforward. Deriving ecological insights from fitted ML models requires techniques to extract the ‘learning’ hidden in the ML models.

    2. We revisit the theoretical background and effectiveness of four approaches for ranking independent variable importance (Gini importance, GI; permutation importance, PI; split importance, SI; and conditional permutation importance, CPI), and two approaches for inference of bivariate functional relationships (partial dependence plots, PDP; and accumulated local effect plots, ALE). We also explore the use of a surrogate model for visualization and interpretation of complex multi-variate relationships between response variables and environmental drivers. We examine the challenges and opportunities for extracting ecological insights with these interpretation approaches. Specifically, we aim to improve interpretation of ML models by investigating how effectiveness relates to (a) interpretation algorithm, (b) sample size and (c) the presence of spurious explanatory variables.

    3. We base the analysis on simulations with known underlying functional relationships between response and predictor variables, with added white noise and the presence of correlated but non-influential variables. The results indicate that deriving ecological insight is strongly affected by interpretation algorithm and spurious variables, and moderately impacted by sample size. Removing spurious variables improves interpretation of ML models. Meanwhile, increasing sample size has limited value in the presence of spurious variables, but does improve performance once spurious variables are omitted. Among the four ranking methods, SI is slightly more effective than the other methods in the presence of spurious variables, while GI and SI yield higher accuracy when spurious variables are removed. PDP is more effective in retrieving underlying functional relationships than ALE, but its reliability declines sharply in the presence of spurious variables. Visualization and interpretation of the interactive effects of predictors and the response variable can be enhanced using surrogate models, including three-dimensional visualizations and use of loess planes to represent independent variable effects and interactions.

    4. Machine learning analysts should be aware that including correlated independent variables in ML models with no clear causal relationship to response variables can interfere with ecological inference. When ecological inference is important, ML models should be constructed with independent variables that have clear causal effects on response variables. While interpreting ML models for ecological inference remains challenging, we show that careful choice of interpretation methods, exclusion of spurious variables and adequate sample size can provide more and better opportunities to ‘learn from machine learning’.
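    Of the ranking methods named above, permutation importance (PI) is the simplest to illustrate. The sketch below assumes importance is measured as the mean increase in mean squared error after shuffling one column. It uses a hand-written stand-in for a fitted model and an independent noise column, so it does not reproduce the correlated spurious variables studied in the paper. All names are hypothetical.

```python
import random
from statistics import fmean


def mse(model, rows, targets):
    """Mean squared error of a model (any callable on a row) over a dataset."""
    return fmean((model(r) - t) ** 2 for r, t in zip(rows, targets))


def permutation_importance(model, rows, targets, column, rng, repeats=10):
    """Mean increase in MSE when the values of one column are shuffled."""
    baseline = mse(model, rows, targets)
    increases = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with only the chosen column replaced.
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        increases.append(mse(model, permuted, targets) - baseline)
    return fmean(increases)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed toy data: column 0 drives the response, column 1 is pure noise.
rows = [[float(i), rng.random()] for i in range(20)]
targets = [2.0 * r[0] for r in rows]


def model(row):
    """Stand-in for a fitted model; it uses column 0 only."""
    return 2.0 * row[0]


importances = [permutation_importance(model, rows, targets, c, rng) for c in range(2)]
```

    Because the stand-in model never reads column 1, shuffling it leaves every prediction unchanged and its importance is exactly zero. With correlated spurious variables and a model that does use them, the ranking becomes less clear-cut, which is the situation the abstract warns about.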