

Title: Functional random forests for curve response
Abstract

The rapid growth of functional data across application fields has increased the demand for statistical approaches that can accommodate complex structures and nonlinear associations. In this article, we propose a novel functional random forests (FunFor) approach to model a functional response that is densely and regularly measured, extending the landmark work of Breiman, who introduced traditional random forests for a univariate response. The FunFor approach predicts curve responses for new observations and selects important variables from a large set of scalar predictors. It inherits the efficiency of the traditional random forest approach in detecting complex relationships, including nonlinear effects and high-order interactions. Additionally, it is a non-parametric approach that imposes no parametric or distributional assumptions. Eight simulation settings and one real-data analysis consistently demonstrate the excellent performance of the FunFor approach in various scenarios. In particular, FunFor successfully ranks the true predictors as the most important variables, while achieving the most robust variable selection and the smallest prediction errors when compared with three other relevant approaches. Although motivated by an analysis of biological leaf shape data, the proposed FunFor approach has great potential for wide application in various fields because it requires minimal parameter tuning and is distribution-free and model-free. An R package named 'FunFor', implementing the FunFor approach, is available on GitHub.

 
NSF-PAR ID:
10383769
Author(s) / Creator(s):
; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Scientific Reports
Volume:
11
Issue:
1
ISSN:
2045-2322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Motivated by mobile devices that record data at a high frequency, we propose a new methodological framework for analyzing a semi-parametric regression model that allows us to study a nonlinear relationship between a scalar response and multiple functional predictors in the presence of scalar covariates. Utilizing functional principal component analysis (FPCA) and the least-squares kernel machine method (LSKM), we are able to substantially extend the framework of semi-parametric regression models of scalar responses on scalar predictors by allowing multiple functional predictors to enter the nonlinear model. Regularization is established for feature selection in the setting of reproducing kernel Hilbert spaces. Our method simultaneously performs model fitting and variable selection on functional features. For the implementation, we propose an effective algorithm to solve the related optimization problems, in which iterations alternate between linear mixed-effects models and a variable selection method (e.g., sparse group lasso). We show algorithmic convergence results and theoretical guarantees for the proposed methodology. We illustrate its performance through simulation experiments and an analysis of accelerometer data.
  2. Abstract

    As researchers collect large amounts of data in the social sciences through household surveys, challenges may arise in how best to analyze such datasets, especially where motivating theories are unclear or conflicting. New analytical methods may be necessary to extract information from these datasets. Machine learning techniques are promising methods for identifying patterns in large datasets, but have not yet been widely used to identify important variables in social surveys with many questions. To demonstrate the potential of machine learning to analyze large social datasets, we apply machine learning techniques to the study of migration in Bangladesh. The complexity of migration decisions makes them suitable for analysis with machine learning techniques, which enable pattern identification in large datasets with many covariates. In this paper, we apply random forest methods to analyze a large survey that captures approximately 2000 variables from approximately 1700 households in southwestern Bangladesh. Our analysis ranked the covariates in the dataset in terms of their predictive power for migration decisions. The results identified the most important covariates, but there is a tradeoff between predictive ability and interpretability. To address this tradeoff, random forests and other machine learning algorithms may be especially useful in combination with more traditional regression methods. To develop insights into how the important variables identified by the random forest algorithm impact migration, we performed a survival analysis of household time to first migration. With this combined analysis, we found that variables related to wealth and household composition are important predictors of migration. Such multi-method approaches may help to shed light on factors contributing to migration and non-migration.

     
  3. Abstract

    As modeling tools and approaches become more advanced, ecological models are becoming more complex. Traditional sensitivity analyses can struggle to identify the nonlinearities and interactions emergent from such complexity, especially across broad swaths of parameter space. This limits understanding of the ecological mechanisms underlying model behavior. Machine learning approaches are a potential answer to this issue, given their predictive ability when applied to complex large datasets. While perceptions that machine learning is a “black box” linger, we seek to illuminate its interpretive potential in ecological modeling. To do so, we detail our process of applying random forests to complex model dynamics to both achieve high predictive accuracy and elucidate the ecological mechanisms driving our predictions. Specifically, we employ an empirically rooted ontogenetically stage-structured consumer-resource simulation model. Using simulation parameters as feature inputs and simulation output as dependent variables in our random forests, we extended feature analyses into a simple graphical analysis from which we reduced model behavior to three core ecological mechanisms. These ecological mechanisms reveal the complex interactions between internal plant demography and trophic allocation driving community dynamics while preserving the predictive accuracy achieved by our random forests.

     
  4. Abstract

    Accurately predicting species' range shifts in response to environmental change is paramount for understanding ecological processes and global change. In synthetic analyses, traits emerge as significant but weak predictors of species' range shifts across recent climate change. These studies assume linear responses to traits, while detailed empirical work often reveals trait responses that are unimodal and contain thresholds or other nonlinearities. We hypothesize that the use of linear modeling approaches fails to capture these nonlinearities and, therefore, may be under‐powering traits to predict range shifts. We evaluate the predictive performance of approaches that can capture nonlinear relationships (ridge‐regularized linear regression, support vector regression with linear and nonlinear kernels, and random forests). We apply our models using six multidecadal range shift datasets for plants, moths, marine fish, birds, and small mammals. We show that nonlinear approaches can perform better than least‐squares linear modeling in reproducing historical range shifts. Consistent with expectations, we identify dispersal and climatic niche traits as primary determinants of distribution shifts. Traits identified as important predictors and the direction of trait effects are generally consistent across models, but there are notable exceptions. Among important predictors, there are more consistent responses to climatic niches than dispersal ability. Modest improvements in predictability when accounting for nonlinearities and interactions, and the overall low amount of variance accounted for by trait predictors suggest limits to trait‐based statistical predictive frameworks.

     
  5. In this paper, we study efficient approaches to reachability analysis for discrete-time nonlinear dynamical systems when the dependencies among the variables of the system have low treewidth. Reachability analysis over nonlinear dynamical systems asks if a given set of target states can be reached, starting from an initial set of states. This is solved by computing conservative over-approximations of the reachable set using abstract domains to represent these approximations. However, most approaches must trade off the level of conservatism against the cost of performing analysis, especially when the number of system variables increases. This makes reachability analysis challenging for nonlinear systems with a large number of state variables. Our approach works by constructing a dependency graph among the variables of the system. The tree decomposition of this graph builds a tree wherein each node of the tree is labeled with subsets of the state variables of the system. Furthermore, the tree decomposition satisfies important structural properties. Using the tree decomposition, our approach abstracts a set of states of the high-dimensional system into a tree of sets of lower-dimensional projections of this state. We derive various properties of this abstract domain, including conditions under which the original high-dimensional set can be fully recovered from its low-dimensional projections. Next, we use ideas from message passing, developed originally for belief propagation over Bayesian networks, to perform reachability analysis over the full state space in an efficient manner. We illustrate our approach on some interesting nonlinear systems with low treewidth to demonstrate its advantages.