Title: Interaction Selection and Prediction Performance in High-Dimensional Data: A Comparative Study of Statistical and Tree-Based Methods
Predictive modeling in high-dimensional data often ignores interaction effects among predictors because of analytical and computational challenges. Methodological and computational advances have galvanized research on interaction selection. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification, and we use interaction coverage to judge each algorithm’s efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies with the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. Each algorithm in this study was favored in at least one scenario, but the general patterns allow us to recommend one or more specific algorithms for particular scenarios. Our analysis clarifies each algorithm’s strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
Award ID(s):
2100729
PAR ID:
10530918
Author(s) / Creator(s):
Publisher / Repository:
EBSCO
Date Published:
Journal Name:
Journal of Data Science
ISSN:
1680-743X
Page Range / eLocation ID:
259 to 279
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Birtwistle, Marc R (Ed.)
    High-dimensional mixed-effects models are an increasingly important form of regression in which the number of covariates rivals or exceeds the number of samples, which are collected in groups or clusters. The penalized likelihood approach to fitting these models relies on a coordinate descent algorithm that lacks guarantees of convergence to a global optimum. Here, we empirically study the behavior of this algorithm on simulated and real examples of three types of data that are common in modern biology: transcriptome, genome-wide association, and microbiome data. Our simulations provide new insights into the algorithm’s behavior in these settings, and, comparing the performance of two popular penalties, we demonstrate that the smoothly clipped absolute deviation (SCAD) penalty consistently outperforms the least absolute shrinkage and selection operator (LASSO) penalty in terms of both variable selection and estimation accuracy across omics data. To empower researchers in biology and other fields to fit models with the SCAD penalty, we implement the algorithm in a Julia package, HighDimMixedModels.jl.
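The two penalties compared above can be written in a few lines; the sketch below uses the conventional tuning constant a = 3.7 for SCAD, which is an assumption on my part, not taken from the abstract.

```python
# Sketch of the LASSO and SCAD penalty functions (a = 3.7 is the
# conventional SCAD constant, assumed here).
import numpy as np

def lasso_penalty(beta, lam):
    # LASSO penalizes linearly everywhere, biasing large coefficients.
    return lam * np.abs(beta)

def scad_penalty(beta, lam, a=3.7):
    # SCAD matches LASSO near zero but flattens for large |beta|,
    # which is why it tends to estimate large effects with less bias.
    b = np.abs(beta)
    small = lam * b
    mid = (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1))
    large = lam**2 * (a + 1) / 2
    return np.where(b <= lam, small, np.where(b <= a * lam, mid, large))

lam = 1.0
betas = np.array([0.5, 1.0, 5.0])
print(scad_penalty(betas, lam))   # 0.5, 1.0, then a flat 2.35
print(lasso_penalty(betas, lam))  # 0.5, 1.0, 5.0 (keeps growing)
```

The flattening of SCAD beyond a·λ is the mechanism behind the reduced estimation bias the study reports relative to LASSO.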
  2.
    Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided. 
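The mtry-as-regularization idea above can be probed directly: in scikit-learn, `max_features` plays the role of mtry. The low-SNR simulation below is my own assumed setup, intended only to show how the parameter is varied, not to reproduce the paper's experiments.

```python
# Sketch: varying max_features (sklearn's analogue of mtry) in a
# low signal-to-noise simulation (assumed setup, not the paper's).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n, p = 300, 20
X = rng.normal(size=(n, p))
# One weak signal buried in heavy noise: a low-SNR regime.
y = X[:, 0] + rng.normal(scale=3.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
mses = {}
for m in (p, p // 3, 1):  # from no extra randomness to heavy regularization
    rf = RandomForestRegressor(n_estimators=200, max_features=m,
                               random_state=1).fit(X_tr, y_tr)
    mses[m] = mean_squared_error(y_te, rf.predict(X_te))
    print(f"max_features={m:2d}  test MSE={mses[m]:.3f}")
```

Smaller `max_features` injects more randomness per split, which the abstract argues acts like a larger shrinkage penalty in lasso or ridge regression.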
    The objective of this study is to develop data-driven predictive models for seismic energy dissipation of rocking shallow foundations during earthquake loading using decision tree-based ensemble machine learning algorithms and supervised learning techniques. Data from a rocking-foundation database of dynamic base-shaking experiments conducted on centrifuges and shaking tables were used to develop a base decision tree regression (DTR) model and four ensemble models: bagging, random forest, adaptive boosting, and gradient boosting. Based on k-fold cross-validation tests of the models and mean absolute percentage errors in predictions, the overall average accuracy of all four ensemble models improves by about 25%–37% compared to the base DTR model. Among the four ensemble models, the gradient boosting and adaptive boosting models perform better than the other two in terms of accuracy and variance in predictions for the problem considered.
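The evaluation protocol described above (base DTR vs. an ensemble, scored by k-fold cross-validated MAPE) can be sketched as follows. The rocking-foundation database is not reproduced here, so the synthetic response below is an assumed stand-in.

```python
# Hedged sketch of the k-fold CV + MAPE comparison described above,
# on synthetic data standing in for the rocking-foundation database.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 8))
# Positive-valued response so percentage error is well defined.
y = 100 + 10 * X[:, 0] + 5 * X[:, 1] * X[:, 2] + rng.normal(scale=5.0, size=n)

results = {}
for name, model in [("base DTR", DecisionTreeRegressor(random_state=2)),
                    ("gradient boosting", GradientBoostingRegressor(random_state=2))]:
    # 5-fold CV scored by (negated) mean absolute percentage error.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_percentage_error")
    results[name] = -scores.mean()
    print(f"{name}: mean MAPE = {results[name]:.3f}")
```

As in the study, the ensemble typically posts a lower cross-validated MAPE than a single fully grown tree, since boosting averages away the individual tree's variance.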
  4. Abstract This paper introduces a study on the classification of aortic stenosis (AS) based on cardio-mechanical signals collected using non-invasive wearable inertial sensors. Measurements were taken from 21 AS patients and 13 non-AS subjects. A feature-analysis framework utilizing Elastic Net was implemented to reduce the features generated by the continuous wavelet transform (CWT). Performance comparisons were conducted among several machine learning (ML) algorithms, including decision tree, random forest, multi-layer perceptron neural network, and extreme gradient boosting (XGBoost). In addition, two-dimensional convolutional neural networks (2D-CNNs) were developed using the CWT coefficients as images: one with a custom-built architecture and one based on MobileNet via transfer learning. After reducing the features by 95.47%, the classifiers achieved accuracies of 0.87 (decision tree), 0.96 (random forest), 0.91 (multi-layer perceptron), and 0.95 (XGBoost). Within the 2D-CNN framework, the transfer-learned MobileNet reached an accuracy of 0.91, while the custom-built classifier reached 0.89. Our results validate the effectiveness of the feature selection and classification framework and show promising potential for applying deep learning tools to the classification of AS.
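The Elastic Net feature-screening step above can be sketched in a few lines. The dimensions, penalty settings, and synthetic stand-ins for the CWT coefficients below are assumptions; the actual wearable-sensor data is not reproduced here.

```python
# Sketch of Elastic-Net-based feature reduction (assumed dimensions
# and penalties; synthetic stand-in for the CWT coefficient features).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
n, p = 120, 500                       # many CWT-derived features, few subjects
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 2.0                       # only a few features carry signal
y = X @ beta + rng.normal(size=n)

# Elastic Net mixes an L1 term (sparsity) with an L2 term (stability
# under correlated features); nonzero coefficients are the kept features.
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
kept = np.flatnonzero(np.abs(enet.coef_) > 1e-8)
print(f"kept {kept.size} of {p} features "
      f"({100 * (1 - kept.size / p):.1f}% reduction)")
```

The surviving columns would then feed the downstream classifiers (decision tree, random forest, etc.), mirroring the pipeline the abstract describes.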
  5. Abstract. The coastal area of New Hanover County in North Carolina encompasses diverse wetland habitats shaped by unique coastal and tidal dynamics, and researchers there examine the impacts of landscape change, sea-level rise, and climate fluctuations on wetland health and biodiversity. This study integrates multispectral imagery, LiDAR, and additional data sources to enhance classification accuracy. It addresses both binary classification (wetland vs. non-wetland) and multi-class classification of different wetland types, leveraging the Random Forest algorithm, which significantly improved the overall accuracy of wetland mapping. The Random Forest model’s performance was evaluated across four scenarios, achieving overall accuracies of nearly 93.9% (Scenario 1), 93.5% (Scenario 2), 94.1% (Scenario 3), and 88.2% (Scenario 4). These results underscore the model’s effectiveness in accurately classifying coastal wetland areas under diverse remote sensing scenarios and highlight its potential for practical applications in wetland mapping and ecological research.
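The overall-accuracy evaluation described above can be sketched as follows. The synthetic multi-class data stands in for the multispectral/LiDAR feature stack, and the class count and model settings are assumptions, not details from the study.

```python
# Sketch of overall-accuracy scoring for multi-class Random Forest
# wetland mapping (synthetic stand-in for the imagery/LiDAR features).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Four assumed wetland classes; 12 features standing in for the
# multispectral bands plus LiDAR-derived layers.
X, y = make_classification(n_samples=600, n_features=12, n_informative=8,
                           n_classes=4, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

rf = RandomForestClassifier(n_estimators=300, random_state=4).fit(X_tr, y_tr)
acc = accuracy_score(y_te, rf.predict(X_te))
print(f"overall accuracy: {acc:.3f}")
```

Repeating this with different input stacks (e.g., with and without LiDAR) is one way to produce the per-scenario accuracies the abstract reports.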