Title: Interaction Selection and Prediction Performance in High-Dimensional Data: A Comparative Study of Statistical and Tree-Based Methods
Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Methodological and computational advances have galvanized research in interaction selection. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification. We use interaction coverage to judge each algorithm’s efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies depending on the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. Each of the algorithms included in this study was favored in at least one scenario. However, from the general pattern, we are able to recommend one or more specific algorithms for some specific scenarios. Our analysis helps clarify each algorithm’s strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
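The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (confusion_counts, classification_metrics, mean_squared_error) and the toy labels are assumptions made for illustration, and the formulas are the standard textbook definitions for a binary outcome. Interaction coverage is not included because the abstract does not define it.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely; an undefined ratio is reported as 0.0 (a convention chosen here)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                           positive: int = 1) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity, balanced accuracy, and F1 score."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = _ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = _ratio(tn, tn + fp)   # true negative rate
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average squared difference between observed and predicted responses."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    truth = [1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 0, 0, 0, 0, 1, 0]
    for name, value in classification_metrics(truth, preds).items():
        print(f"{name}: {value:.4f}")
    print(f"mse: {mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]):.4f}")
```

Balanced accuracy averages sensitivity and specificity, so it is less affected by class imbalance than raw accuracy, which may be one reason to report both.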
Award ID(s):
2100729
PAR ID:
10530918
Author(s) / Creator(s):
; ;
Publisher / Repository:
EBSCO
Date Published:
Journal Name:
Journal of Data Science
ISSN:
1680-743X
Page Range / eLocation ID:
259 to 279
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Birtwistle, Marc R (Ed.)
    High-dimensional mixed-effects models are an increasingly important form of regression in which the number of covariates rivals or exceeds the number of samples, which are collected in groups or clusters. The penalized likelihood approach to fitting these models relies on a coordinate descent algorithm that lacks guarantees of convergence to a global optimum. Here, we empirically study the behavior of this algorithm on simulated and real examples of three types of data that are common in modern biology: transcriptome, genome-wide association, and microbiome data. Our simulations provide new insights into the algorithm’s behavior in these settings, and, comparing the performance of two popular penalties, we demonstrate that the smoothly clipped absolute deviation (SCAD) penalty consistently outperforms the least absolute shrinkage and selection operator (LASSO) penalty in terms of both variable selection and estimation accuracy across omics data. To empower researchers in biology and other fields to fit models with the SCAD penalty, we implement the algorithm in a Julia package, HighDimMixedModels.jl.
  2.
    Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided. 
  3. To address the growing challenges posed by cyber threats, anti-malware organizations have increasingly turned to machine learning (ML). In recent years, machine learning algorithms have become indispensable for solving complex classification problems, outperforming traditional statistical methods by capturing intricate patterns in high-dimensional data. However, selecting the optimal model requires rigorous evaluation across multiple performance metrics while ensuring stability across different data splits. In this study, we conducted a comprehensive assessment of eight machine learning algorithms: Random Forest (RF), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes, Light Gradient Boosting Machine (LightGBM), Decision Tree (DT), and k-Nearest Neighbors (KNN), using stratified 5-fold cross-validation. Our results reveal that RF, LightGBM, DT, and KNN achieve exceptional performance, with identical near-perfect scores in accuracy (0.9918), precision (0.9920), recall (0.9918), F1 score (0.9918), and Area Under the Receiver Operating Characteristic Curve (AUC-ROC) (0.9998), along with remarkably low variance (10^-6 to 10^-8), demonstrating unparalleled robustness. The study highlights the superiority of tree-based ensembles and KNN in achieving high predictive power and stability, whereas classical algorithms such as logistic regression and naive Bayes lag. Despite XGBoost’s reputation, its performance here is eclipsed by simpler tree-based methods. Our analysis underscores the importance of considering variance during model selection, particularly for critical applications where stability is paramount, and provides actionable insights for practitioners seeking reliable, high-accuracy classifiers.
  4. The objective of this study is to develop data-driven predictive models for seismic energy dissipation of rocking shallow foundations during earthquake loading using decision tree-based ensemble machine learning algorithms and supervised learning techniques. Data from a rocking foundation database consisting of dynamic base shaking experiments conducted on centrifuges and shaking tables have been used to develop a base decision tree regression (DTR) model and four ensemble models: bagging, random forest, adaptive boosting, and gradient boosting. Based on k-fold cross-validation tests of the models and mean absolute percentage errors in predictions, it is found that the overall average accuracy of all four ensemble models is improved by about 25%–37% when compared to the base DTR model. Among the four ensemble models, the gradient boosting and adaptive boosting models perform better than the other two in terms of accuracy and variance in predictions for the problem considered.
  5. Previous research into trust dynamics in human-autonomy interaction has demonstrated that individuals exhibit specific patterns of trust when interacting repeatedly with automated systems. Moreover, people with different types of trust dynamics have been shown to differ across seven personal characteristic dimensions: masculinity, positive affect, extraversion, neuroticism, intellect, performance expectancy, and high expectations. In this study, we develop classification models aimed at predicting an individual’s trust dynamics type (categorized as Bayesian decision-maker, disbeliever, or oscillator) based on these key dimensions. We employed multiple classification algorithms, including the random forest classifier, multinomial logistic regression, Support Vector Machine, XGBoost, and Naive Bayes, and conducted a comparative evaluation of their performance. The results indicate that personal characteristics can effectively predict the type of trust dynamics, achieving an accuracy rate of 73.1% and a weighted average F1 score of 0.64. This study underscores the predictive power of personal traits in the context of human-autonomy interaction.