This content will become publicly available on May 1, 2025
Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Research in interaction selection has been galvanized along with methodological and computational advances. In this study, we aim to investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of both penalty-based and tree-based predictive algorithms. Penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage selector operator (LASSO), the smoothed clipped absolute deviance (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification. We use interaction coverage to judge the algorithm’s efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies depending on the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. There were at least one or more scenarios that favored each of the algorithms included in this study. However, from the general pattern, we are able to recommend one or more specific algorithm(s) for some specific scenarios. Our analysis helps clarify each algorithm’s strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
more » « less- Award ID(s):
- 2100729
- PAR ID:
- 10530918
- Publisher / Repository:
- EBSCO
- Date Published:
- Journal Name:
- Journal of Data Science
- ISSN:
- 1680-743X
- Page Range / eLocation ID:
- 259 to 279
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
null (Ed.)Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim to take a step forward in this direction by demonstrating that the additional randomness injected into individual trees serves as a form of implicit regularization, making random forests an ideal model in low signal-to-noise ratio (SNR) settings. Specifically, from a model-complexity perspective, we show that the mtry parameter in random forests serves much the same purpose as the shrinkage penalty in explicitly regularized regression procedures like lasso and ridge regression. To highlight this point, we design a randomized linear-model-based forward selection procedure intended as an analogue to tree-based random forests and demonstrate its surprisingly strong empirical performance. Numerous demonstrations on both real and synthetic data are provided.more » « less
-
The objective of this study is to develop data-driven predictive models for seismic energy dissipation of rocking shallow foundations during earthquake loading using decision tree-based ensemble machine learning algorithms and supervised learning technique. Data from a rocking foundation’s database consisting of dynamic base shaking experiments conducted on centrifuges and shaking tables have been used for the development of a base decision tree regression (DTR) model and four ensemble models: bagging, random forest, adaptive boosting, and gradient boosting. Based on k-fold cross-validation tests of models and mean absolute percentage errors in predictions, it is found that the overall average accuracy of all four ensemble models is improved by about 25%–37% when compared to base DTR model. Among the four ensemble models, gradient boosting and adaptive boosting models perform better than the other two models in terms of accuracy and variance in predictions for the problem considered.more » « less
-
Abstract This paper introduces a study on the classification of aortic stenosis (AS) based on cardio-mechanical signals collected using non-invasive wearable inertial sensors. Measurements were taken from 21 AS patients and 13 non-AS subjects. A feature analysis framework utilizing Elastic Net was implemented to reduce the features generated by continuous wavelet transform (CWT). Performance comparisons were conducted among several machine learning (ML) algorithms, including decision tree, random forest, multi-layer perceptron neural network, and extreme gradient boosting. In addition, a two-dimensional convolutional neural network (2D-CNN) was developed using the CWT coefficients as images. The 2D-CNN was made with a custom-built architecture and a CNN based on Mobile Net via transfer learning. After the reduction of features by 95.47%, the results obtained report 0.87 on accuracy by decision tree, 0.96 by random forest, 0.91 by simple neural network, and 0.95 by XGBoost. Via the 2D-CNN framework, the transfer learning of Mobile Net shows an accuracy of 0.91, while the custom-constructed classifier reveals an accuracy of 0.89. Our results validate the effectiveness of the feature selection and classification framework. They also show a promising potential for the implementation of deep learning tools on the classification of AS.
-
This paper demonstrated how to apply Machine Learning (ML) techniques to analyze student interaction data collected in an online mathematics game. Using a data-driven approach, we examined 1) how different ML algorithms influenced the precision of middle-school students’ (N = 359) performance (i.e. posttest math knowledge scores) prediction and 2) what types of in-game features (i.e. student in-game behaviors, math anxiety, mathematical strategies) were associated with student math knowledge scores. The results indicated that the Random Forest algorithm showed the best performance (i.e. the accuracy of models, error measures) in predicting posttest math knowledge scores among the seven algorithms employed. Out of 37 features included in the model, the validity of the students’ first mathematical transformation was the most predictive of their posttest math knowledge scores. Implications for game learning analytics and supporting students’ algebraic learning are discussed based on the findings.more » « less
-
Previous research into trust dynamics in human-autonomy interaction has demonstrated that individuals exhibit specific patterns of trust when interacting repeatedly with automated systems. Moreover, people with different types of trust dynamics have been shown to differ across seven personal characteristic dimensions: masculinity, positive affect, extraversion, neuroticism, intellect, performance expectancy, and high expectations. In this study, we develop classification models aimed at predicting an individual’s trust dynamics type–categorized as Bayesian decision-maker, disbeliever, or oscillator–based on these key dimensions. We employed multiple classification algorithms including the random forest classifier, multinomial logistic regression, Support Vector Machine, XGBoost, and Naive Bayes, and conducted a comparative evaluation of their performance. The results indicate that personal characteristics can effectively predict the type of trust dynamics, achieving an accuracy rate of 73.1%, and a weighted average F1 score of 0.64. This study underscores the predictive power of personal traits in the context of human-autonomy interaction.