Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Research on interaction selection has been galvanized by methodological and computational advances. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification. We use interaction coverage to judge each algorithm’s efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies depending on the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. Each algorithm included in this study was favored in at least one scenario. However, from the general pattern, we are able to recommend specific algorithms for certain scenarios. Our analysis helps clarify each algorithm’s strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
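As a rough illustration of the classification metrics named in the abstract above, the following sketch computes them from the four cells of a binary confusion matrix using their standard textbook definitions. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives. Illustrative helper only (hypothetical name).
    """
    def ratio(num, den):
        # Return 0.0 when a denominator is empty instead of dividing by zero.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)          # true positive rate (recall)
    specificity = ratio(tn, tn + fp)          # true negative rate
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    balanced_accuracy = (sensitivity + specificity) / 2
    f1 = ratio(2 * tp, 2 * tp + fp + fn)      # harmonic mean of precision and recall
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up counts, purely for demonstration.
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.4f}")
```

Balanced accuracy averages sensitivity and specificity, so it is less affected by class imbalance than plain accuracy.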
This content will become publicly available on January 13, 2026
HighDimMixedModels.jl: Robust high-dimensional mixed-effects models across omics data
High-dimensional mixed-effects models are an increasingly important form of regression in which the number of covariates rivals or exceeds the number of samples, which are collected in groups or clusters. The penalized likelihood approach to fitting these models relies on a coordinate descent algorithm that lacks guarantees of convergence to a global optimum. Here, we empirically study the behavior of this algorithm on simulated and real examples of three types of data that are common in modern biology: transcriptome, genome-wide association, and microbiome data. Our simulations provide new insights into the algorithm’s behavior in these settings, and, comparing the performance of two popular penalties, we demonstrate that the smoothly clipped absolute deviation (SCAD) penalty consistently outperforms the least absolute shrinkage and selection operator (LASSO) penalty in terms of both variable selection and estimation accuracy across omics data. To empower researchers in biology and other fields to fit models with the SCAD penalty, we implement the algorithm in a Julia package, HighDimMixedModels.jl.
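To make the contrast between the two penalties concrete, here is a minimal Python sketch of the LASSO penalty and the SCAD penalty in its usual piecewise form for a single coefficient. This is not the HighDimMixedModels.jl implementation or its API; the function names are hypothetical, and the default `a = 3.7` is the value commonly used in the SCAD literature, assumed here for illustration.

```python
def lasso_penalty(beta, lam):
    """LASSO penalty for one coefficient: grows linearly without bound."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty for one coefficient (standard piecewise form, a > 2).

    Matches the LASSO penalty for small |beta|, bends quadratically in the
    middle region, and is constant for large |beta|.
    """
    if a <= 2:
        raise ValueError("SCAD requires a > 2")
    t = abs(beta)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2


if __name__ == "__main__":
    lam = 1.0
    for beta in (0.5, 2.0, 10.0):
        print(beta, lasso_penalty(beta, lam), scad_penalty(beta, lam))
```

Because the SCAD penalty levels off once a coefficient exceeds `a * lam`, large effects are not shrunk further, whereas the LASSO penalty keeps growing with the coefficient's magnitude.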
- Award ID(s): 2144367
- PAR ID: 10627097
- Editor(s): Birtwistle, Marc R
- Publisher / Repository: PLOS Computational Biology
- Date Published:
- Journal Name: PLOS Computational Biology
- Volume: 21
- Issue: 1
- ISSN: 1553-7358
- Page Range / eLocation ID: e1012143
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
We consider the problem of inferring the conditional independence graph (CIG) of high-dimensional Gaussian vectors from multi-attribute data. Most existing methods for graph estimation are based on single-attribute models where one associates a scalar random variable with each node. In multi-attribute graphical models, each node represents a random vector. In this paper we consider a sparse-group smoothly clipped absolute deviation (SG-SCAD) penalty instead of sparse-group lasso (SGL) penalty to regularize the problem. We analyze an SG-SCAD-penalized log-likelihood based objective function to establish consistency of a local estimator of inverse covariance. A numerical example is presented to illustrate the advantage of SG-SCAD-penalty over the usual SGL-penalty.
-
We consider the problem of estimating the conditional independence graph (CIG) of a sparse, high-dimensional proper complex-valued Gaussian graphical model (CGGM). For CGGMs, the problem reduces to estimation of the inverse covariance matrix with more unknowns than the sample size. We consider a smoothly clipped absolute deviation (SCAD) penalty instead of the ℓ1-penalty to regularize the problem, and analyze a SCAD-penalized log-likelihood based objective function to establish consistency and sparsistency of a local estimator of inverse covariance in a neighborhood of the true value. A numerical example is presented to illustrate the advantage of SCAD-penalty over the usual ℓ1-penalty.
-
Four statistical selection methods for inferring transcription factor (TF)–target gene (TG) pairs were developed by coupling the mean squared error (MSE) or Huber loss function with the elastic net (ENET) or least absolute shrinkage and selection operator (Lasso) penalty. Two methods were also developed for inferring pathway gene regulatory networks (GRNs) by combining the Huber or MSE loss function with a network (Net)-based penalty. To solve these regressions, we improved an accelerated proximal gradient descent (APGD) algorithm to optimize parameter selection processes, resulting in an equally effective but much faster algorithm than the commonly used convex optimization solver. Synthetic data generated in a general setting were used to test the four TF–TG identification methods; ENET-based methods performed better than Lasso-based methods. Synthetic data generated from two network settings were used to test Huber-Net and MSE-Net, which outperformed all other methods. The TF–TG identification methods were also tested with SND1 and gl3 overexpression transcriptomic data; Huber-ENET and MSE-ENET outperformed all other methods when genome-wide predictions were performed. The TF–TG identification methods fill the gap left by the lack of a method for genome-wide TG prediction of a TF and have potential for validating ChIP/DAP-seq results, while the two Net-based methods are instrumental for predicting pathway GRNs.
-
We consider the problem of inferring the conditional independence graph (CIG) of high-dimensional Gaussian vectors from multi-attribute data. Most existing methods for graph estimation are based on single-attribute models where one associates a scalar random variable with each node. In multi-attribute graphical models, each node represents a random vector. In this paper we provide a unified theoretical analysis of multi-attribute graph learning using a penalized log-likelihood objective function. We consider both convex (sparse-group lasso) and sparse-group non-convex (log-sum and smoothly clipped absolute deviation (SCAD) penalties) penalty/regularization functions. An alternating direction method of multipliers (ADMM) approach coupled with local linear approximation to non-convex penalties is presented for optimization of the objective function. For non-convex penalties, theoretical analysis establishing local consistency in support recovery, local convexity and precision matrix estimation in high-dimensional settings is provided under two sets of sufficient conditions: with and without some irrepresentability conditions. We illustrate our approaches using both synthetic and real-data numerical examples. In the synthetic data examples the sparse-group log-sum penalized objective function significantly outperformed the lasso penalized as well as SCAD penalized objective functions with F1-score and Hamming distance as performance metrics.
