Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Methodological and computational advances have recently galvanized research in interaction selection. In this study, we investigate the performance of two types of predictive algorithms that can perform interaction selection, comparing the predictive performance and interaction selection accuracy of penalty-based and tree-based predictive algorithms. The penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are the random forest (RF) and the iterative random forest (iRF). We evaluate these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression, and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification; we use interaction coverage to judge each algorithm's efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies with the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. Each algorithm included in this study was favored by at least one scenario; nevertheless, the general patterns allow us to recommend specific algorithms for specific scenarios. Our analysis clarifies each algorithm's strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.
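To make the comparison concrete, here is a minimal, hedged sketch (not the study's actual pipeline) of the two routes to interaction selection on simulated data with one true hierarchical interaction: a penalty-based route that applies the LASSO to an explicitly expanded set of pairwise products, and a tree-based route that lets a random forest learn the interaction implicitly. It uses scikit-learn in Python; RAMP, SCAD, MCP, and iRF are not covered, and the data-generating model, sample size, and tuning choices below are illustrative assumptions.

```python
# Minimal sketch (not the paper's pipeline): compare LASSO with explicit pairwise
# interaction terms against a random forest on simulated data with one true interaction.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
# Hierarchical linear model: both main effects plus their product.
y = 2 * X[:, 0] + 1.5 * X[:, 1] + 3 * X[:, 0] * X[:, 1] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Penalty-based route: expand to all pairwise interactions, then apply the LASSO.
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Z_tr, Z_te = expand.fit_transform(X_tr), expand.transform(X_te)
lasso = LassoCV(cv=5).fit(Z_tr, y_tr)

# Tree-based route: random forest on the raw predictors (interactions learned implicitly).
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_tr, y_tr)

print("LASSO test MSE:", mean_squared_error(y_te, lasso.predict(Z_te)))
print("RF    test MSE:", mean_squared_error(y_te, rf.predict(X_te)))

# Interaction coverage check for the LASSO: is the x0*x1 term among the nonzero coefficients?
names = expand.get_feature_names_out([f"x{j}" for j in range(p)])
selected = {name for name, coef in zip(names, lasso.coef_) if abs(coef) > 1e-8}
print("x0 x1 selected:", "x0 x1" in selected)
```

In this toy setting, interaction coverage reduces to checking whether the x0·x1 product survives the LASSO path; the fuller classification metrics described in the abstract apply analogously in classification settings.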
Provable Boolean interaction recovery from tree ensemble obtained via random forests
Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features, and they have shown great promise for discovering the Boolean biological interactions that are central to advancing functional genomics and precision medicine. However, theoretical studies of how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively, DWP(S±) measures how frequently the features in S± appear together in the RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model even when some assumptions are violated.
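The sketch below is an illustrative toy, not the paper's construction: it simulates a response from a toy LSS-style regression function (a linear combination of piecewise constant Boolean interaction terms), fits a scikit-learn random forest, and computes a simple co-occurrence frequency of a signed feature set along root-to-leaf paths. The thresholds at 0.5, the particular signed sets, and the unweighted count are assumptions; the paper's DWP additionally weights by depth and carries the stated theoretical guarantee.

```python
# Illustrative sketch only: toy LSS-style data, a fitted random forest, and a
# simplified path co-occurrence count as a stand-in for the paper's DWP statistic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n, p = 1000, 10
X = rng.uniform(size=(n, p))

# Toy LSS-style regression function: two Boolean interaction terms with thresholds at 0.5.
y = (4.0 * ((X[:, 0] > 0.5) & (X[:, 1] > 0.5))
     + 3.0 * ((X[:, 2] > 0.5) & (X[:, 3] <= 0.5))
     + 0.1 * rng.normal(size=n))

rf = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=1).fit(X, y)

def path_cooccurrence(forest, signed_set):
    """Fraction of root-to-leaf paths on which every (feature, sign) pair in
    signed_set appears as a split; sign=+1 means a '>' branch, -1 a '<=' branch."""
    hits = total = 0
    for est in forest.estimators_:
        t = est.tree_
        def walk(node, seen):
            nonlocal hits, total
            if t.children_left[node] == -1:      # leaf node: one complete path
                total += 1
                hits += signed_set <= seen       # subset test: all signed features seen
                return
            f = t.feature[node]
            walk(t.children_left[node], seen | {(f, -1)})   # x[f] <= threshold branch
            walk(t.children_right[node], seen | {(f, +1)})  # x[f] >  threshold branch
        walk(0, frozenset())
    return hits / total

print("signal set {+x0, +x1}:", path_cooccurrence(rf, {(0, +1), (1, +1)}))
print("noise  set {+x8, +x9}:", path_cooccurrence(rf, {(8, +1), (9, +1)}))
```

On this toy example the signal pair {+x0, +x1} should appear together on far more paths than a noise pair, which is the intuition that DWP and LSSFind formalize.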
- PAR ID: 10374034
- Date Published:
- Journal Name: Proceedings of the National Academy of Sciences
- Volume: 119
- Issue: 22
- ISSN: 0027-8424
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving the performance of RF training, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. To provide efficient support for large datasets, we propose a hierarchical memory layout suited to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers various aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while reporting the best performance on GPU, our code variants outperform the CSR baseline both on GPU and FPGA. For high accuracy targets, our GPU implementation yields a 5-9× speedup over CSR and up to a 2× speedup over Nvidia's cuML library. (A minimal flat-array tree-traversal sketch appears after this list.)
- The statistical practice of modeling interaction with two linear main effects and a product term is ubiquitous in the statistical and epidemiological literature. Most data modelers are aware that misspecification of the main effects can cause severe type I error inflation in tests for interaction, leading to spurious detection of interactions, yet modeling practice has not changed. In this article, we focus on the specific situation where the main effects in the model are misspecified as linear terms and characterize the impact on common tests for statistical interaction. We then propose simple alternatives that fix the potential type I error inflation in testing interaction due to main effect misspecification. We show that when using the sandwich variance estimator for a linear regression model with a quantitative outcome and two independent factors, both the Wald and score tests asymptotically maintain the correct type I error rate. However, if the independence assumption does not hold or the outcome is binary, using the sandwich estimator does not fix the problem. We further demonstrate that flexibly modeling the main effects under a generalized additive model can largely reduce or often remove bias in the estimates and maintain the correct type I error rate for both quantitative and binary outcomes, regardless of the independence assumption. Under the independence assumption and for a continuous outcome, overfitting and flexibly modeling the main effects does not lead to asymptotic power loss relative to a correctly specified main effect model. Our simulation study further demonstrates that using flexible models for the main effects does not, in general, result in a significant loss of power for testing interaction. Our results provide an improved understanding of the strengths and limitations of tests for interaction in the presence of main effect misspecification. Using data from a large biobank study, the Michigan Genomics Initiative, we present two examples of interaction analysis in support of our results. (A toy robust-variance interaction test sketch appears after this list.)
- Accurate positioning of the mitotic spindle within the rounded cell body is critical to physiological maintenance. Mitotic cells encounter confinement from neighboring cells or the extracellular matrix (ECM), which can cause rotation of mitotic spindles and tilting of the metaphase plate (MP). To understand the effect of fiber (ECM) confinement on mitosis, we use flexible ECM-mimicking nanofibers that allow natural rounding of the cell body while confining it to differing levels. Rounded mitotic bodies are anchored in place by actin retraction fibers (RFs) originating from adhesions on fibers. We discover that the extent of confinement influences RF organization in 3D, forming triangular and band-like patterns on the cell cortex under low and high confinement, respectively. Our mechanistic analysis reveals that the patterning of RFs on the cell cortex is the primary driver of MP rotation. A stochastic Monte Carlo simulation of centrosome, chromosome, and membrane interactions and the 3D arrangement of RFs recovers the MP tilting trends observed experimentally. Under high ECM confinement, the fibers can mechanically pinch the cortex, causing the MP to have localized deformations at contact sites with fibers. Interestingly, high ECM confinement leads to both low and high MP tilts, which we mechanistically show to depend upon the extent of cortical deformation, RF patterning, and MP position. We identify that cortical deformation and RFs work in tandem to limit MP tilt, while asymmetric positioning of the MP leads to high tilts. Overall, we provide fundamental insights into how mitosis may proceed in ECM-confining microenvironments in vivo.
- Biological systems contain a large number of molecules with diverse interactions. A fruitful path to understanding these systems is to represent them with interaction networks and then describe flow processes in the network with a dynamic model. Boolean modeling, the simplest discrete dynamic modeling framework for biological networks, has proven its value in recapitulating experimental results and making predictions. A first step, and a major roadblock, to the widespread use of Boolean networks in biology is the laborious network inference and construction process. Here we present a streamlined network inference method that combines the discovery of a parsimonious network structure with the identification of the Boolean functions that determine the dynamics of the system. This inference method is based on a causal logic analysis method that associates a logic type (sufficient or necessary) with node-pair relationships (whether promoting or inhibitory). We use the causal logic framework to assimilate indirect information obtained from perturbation experiments and infer relationships that have not yet been documented experimentally. We apply this inference method to a well-studied process of hormone signaling in plants, the signaling underlying abscisic acid (ABA)-induced stomatal closure. Applying the causal logic inference method significantly reduces the manual work typically required for network and Boolean model construction, and the inferred model agrees with the manually curated model. We also test this method by re-inferring a network representing the epithelial-to-mesenchymal transition from a subset of the information that was initially used to construct the model. We find that the inference method performs well under various likely scenarios of inference input information. We conclude that our method is an effective approach to inferring biological networks and can become an efficient step in the iterative process between experiments and computations. (A toy Boolean network update sketch appears below.)
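Relating to the GPU/FPGA classification item above, a minimal hedged sketch: it flattens one hypothetical decision tree into parallel NumPy arrays (feature index, threshold, child indices, leaf value) and routes a whole batch of samples through it with vectorized indexing. This is not the paper's hierarchical layout, its CSR baseline, or its accelerator code; the example tree, the depth cap, and the leaf values are made up for illustration.

```python
# Minimal sketch of a flat, array-based decision-tree layout for batched inference.
# NOT the paper's hierarchical GPU/FPGA layout; it only illustrates replacing
# pointer-chasing with index arrays that map cleanly onto data-parallel hardware.
import numpy as np

# One hypothetical two-level tree stored as parallel node arrays:
# node i splits on feature[i] at threshold[i]; a negative feature marks a leaf whose
# prediction is stored in value[i]. Children are at left[i] / right[i].
feature   = np.array([0,   1,   -1,  -1,  -1])
threshold = np.array([0.5, 0.2, 0.0, 0.0, 0.0])
left      = np.array([1,   3,   -1,  -1,  -1])
right     = np.array([2,   4,   -1,  -1,  -1])
value     = np.array([0.0, 0.0, 1.0, -1.0, 2.0])

def predict_batch(X, max_depth=8):
    """Route all rows of X through the flat tree simultaneously."""
    node = np.zeros(len(X), dtype=int)          # every sample starts at the root
    for _ in range(max_depth):
        active = feature[node] >= 0             # samples not yet at a leaf
        if not active.any():
            break
        f = feature[node[active]]
        go_left = X[active, f] <= threshold[node[active]]
        node[active] = np.where(go_left, left[node[active]], right[node[active]])
    return value[node]

X = np.array([[0.3, 0.1], [0.3, 0.9], [0.9, 0.4]])
print(predict_batch(X))   # expected: [-1.  2.  1.]
```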
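Relating to the interaction-testing item above, a hedged toy demonstration of one ingredient it discusses: testing a product term with a sandwich (robust) covariance when a main effect is misspecified as linear. The data-generating model, sample size, and HC3 choice are assumptions, and the flexible (GAM/spline) main-effect alternative is only indicated in a comment; this is not the paper's simulation design.

```python
# Toy illustration: test an interaction term under a sandwich (robust) covariance.
# The main effect of x1 is deliberately nonlinear, so a purely linear main-effect
# model is misspecified even though no true interaction is present.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = df.x1 ** 2 + df.x2 + rng.normal(size=n)   # nonlinear main effect, no interaction

# Linear main effects + product term, with HC3 sandwich standard errors.
fit = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit(cov_type="HC3")
print(fit.pvalues["x1:x2"])   # Wald-type test of the interaction coefficient

# A flexible-main-effect alternative (e.g., spline or GAM terms for x1) would replace
# the linear x1 term; statsmodels' GLMGam or a patsy spline basis could be used there.
```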
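Relating to the Boolean network inference item above, a small sketch of the modeling framework it builds on rather than the causal logic inference method itself: a made-up three-node network with Boolean update rules, iterated synchronously until it settles into a fixed point.

```python
# Toy Boolean network dynamics (synchronous update). The nodes and rules here are
# invented for illustration, not the ABA signaling or EMT networks from the abstract.
def step(state):
    """One synchronous update of a three-node toy network."""
    a, b, c = state["A"], state["B"], state["C"]
    return {
        "A": a,                 # A is an input node, held fixed
        "B": a and not c,       # A is sufficient for B unless C inhibits it
        "C": a or b,            # either A or B is sufficient to activate C
    }

state = {"A": True, "B": False, "C": False}
seen = []
while state not in seen:        # iterate until a previously visited state recurs
    seen.append(state)
    state = step(state)
print("attractor state:", state)
```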