Title: Provable Boolean interaction recovery from tree ensemble obtained via random forests
Random Forests (RFs) are at the cutting edge of supervised machine learning in terms of prediction performance, especially in genomics. Iterative RFs (iRFs) use a tree ensemble from iteratively modified RFs to obtain predictive and stable nonlinear or Boolean interactions of features. They have shown great promise for Boolean biological interaction discovery that is central to advancing functional genomics and precision medicine. However, theoretical studies into how tree-based methods discover Boolean feature interactions are missing. Inspired by the thresholding behavior in many biological processes, we first introduce a discontinuous nonlinear regression model, called the “Locally Spiky Sparse” (LSS) model. Specifically, the LSS model assumes that the regression function is a linear combination of piecewise constant Boolean interaction terms. Given an RF tree ensemble, we define a quantity called “Depth-Weighted Prevalence” (DWP) for a set of signed features S±. Intuitively speaking, DWP(S±) measures how frequently the features in S± appear together in an RF tree ensemble. We prove that, with high probability, DWP(S±) attains a universal upper bound that does not involve any model coefficients, if and only if S± corresponds to a union of Boolean interactions under the LSS model. Consequently, we show that a theoretically tractable version of the iRF procedure, called LSSFind, yields consistent interaction discovery under the LSS model as the sample size goes to infinity. Finally, simulation results show that LSSFind recovers the interactions under the LSS model even when some assumptions are violated.
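To make the two ingredients of the abstract concrete, the sketch below simulates data from an LSS-style regression function (a sum of thresholded Boolean interaction terms), fits an off-the-shelf RF, and computes a simple prevalence of a signed feature set over root-to-leaf paths. This is a minimal sketch only: the coefficients, thresholds, and the unweighted path prevalence used here are illustrative assumptions, not the paper's exact DWP definition or the LSSFind procedure.

```python
# Minimal sketch: LSS-style data, an RF fit, and a naive signed-path prevalence.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.uniform(0, 1, size=(n, p))
# Assumed LSS-style regression function: piecewise constant Boolean terms,
# y = 3 * 1{x0 > .5} * 1{x1 > .5} + 2 * 1{x2 <= .5} + noise
y = (3.0 * ((X[:, 0] > 0.5) & (X[:, 1] > 0.5))
     + 2.0 * (X[:, 2] <= 0.5)
     + 0.1 * rng.standard_normal(n))

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

def path_prevalence(forest, signed_set):
    """Average fraction of root-to-leaf paths whose splits contain every
    (feature, sign) pair in signed_set; sign +1 means the path follows the
    'greater than threshold' branch. A toy stand-in for DWP."""
    totals = []
    for est in forest.estimators_:
        t = est.tree_
        hits = paths = 0
        stack = [(0, frozenset())]           # (node id, signed features so far)
        while stack:
            node, seen = stack.pop()
            if t.children_left[node] == -1:  # leaf node
                paths += 1
                hits += signed_set <= seen   # subset test
                continue
            f = int(t.feature[node])
            stack.append((t.children_left[node],  seen | {(f, -1)}))
            stack.append((t.children_right[node], seen | {(f, +1)}))
        totals.append(hits / paths)
    return float(np.mean(totals))

print(path_prevalence(rf, {(0, +1), (1, +1)}))  # true interaction: high
print(path_prevalence(rf, {(0, +1), (3, +1)}))  # spurious pair: near zero
```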
Award ID(s):
1741340 2015341
NSF-PAR ID:
10374034
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
119
Issue:
22
ISSN:
0027-8424
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving performance of the training of RFs, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. In order to provide efficient support for large datasets, we propose a hierarchical memory layout suited to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers various aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while reporting the best performance on GPU, our code variants outperform the CSR baseline both on GPU and FPGA. For high accuracy targets, our GPU implementation yields a 5-9× speedup over CSR, and up to a 2× speedup over Nvidia's cuML library.
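The general idea behind flat, pointer-free ensemble layouts such as CSR can be shown in a few lines. The sketch below stores a hand-built, illustrative one-tree ensemble as parallel node arrays and performs inference by index chasing; the paper's hierarchical GPU/FPGA layout is considerably more elaborate than this CPU toy.

```python
# Minimal sketch: structure-of-arrays tree storage and index-chasing inference,
# the basic idea behind flat forest layouts; illustrative single-stump ensemble.
import numpy as np

# Arrays indexed by node id; feature < 0 marks a leaf whose prediction is
# stored in `value`.
feature = np.array([0, -1, -1])   # root splits on feature 0
thresh  = np.array([0.5, 0.0, 0.0])
left    = np.array([1, -1, -1])   # left child node ids
right   = np.array([2, -1, -1])   # right child node ids
value   = np.array([0.0, -1.0, 1.0])
tree_roots = np.array([0])        # root node id of each tree in the ensemble

def predict(x):
    out = 0.0
    for root in tree_roots:
        node = root
        while feature[node] >= 0:  # descend until a leaf is reached
            node = left[node] if x[feature[node]] <= thresh[node] else right[node]
        out += value[node]
    return out / len(tree_roots)

print(predict(np.array([0.3])), predict(np.array([0.9])))  # -1.0 1.0
```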
  2. The statistical practice of modeling interaction with two linear main effects and a product term is ubiquitous in the statistical and epidemiological literature. Most data modelers are aware that misspecification of the main effects can potentially cause severe type I error inflation in tests for interaction, leading to spurious detection of interactions; however, modeling practice has not changed. In this article, we focus on the specific situation where the main effects in the model are misspecified as linear terms and characterize its impact on common tests for statistical interaction. We then propose some simple alternatives that fix the potential type I error inflation in testing interaction due to main effect misspecification. We show that when using the sandwich variance estimator for a linear regression model with a quantitative outcome and two independent factors, both the Wald and score tests asymptotically maintain the correct type I error rate. However, if the independence assumption does not hold or the outcome is binary, using the sandwich estimator does not fix the problem. We further demonstrate that flexibly modeling the main effect under a generalized additive model can largely reduce or often remove bias in the estimates and maintain the correct type I error rate for both quantitative and binary outcomes, regardless of the independence assumption. We also show that, under the independence assumption and for a continuous outcome, overfitting and flexibly modeling the main effects do not lead to asymptotic power loss relative to a correctly specified main effect model. Our simulation study further demonstrates empirically that using flexible models for the main effects does not result in a significant loss of power for testing interaction in general. Our results provide an improved understanding of the strengths and limitations of tests for interaction in the presence of main effect misspecification. Using data from the large biobank study “The Michigan Genomics Initiative”, we present two examples of interaction analysis in support of our results.
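One of the fixes discussed above, a sandwich variance estimator for the interaction Wald test under main-effect misspecification, can be sketched in a few lines. The data here are simulated and purely illustrative: a quantitative outcome, two independent factors, a misspecified (quadratic) main effect, and no true interaction.

```python
# Minimal sketch: robust (sandwich) Wald test for an interaction term when a
# main effect is misspecified as linear; illustrative simulated data only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)  # independent factors
# True model: nonlinear main effect in x1, NO interaction.
y = x1 + 0.5 * x1**2 + x2 + rng.standard_normal(n)

# Fit the usual linear-main-effects-plus-product model.
X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
fit = sm.OLS(y, X).fit(cov_type="HC3")  # sandwich variance estimator
print(fit.pvalues[3])  # Wald p-value for the product term; type I error held
```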

     
  3. Machine learning (ML) methods, such as artificial neural networks (ANN), k-nearest neighbors (kNN), random forests (RF), support vector machines (SVM), and boosted decision trees (DTs), may offer stronger predictive performance than more traditional, parametric methods, such as linear regression, multiple linear regression, and logistic regression (LR), for specific mapping and modeling tasks. However, this increased performance is often accompanied by increased model complexity and decreased interpretability, resulting in critiques of their “black box” nature and highlighting the need for algorithms that offer both strong predictive performance and interpretability. This is especially true when the global model and the predictions for specific data points need to be explainable for the model to be of use. Explainable boosting machines (EBMs), an augmentation and refinement of generalized additive models (GAMs), have been proposed as an empirical modeling method that offers both interpretable results and strong predictive performance. The trained model can be graphically summarized as a set of functions relating each predictor variable to the dependent variable, along with heat maps representing interactions between selected pairs of predictor variables. In this study, we assess EBMs for predicting the likelihood or probability of slope failure occurrence based on digital terrain characteristics in four separate Major Land Resource Areas (MLRAs) in the state of West Virginia, USA, and compare the results to those obtained with LR, kNN, RF, and SVM. EBM provided predictive accuracies comparable to RF and SVM and better than LR and kNN. The generated functions and visualizations for each predictor variable and for the included pairwise interactions, the estimates of variable importance based on mean absolute scores, and the per-variable contribution scores provided for new predictions all add interpretability, but additional work is needed to quantify how these outputs may be impacted by variable correlation, inclusion of interaction terms, and large feature spaces. Further exploration of EBM is merited for geohazard mapping and modeling in particular and spatial predictive mapping and modeling in general, especially when the value or use of the resulting predictions would be greatly enhanced by improved global interpretability and by the availability of prediction explanations at each cell or aggregating unit within the mapped or modeled extent.
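For readers who want to try EBMs, the interpret library provides an implementation; a minimal sketch on a stand-in synthetic dataset (not the West Virginia terrain data) might look like the following.

```python
# Minimal sketch: fitting an EBM with pairwise interaction terms using the
# interpret library; the dataset is synthetic and purely illustrative.
from interpret.glassbox import ExplainableBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ebm = ExplainableBoostingClassifier(interactions=5)  # allow 5 pairwise terms
ebm.fit(X_tr, y_tr)
print(ebm.score(X_te, y_te))        # held-out accuracy
global_exp = ebm.explain_global()   # shape functions + interaction heat maps
```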
  4. Abstract

    Seismological constraints obtained from receiver function (RF) analysis provide important information about the crust and mantle structure. Here, we explore the utility of the free‐surface multiple of the P‐wave (PP) and the corresponding conversions in RF analysis. Using earthquake records, we demonstrate the efficacy of PPs‐RFs before illustrating how they become especially useful when limited data is available in typical planetary missions. Using a transdimensional hierarchical Bayesian deconvolution approach, we compute robust P‐to‐S (Ps)‐ and PPs‐RFs with InSight recordings of five marsquakes. Our Ps‐RF results verify the direct Ps converted phases reported by previous RF analyses with increased coherence and reveal other phases, including the primary multiple reverberating within the uppermost layer of the Martian crust. Unlike the Ps‐RFs, our PPs‐RFs lack an arrival at 7.2 s lag time. Whereas the Ps‐RFs on Mars could be equally well fit by a two‐ or three‐layer crust, synthetic modeling shows that the disappearance of the 7.2 s phase requires a three‐layer crust and is highly sensitive to the velocity and thickness of intra‐crustal layers. We show that a three‐layer crust is also preferred by S‐to‐P (Sp)‐RFs. While the deepest interface of the three‐layer crust represents the crust‐mantle interface beneath the InSight landing site, the other two interfaces at shallower depths could represent a sharp transition between either fractured and unfractured materials or thick basaltic flows and pre‐existing crustal materials. PPs‐RFs can provide complementary constraints and maximize the extraction of information about crustal structure in data‐constrained circumstances such as planetary missions.
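The deconvolution step that turns two component seismograms into a receiver function can be illustrated with the classical water-level method; the study itself uses a transdimensional hierarchical Bayesian deconvolution, which is far more involved, so the sketch below (with a synthetic spike test) conveys only the basic idea.

```python
# Minimal sketch: classical water-level deconvolution of a receiver function,
# tested on synthetic spikes; the paper's Bayesian deconvolution differs.
import numpy as np

def receiver_function(radial, vertical, dt, water=1e-2, gauss=2.0):
    n = len(radial)
    R, Z = np.fft.rfft(radial, n), np.fft.rfft(vertical, n)
    denom = (Z * np.conj(Z)).real
    denom = np.maximum(denom, water * denom.max())      # water-level floor
    f = np.fft.rfftfreq(n, dt)
    G = np.exp(-(2 * np.pi * f) ** 2 / (4 * gauss**2))  # Gaussian low-pass
    return np.fft.irfft(R * np.conj(Z) / denom * G, n)

# Synthetic test: direct P on both components, Ps conversion at 7.2 s lag.
dt, n = 0.1, 512
z = np.zeros(n); z[50] = 1.0
r = np.zeros(n); r[50] = 1.0; r[50 + 72] = 0.4
rf = receiver_function(r, z, dt)
print(np.sort(np.argsort(rf)[-2:]) * dt)  # peaks at ~0.0 s and ~7.2 s
```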

     
  5. Summary

    Mexico has a complex geological history that is typified by the distinctive terranes found in its south-central region. Crustal thickness variations often correlate with geological terranes that have been altered by several processes in the past, for example aerial or subduction erosion, underplating of volcanic material, or rifting, but few geophysical studies have locally imaged the entire continental crust in Mexico. In this paper, the thickness of three layers of the crust in south-central Mexico is determined. To do this, we use P- and S-wave receiver functions (RFs) from 159 seismological broad-band stations. Exploiting its adaptive nature, we use an empirical mode decomposition (EMD) algorithm to reconstruct the RFs into intrinsic mode functions (IMFs) in order to enhance the pulses related to internal discontinuities within the crust. To inspect possible lateral variations, the RFs are grouped into quadrants of 90° and their amplitudes are mapped into thickness assuming a three-layer model. Using this approach, we identify a shallow sedimentary layer with a thickness in the range of 1–4 km. The upper crust was estimated to be a few kilometers (<10 km) thick near the Pacific coast, and thicker, approximately 15 km, in central Oaxaca and under the Trans-Mexican Volcanic Belt (TMVB). Close to the Pacific coast, we infer a thin crust of approximately 16 ± 0.9 km, while in central Oaxaca and beneath the TMVB we observe a thicker crust ranging between 30 and 50 km (± 2.0 km). We observe a crustal thinning of approximately 6 km from central Oaxaca (37 ± 1.9 km) towards the Gulf of Mexico, under the Veracruz Basin, where we estimate a crustal thickness of 31.6 ± 1.9 km. The boundary between the upper and lower crust, in comparison with the surface of the Moho, does not show significant variations other than the depth difference. We observe small crustal variations across the different terranes in the study area, with the thinnest crust located at the Pacific and Gulf of Mexico coasts. The thickest crust is estimated to be in central Oaxaca and beneath the TMVB.
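The time-to-depth step underlying numbers like those above is, in its simplest single-layer form, a standard formula: the depth of a converter is the Ps-P delay divided by the difference of the S and P vertical slownesses. A minimal sketch, with illustrative crustal velocities and ray parameter rather than the values estimated in the study:

```python
# Minimal sketch: single-layer Ps delay-to-depth conversion; vp, vs, and the
# ray parameter p are assumed illustrative values, not the study's estimates.
from math import sqrt

def ps_depth(t_ps, vp=6.3, vs=3.6, p=0.06):
    """Depth (km) of a converter from the Ps-P delay t_ps (s), layer
    velocities vp, vs (km/s), and ray parameter p (s/km)."""
    return t_ps / (sqrt(1 / vs**2 - p**2) - sqrt(1 / vp**2 - p**2))

print(ps_depth(4.5))  # ~36 km, a Moho-like depth comparable to central Oaxaca
```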

     