

Search for: All records

Award ID contains: 2100729

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Free, publicly-accessible full text available July 31, 2025
  2. Predictive modeling often ignores interaction effects among predictors in high-dimensional data because of analytical and computational challenges. Research in interaction selection has been galvanized along with methodological and computational advances. In this study, we aim to investigate the performance of two types of predictive algorithms that can perform interaction selection. Specifically, we compare the predictive performance and interaction selection accuracy of both penalty-based and tree-based predictive algorithms. Penalty-based algorithms included in our comparative study are the regularization path algorithm under the marginality principle (RAMP), the least absolute shrinkage and selection operator (LASSO), the smoothly clipped absolute deviation (SCAD), and the minimax concave penalty (MCP). The tree-based algorithms considered are random forest (RF) and iterative random forest (iRF). We evaluate the effectiveness of these algorithms under various regression and classification models with varying structures and dimensions. We assess predictive performance using the mean squared error for regression and accuracy, sensitivity, specificity, balanced accuracy, and F1 score for classification. We use interaction coverage to judge each algorithm’s efficacy for interaction selection. Our findings reveal that the effectiveness of the selected algorithms varies depending on the number of predictors (data dimension) and the structure of the data-generating model, i.e., linear or nonlinear, hierarchical or non-hierarchical. At least one scenario favored each of the algorithms included in this study. However, the general patterns allow us to recommend one or more specific algorithms for particular scenarios. Our analysis helps clarify each algorithm’s strengths and limitations, offering guidance to researchers and data analysts in choosing an appropriate algorithm for their predictive modeling task based on their data structure.

     
    Free, publicly-accessible full text available May 1, 2025
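
A minimal sketch of the comparison described in the record above, not the authors' code: it contrasts one penalty-based learner (LASSO on an explicitly expanded interaction design) with one tree-based learner (random forest) on simulated data containing a true pairwise interaction. RAMP, SCAD, MCP, and iRF require specialized packages and are omitted; the simulated model, sample sizes, and tuning choices below are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
# True model has main effects plus one pairwise interaction, x0*x1.
y = 2 * X[:, 0] - X[:, 1] + 3 * X[:, 0] * X[:, 1] + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Penalty-based: expand to all pairwise interaction terms, then fit LASSO.
expand = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
lasso = LassoCV(cv=5).fit(expand.fit_transform(X_tr), y_tr)
mse_lasso = mean_squared_error(y_te, lasso.predict(expand.transform(X_te)))

# Tree-based: random forest on the raw predictors (interactions learned implicitly).
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
mse_rf = mean_squared_error(y_te, rf.predict(X_te))

# A crude "interaction coverage" check: did LASSO keep the x0*x1 product term?
names = expand.get_feature_names_out([f"x{j}" for j in range(p)])
selected = {nm for nm, b in zip(names, lasso.coef_) if abs(b) > 1e-8}
print(f"LASSO MSE={mse_lasso:.2f}, RF MSE={mse_rf:.2f}, 'x0 x1' selected: {'x0 x1' in selected}")
```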
  3. Clustering is a fundamental tool for exploratory data analysis. One central problem in clustering is deciding if the clusters discovered by clustering methods are reliable as opposed to being artifacts of natural sampling variation. Statistical significance of clustering (SigClust) is a recently developed cluster evaluation tool for high-dimension, low-sample size data. Despite its successful application to many scientific problems, there are cases where the original SigClust may not work well. Furthermore, for specific applications, researchers may not have access to the original data and only have the dissimilarity matrix. In this case, clustering is still a valuable exploratory tool, but the original SigClust is not applicable. To address these issues, we propose a new SigClust method using multidimensional scaling (MDS). The underlying idea behind MDS-based SigClust is that one can achieve low-dimensional representations of the original data via MDS using only the dissimilarity matrix and then apply SigClust on the low-dimensional MDS space. The proposed MDS-based SigClust can circumvent the challenge of parameter estimation of the original method in high-dimensional spaces while keeping the essential clustering structure in the MDS space. Both simulations and real data applications demonstrate that the proposed method works remarkably well for assessing the statistical significance of clustering. Supplementary materials for this article are available online. 
    Free, publicly-accessible full text available January 2, 2025
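
A rough sketch of the MDS-based SigClust idea from the record above, not the authors' implementation: embed a dissimilarity matrix with multidimensional scaling, then compare the 2-means cluster index against a Gaussian null fitted to the embedded points by Monte Carlo. The Gaussian null below uses the sample covariance of the MDS coordinates, which simplifies SigClust's eigenvalue-based null; the toy data and simulation sizes are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

def cluster_index(Y):
    """Within-cluster SS of 2-means divided by total SS (smaller = stronger clustering)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Y)
    return km.inertia_ / ((Y - Y.mean(axis=0)) ** 2).sum()

def mds_sigclust(D, n_components=3, n_sim=200, seed=0):
    rng = np.random.default_rng(seed)
    # Low-dimensional representation from the dissimilarity matrix alone.
    Y = MDS(n_components=n_components, dissimilarity="precomputed",
            random_state=seed).fit_transform(D)
    ci_obs = cluster_index(Y)
    # Null: a single Gaussian fitted to the MDS coordinates (a simplification).
    mean, cov = Y.mean(axis=0), np.cov(Y, rowvar=False)
    ci_null = [cluster_index(rng.multivariate_normal(mean, cov, size=len(Y)))
               for _ in range(n_sim)]
    return ci_obs, np.mean(np.array(ci_null) <= ci_obs)  # empirical p-value

# Toy example: two well-separated groups, supplied only as pairwise distances.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(4, 1, (30, 5))])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(mds_sigclust(D))
```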
  4. Multi-modal data are prevalent in many scientific fields. In this study, we consider the parameter estimation and variable selection for a multi-response regression using block-missing multi-modal data. Our method allows the dimensions of both the responses and the predictors to be large, and the responses to be incomplete and correlated, a common practical problem in high-dimensional settings. Our proposed method uses two steps to make a prediction from a multi-response linear regression model with block-missing multi-modal predictors. In the first step, without imputing missing data, we use all available data to estimate the covariance matrix of the predictors and the cross-covariance matrix between the predictors and the responses. In the second step, we use these matrices and a penalized method to simultaneously estimate the precision matrix of the response vector, given the predictors, and the sparse regression parameter matrix. Lastly, we demonstrate the effectiveness of the proposed method using theoretical studies, simulated examples, and an analysis of a multi-modal imaging data set from the Alzheimer’s Disease Neuroimaging Initiative. 
    Free, publicly-accessible full text available January 1, 2025
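
A minimal sketch of the two-step idea in the record above, not the authors' estimator: step one forms moment estimates from pairwise-complete observations without imputation; step two runs a lasso-type coordinate descent on those moments for each response. The joint precision-matrix estimation of the actual method is omitted, and the block-missing toy data, penalty level, and iteration count are assumptions.

```python
import numpy as np
import pandas as pd

def soft(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_from_moments(S, r, lam, n_iter=200):
    """Minimize 0.5*b'Sb - r'b + lam*||b||_1 by coordinate descent."""
    b = np.zeros(len(r))
    for _ in range(n_iter):
        for j in range(len(r)):
            resid = r[j] - S[j] @ b + S[j, j] * b[j]
            b[j] = soft(resid, lam) / S[j, j]
    return b

# Toy block-missing data: modality 2 (columns 5:10) is missing for half the subjects.
rng = np.random.default_rng(0)
n, p, q = 200, 10, 2
X = rng.normal(size=(n, p))
B = np.zeros((p, q)); B[0, 0], B[3, 1] = 2.0, -1.5
Y = X @ B + rng.normal(size=(n, q))
X[:n // 2, 5:] = np.nan  # block missingness; no imputation below

df = pd.DataFrame(np.hstack([X, Y]))
C = df.cov().to_numpy()            # pairwise-complete covariance estimates
S_xx, S_xy = C[:p, :p], C[:p, p:]  # Cov(X) and Cov(X, Y) blocks

B_hat = np.column_stack([lasso_from_moments(S_xx, S_xy[:, k], lam=0.1) for k in range(q)])
print(np.round(B_hat, 2))
```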
  5. Data-driven individualized decision making problems have received considerable attention in recent years. In particular, decision makers aim to determine the optimal Individualized Treatment Rule (ITR) so that the expected specified outcome, averaging over heterogeneous patient-specific characteristics, is maximized. Many existing methods deal with binary or a moderate number of treatment arms and may not take potential treatment effect structure into account. However, the effectiveness of these methods may deteriorate when the number of treatment arms becomes large. In this article, we propose GRoup Outcome Weighted Learning (GROWL) to estimate the latent structure in the treatment space and the optimal group-structured ITRs through a single optimization. In particular, for estimating group-structured ITRs, we utilize the Reinforced Angle based Multicategory Support Vector Machines (RAMSVM) to learn group-based decision rules under the weighted angle based multi-class classification framework. Fisher consistency, the excess risk bound, and the convergence rate of the value function are established to provide a theoretical guarantee for GROWL. Extensive empirical results in simulation studies and real data analysis demonstrate that GROWL enjoys better performance than several other existing methods.
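
A bare-bones outcome weighted learning sketch for multi-arm ITR estimation, not GROWL itself: the group-structure estimation and the RAMSVM angle-based classifier described above are replaced by a weighted linear SVM, and the simulated trial, reward model, and weighting are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, p, K = 600, 5, 4
X = rng.normal(size=(n, p))
A = rng.integers(0, K, size=n)                      # randomized treatment, propensity 1/K
best = (X[:, 0] > 0).astype(int) * 2 + (X[:, 1] > 0).astype(int)  # true optimal arm
R = 1.0 * (A == best) + rng.normal(0, 0.1, size=n)  # reward, higher when treated optimally

# OWL surrogate: weighted multiclass classification with the observed arm as label
# and nonnegative outcome-based weights divided by the propensity score.
w = (R - R.min()) / (1.0 / K)
itr = LinearSVC().fit(X, A, sample_weight=w)

share_optimal = np.mean(itr.predict(X) == best)     # fraction routed to their optimal arm
print(f"share of patients routed to their optimal arm: {share_optimal:.2f}")
```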
  6. In the realm of cybersecurity, intrusion detection systems (IDS) detect and prevent attacks based on collected computer and network data. In recent research, IDS models have been constructed using machine learning (ML) and deep learning (DL) methods such as Random Forest (RF) and deep neural networks (DNN). Feature selection (FS) can be used to construct faster, more interpretable, and more accurate models. We look at three different FS techniques: RF information gain (RF-IG), correlation feature selection using the Bat Algorithm (CFS-BA), and CFS using the Aquila Optimizer (CFS-AO). Our results show CFS-BA to be the most efficient of the FS methods, building in 55% of the time of the best RF-IG model while achieving 99.99% of its accuracy. This reinforces prior contributions attesting to CFS-BA’s accuracy while building upon the relationship between subset size, CFS score, and RF-IG score in the final results.
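
A small sketch of the RF information-gain (RF-IG) style feature ranking mentioned above: rank features by impurity-based importance, keep the top few, and compare fit time and accuracy against the full feature set. CFS with the Bat or Aquila optimizers would require a metaheuristic library and is not shown; a bundled scikit-learn dataset stands in for the intrusion-detection data, and the forest sizes and top-10 cutoff are assumptions.

```python
import time
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def time_and_score(Xtr, Xte):
    t0 = time.perf_counter()
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, y_tr)
    return time.perf_counter() - t0, clf.score(Xte, y_te)

t_full, acc_full = time_and_score(X_tr, X_te)

# Rank features by impurity-based importance (an information-gain proxy), keep the top 10.
ranker = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
top = np.argsort(ranker.feature_importances_)[::-1][:10]
t_red, acc_red = time_and_score(X_tr[:, top], X_te[:, top])

print(f"full: {acc_full:.3f} in {t_full:.2f}s | top-10: {acc_red:.3f} in {t_red:.2f}s")
```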
  7. Learning optimal individualized treatment rules (ITRs) has become increasingly important in the modern era of precision medicine. Many statistical and machine learning methods for learning optimal ITRs have been developed in the literature. However, most existing methods are based on data collected from traditional randomized controlled trials and thus cannot take advantage of the accumulative evidence when patients enter the trials sequentially. It is also ethically important that future patients have a high probability of being treated optimally based on the updated knowledge so far. In this work, we propose a new design called sequentially rule-adaptive trials to learn optimal ITRs based on the contextual bandit framework, in contrast to the response-adaptive design in traditional adaptive trials. In our design, each entering patient is allocated with high probability to the current best treatment for this patient, which is estimated using the past data based on some machine learning algorithm (for example, outcome weighted learning in our implementation). We explore the tradeoff between training and test values of the estimated ITR in single-stage problems by proving theoretically that for a higher probability of following the estimated ITR, the training value converges to the optimal value at a faster rate, while the test value converges at a slower rate. This problem differs from traditional decision problems in the sense that the training data are generated sequentially and are dependent. We also develop a tool that combines martingale theory with empirical process techniques to tackle problems that cannot be solved by previous techniques for i.i.d. data. We show by numerical examples that, without much loss of the test value, our proposed algorithm can improve the training value significantly compared to existing methods. Finally, we use a real data study to illustrate the performance of the proposed method.
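
A stylized sketch of the rule-adaptive allocation idea above, not the paper's algorithm: each arriving patient receives the currently estimated best arm with probability 0.9 and a random arm otherwise, and the rule is refit as outcomes accumulate. Per-arm ridge regression stands in for the outcome weighted learning used in the paper; the outcome model, follow probability, and horizon are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def outcome(x, a):            # unknown truth: arm 1 helps only when x[0] > 0
    return (1.0 if (x[0] > 0) == bool(a) else 0.0) + rng.normal(0, 0.2)

X_hist, A_hist, R_hist, follow = [], [], [], 0.9
models = {0: None, 1: None}
for t in range(400):
    x = rng.normal(size=3)
    if all(m is not None for m in models.values()) and rng.random() < follow:
        a = int(models[1].predict([x])[0] > models[0].predict([x])[0])  # current best arm
    else:
        a = int(rng.integers(2))                                        # explore / burn-in
    X_hist.append(x); A_hist.append(a); R_hist.append(outcome(x, a))
    for arm in (0, 1):                                                  # refit per-arm models
        idx = [i for i, ai in enumerate(A_hist) if ai == arm]
        if len(idx) >= 5:
            models[arm] = Ridge().fit(np.array(X_hist)[idx], np.array(R_hist)[idx])

print(f"mean observed outcome (training value proxy): {np.mean(R_hist):.2f}")
```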
  8. Change point analysis aims to detect structural changes in a data sequence. It has been an active research area since it was introduced in the 1950s. In modern statistical applications, however, high-throughput data with increasing dimensions are ubiquitous in fields ranging from economics and finance to genetics and engineering. For those problems, the earlier works are typically no longer applicable. As a result, testing for a change point in high dimensional data sequences has been an important yet challenging task. In this paper, we first focus on models for at most one change point, and review recent state-of-the-art techniques for change point testing of high dimensional mean vectors and compare their theoretical properties. Based on that, we provide a survey of some extensions to general high dimensional parameters beyond mean vectors as well as strategies for testing multiple change points in high dimensions. Finally, we discuss some open problems for possible future research directions.
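
A generic, permutation-calibrated CUSUM-type check for a single mean change in a high-dimensional sequence; this is an illustrative sketch, not any of the specific tests surveyed above, and it assumes exchangeability under the null (no temporal dependence). The boundary trim, dimensions, and shift size are assumptions.

```python
import numpy as np

def cusum_stat(X):
    """Max over candidate split points of the squared, scaled mean difference."""
    n = len(X)
    stats = []
    for t in range(10, n - 10):              # trim the boundaries
        scale = t * (n - t) / n
        diff = X[:t].mean(axis=0) - X[t:].mean(axis=0)
        stats.append(scale * np.sum(diff ** 2))
    return max(stats), 10 + int(np.argmax(stats))

def permutation_pvalue(X, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    obs, tau = cusum_stat(X)
    null = [cusum_stat(rng.permutation(X))[0] for _ in range(n_perm)]
    return tau, np.mean(np.array(null) >= obs)

rng = np.random.default_rng(1)
p, n = 300, 120                               # dimension larger than sample size
X = rng.normal(size=(n, p))
X[60:, :20] += 0.8                            # mean shift in the first 20 coordinates after t=60
print(permutation_pvalue(X))                  # estimated change location and p-value
```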
  9. In this article, we investigate the problem of simultaneous change point inference and structure recovery in the context of high dimensional Gaussian graphical models with possible abrupt changes. In particular, motivated by neighborhood selection, we incorporate a threshold variable and an unknown threshold parameter into a joint sparse regression model which combines p ℓ1-regularized node-wise regression problems. The change point estimator and the corresponding estimated coefficients of precision matrices are obtained together. Based on that, a classifier is introduced to distinguish whether a change point exists. To recover the graphical structure correctly, a data-driven thresholding procedure is proposed. In theory, under some sparsity conditions and regularity assumptions, our method can correctly choose a homogeneous or heterogeneous model with high accuracy. Furthermore, in the latter case with a change point, we establish estimation consistency of the change point estimator while allowing the number of nodes to be much larger than the sample size. Moreover, it is shown that, in terms of structure recovery of Gaussian graphical models, the proposed thresholding procedure achieves model selection consistency and controls the number of false positives. The validity of our proposed method is justified via extensive numerical studies. Finally, we apply our proposed method to the S&P 500 dataset to show its empirical usefulness.
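
A compact sketch of the threshold neighborhood-selection idea in the record above, a simplification rather than the authors' joint estimator: for each candidate change point in the threshold variable z, every node is lasso-regressed on the other nodes plus their interactions with 1{z > tau}, and the tau with the smallest pooled loss is kept. The simulated graph, penalty level, and candidate grid are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, tau_true = 300, 10, 0.0
z = np.sort(rng.normal(size=n))                      # observed threshold variable
X = rng.normal(size=(n, p))
X[z > tau_true, 1] += 0.8 * X[z > tau_true, 0]       # edge 0-1 appears only after the change

def pooled_loss(tau):
    ind = (z > tau).astype(float)[:, None]
    total = 0.0
    for j in range(p):
        others = np.delete(X, j, axis=1)
        design = np.hstack([others, others * ind])    # pre- and post-change neighborhoods
        fit = Lasso(alpha=0.05).fit(design, X[:, j])
        total += np.mean((X[:, j] - fit.predict(design)) ** 2)
    return total

grid = np.quantile(z, np.linspace(0.15, 0.85, 15))    # candidate change points
tau_hat = grid[np.argmin([pooled_loss(t) for t in grid])]
print(f"estimated change point in z: {tau_hat:.2f}")
```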