Title: Fast Lasso‐type safe screening for Fine‐Gray competing risks model with ultrahigh dimensional covariates
The Fine‐Gray proportional sub‐distribution hazards (PSH) model is among the most popular regression models for competing risks time‐to‐event data. This article develops a fast safe feature elimination method, named PSH‐SAFE, for fitting the penalized Fine‐Gray PSH model with a Lasso (or adaptive Lasso) penalty. The PSH‐SAFE procedure is straightforward to implement, fast, and scales well to ultrahigh dimensional data. We also show that, as a feature screening procedure, PSH‐SAFE is safe in the sense that the eliminated features are guaranteed to be inactive in the original Lasso (or adaptive Lasso) estimator for the penalized PSH model. We evaluate the PSH‐SAFE procedure in terms of computational efficiency, screening efficiency and safety, run‐time, and prediction accuracy on multiple simulated datasets and a real bladder cancer dataset. Our empirical results show that PSH‐SAFE possesses desirable screening efficiency and safety properties and can offer substantially improved computational efficiency, as well as similar or better prediction performance, in comparison to its baseline competitors.
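To illustrate what "safe" elimination means, here is a minimal sketch of the basic SAFE rule for the ordinary least-squares Lasso, min_b 0.5*||y - Xb||^2 + lam*||b||_1. This is an assumption made for illustration only: the abstract does not state the PSH‐SAFE rule itself, and the function name safe_screen and the column-list input format are hypothetical. The rule discards feature j when |x_j'y| < lam - ||x_j||*||y||*(lam_max - lam)/lam_max, where lam_max = max_j |x_j'y|; any feature discarded this way has a zero coefficient in the Lasso solution at that lam.

```python
import math

def safe_screen(X_cols, y, lam):
    """Indices of features the basic SAFE rule discards at penalty lam.

    Illustrative only: this is the rule for the squared-error Lasso,
    not the PSH-SAFE rule for the Fine-Gray model.
    X_cols is a list of feature columns, each the same length as y.
    """
    # Absolute inner product of every feature column with the response.
    dots = [abs(sum(a * b for a, b in zip(col, y))) for col in X_cols]
    lam_max = max(dots)  # smallest penalty at which all coefficients are zero
    if lam >= lam_max:
        return list(range(len(X_cols)))  # the Lasso solution is all zeros
    y_norm = math.sqrt(sum(v * v for v in y))
    discarded = []
    for j, col in enumerate(X_cols):
        col_norm = math.sqrt(sum(v * v for v in col))
        threshold = lam - col_norm * y_norm * (lam_max - lam) / lam_max
        if dots[j] < threshold:
            discarded.append(j)
    return discarded

# Toy data: three features, three observations.
y = [1.0, 0.0, 0.0]
X_cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
print(safe_screen(X_cols, y, 0.9))
print(safe_screen(X_cols, y, 0.1))
```

The rule needs only inner products and norms, so screening costs one pass over the features before any model is fitted. It removes more features as lam approaches lam_max and may remove none when lam is small.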
Award ID(s):
2205441
PAR ID:
10533019
Publisher / Repository:
Wiley
Journal Name:
Statistics in Medicine
Volume:
41
Issue:
24
ISSN:
0277-6715
Page Range / eLocation ID:
4941 to 4960
Sponsoring Org:
National Science Foundation
More Like this
  1. Predictive models play a central role in decision making. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after variable screening methods have been applied to reduce the number of variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh dimensional predictors. Our procedure can provide a final model; control both false negatives and false positives; and yield consistent estimates, which are useful to gauge the actual effect size of risk factors. Simulations and applications to two clinical studies verify the utility of the method.
  2. We propose a method for supervised learning with multiple sets of features (“views”). The multiview problem is especially important in biology and medicine, where “-omics” data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. “Cooperative learning” combines the usual squared-error loss of predictions with an “agreement” penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion. 
  3. In modern statistical applications, identifying critical features in high‐dimensional data is essential for scientific discoveries. Traditional best subset selection methods face computational challenges, while regularization approaches such as Lasso, SCAD and their variants often exhibit poor performance with ultrahigh‐dimensional data. Sure screening methods, widely used for dimensionality reduction, have been developed as popular alternatives, but few target heavy‐tailed characteristics in modern big data. This paper introduces a new sure screening method, based on robust distance correlation (‘RDC’), designed for heavy‐tailed data. The proposed method inherits the benefits of the original model‐free distance correlation‐based screening while robustly estimating distance correlation in the presence of heavy‐tailed data. We further develop an FDR control procedure by incorporating the Reflection via Data Splitting (REDS) method. Extensive simulations demonstrate the method's advantage over existing screening procedures under different scenarios of heavy‐tailedness. Its application to high‐dimensional heavy‐tailed RNA‐seq data from The Cancer Genome Atlas (TCGA) pancreatic cancer cohort showcases superior performance in identifying biologically meaningful genes predictive of MAPK1 protein expression critical to pancreatic cancer.
  4. Large datasets make it possible to build predictive models that can capture heterogeneous relationships between the response variable and features. The mixture of high-dimensional linear experts model posits that observations come from a mixture of high-dimensional linear regression models, where the mixture weights are themselves feature-dependent. In this article, we show how to construct valid prediction sets for an ℓ1-penalized mixture of experts model in the high-dimensional setting. We make use of a debiasing procedure to account for the bias induced by the penalization and propose a novel strategy for combining intervals to form a prediction set with coverage guarantees in the mixture setting. Synthetic examples and an application to the prediction of critical temperatures of superconducting materials show our method to have reliable practical performance.
  5. Health care–associated infections due to multidrug-resistant organisms (MDROs), such as methicillin-resistant Staphylococcus aureus (MRSA) and Clostridioides difficile (CDI), place a significant burden on our health care infrastructure. Screening for MDROs is an important mechanism for preventing spread but is resource intensive. The objective of this study was to develop automated tools that can predict colonization or infection risk using electronic health record (EHR) data, provide useful information to aid infection control, and guide empiric antibiotic coverage. We retrospectively developed a machine learning model to detect MRSA colonization and infection in undifferentiated patients at the time of sample collection from hospitalized patients at the University of Virginia Hospital. We used clinical and nonclinical features derived from on-admission and throughout-stay information from the patient’s EHR data to build the model. In addition, we used a class of features derived from contact networks in EHR data; these network features can capture patients’ contacts with providers and other patients, improving model interpretability and accuracy for predicting the outcome of surveillance tests for MRSA. Finally, we explored heterogeneous models for different patient subpopulations, for example, those admitted to an intensive care unit or emergency department or those with specific testing histories, and found that these models perform better. We found that penalized logistic regression performs better than other methods, and that its performance, measured by the area under the receiver operating characteristic curve (ROC-AUC), improves by nearly 11% when we use a polynomial (second-degree) transformation of the features. Some significant features in predicting MDRO risk include antibiotic use, surgery, use of devices, dialysis, patient’s comorbidity conditions, and network features. Among these, network features add the most value and improve the model’s performance by at least 15%. The penalized logistic regression model with the same transformation of features also performs better than other models for specific patient subpopulations. Our study shows that MRSA risk prediction can be conducted quite effectively by machine learning methods using clinical and nonclinical features derived from EHR data. Network features are the most predictive and provide significant improvement over prior methods. Furthermore, heterogeneous prediction models for different patient subpopulations enhance the model’s performance.