The Fine‐Gray proportional sub‐distribution hazards (PSH) model is among the most popular regression models for competing risks time‐to‐event data. This article develops a fast safe feature elimination method, named PSH‐SAFE, for fitting the penalized Fine‐Gray PSH model with a Lasso (or adaptive Lasso) penalty. Our PSH‐SAFE procedure is straightforward to implement, fast, and scales well to ultrahigh dimensional data. We also show that, as a feature screening procedure, PSH‐SAFE is safe in the sense that the eliminated features are guaranteed to be inactive features in the original Lasso (or adaptive Lasso) estimator for the penalized PSH model. We evaluate the performance of the PSH‐SAFE procedure in terms of computational efficiency, screening efficiency and safety, run‐time, and prediction accuracy on multiple simulated datasets and a real bladder cancer dataset. Our empirical results show that the PSH‐SAFE procedure possesses desirable screening efficiency and safety properties and can offer substantially improved computational efficiency as well as similar or better prediction performance in comparison to its baseline competitors.
Cooperative learning for multiview analysis
We propose a method for supervised learning with multiple sets of features (“views”). The multiview problem is especially important in biology and medicine, where “-omics” data, such as genomics, proteomics, and radiomics, are measured on a common set of samples. “Cooperative learning” combines the usual squared-error loss of predictions with an “agreement” penalty to encourage the predictions from different data views to agree. By varying the weight of the agreement penalty, we get a continuum of solutions that include the well-known early and late fusion approaches. Cooperative learning chooses the degree of agreement (or fusion) in an adaptive manner, using a validation set or cross-validation to estimate test set prediction error. One version of our fitting procedure is modular, where one can choose different fitting mechanisms (e.g., lasso, random forests, boosting, or neural networks) appropriate for different data views. In the setting of cooperative regularized linear regression, the method combines the lasso penalty with the agreement penalty, yielding feature sparsity. The method can be especially powerful when the different data views share some underlying relationship in their signals that can be exploited to boost the signals. We show that cooperative learning achieves higher predictive accuracy on simulated data and real multiomics examples of labor-onset prediction. By leveraging aligned signals and allowing flexible fitting mechanisms for different modalities, cooperative learning offers a powerful approach to multiomics data fusion.
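The combination of squared-error loss and agreement penalty can be illustrated for two views in the linear case. The sketch below is not the authors' implementation: it assumes the objective takes the form 0.5*||y - X@tx - Z@tz||^2 + 0.5*rho*||X@tx - Z@tz||^2, omits the lasso penalty to stay dependency-light, and uses hypothetical names (`cooperative_least_squares`, `tx`, `tz`, `rho`). Under that assumption, the problem can be solved as ordinary least squares on an augmented data set.

```python
import numpy as np

def cooperative_least_squares(X, Z, y, rho):
    """Two-view linear fit with an agreement penalty (no sparsity penalty).

    Assumed objective (hypothetical notation):
        0.5*||y - X@tx - Z@tz||^2 + 0.5*rho*||X@tx - Z@tz||^2
    Stacking the rows [-sqrt(rho)*X, sqrt(rho)*Z] under [X, Z], with zeros
    appended to y, turns this into a plain least-squares problem.
    """
    n, px = X.shape
    s = np.sqrt(rho)
    X_aug = np.vstack([np.hstack([X, Z]), np.hstack([-s * X, s * Z])])
    y_aug = np.concatenate([y, np.zeros(n)])
    theta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return theta[:px], theta[px:]

def objective(X, Z, y, tx, tz, rho):
    """Evaluate the assumed cooperative objective at (tx, tz)."""
    fit = y - X @ tx - Z @ tz
    gap = X @ tx - Z @ tz
    return 0.5 * fit @ fit + 0.5 * rho * gap @ gap

# Simulated views that share a latent signal u (illustrative data only).
rng = np.random.default_rng(0)
n, px, pz = 200, 5, 4
u = rng.normal(size=n)
X = u[:, None] + rng.normal(size=(n, px))
Z = u[:, None] + rng.normal(size=(n, pz))
y = u + 0.5 * rng.normal(size=n)

for rho in (0.0, 0.5, 1.0):
    tx, tz = cooperative_least_squares(X, Z, y, rho)
    disagreement = np.mean((X @ tx - Z @ tz) ** 2)
    print(f"rho={rho}: mean squared disagreement = {disagreement:.4f}")
```

With `rho = 0` the fit reduces to ordinary regression on the concatenated views, which is early fusion; larger values of `rho` pull the two views' predictions toward each other. In practice `rho` would be chosen on a validation set or by cross-validation, as the abstract describes.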
- Award ID(s): 2113389
- PAR ID: 10355842
- Date Published:
- Journal Name: Proceedings of the National Academy of Sciences
- Volume: 119
- Issue: 38
- ISSN: 0027-8424
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
In spite of its urgent importance in the era of big data, testing high-dimensional parameters in generalized linear models (GLMs) in the presence of high-dimensional nuisance parameters has been largely under-studied, especially with regard to constructing powerful tests for general (and unknown) alternatives. Most existing tests are powerful only against certain alternatives and may yield incorrect Type I error rates under high-dimensional nuisance parameter situations. In this paper, we propose the adaptive interaction sum of powered score (aiSPU) test in the framework of penalized regression with a non-convex penalty, called truncated Lasso penalty (TLP), which can maintain correct Type I error rates while yielding high statistical power across a wide range of alternatives. To calculate its p-values analytically, we derive its asymptotic null distribution. Simulations demonstrate its superior finite-sample performance over several representative existing methods. In addition, we apply it and other representative tests to an Alzheimer’s Disease Neuroimaging Initiative (ADNI) data set, detecting possible gene-gender interactions for Alzheimer’s disease. We also make the R package “aispu”, which implements the proposed test, available on GitHub.
-
In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data.
-
In cold climates, ice formation on wind turbines reduces the power produced by a wind farm. This paper introduces a framework to predict icing at the farm level based on our recently developed Temporal Convolutional Network prediction model for a single turbine using SCADA data. First, a cross-validation study is carried out to evaluate the extent to which predictors trained on a single turbine of a wind farm can be used to predict icing on the other turbines of the wind farm. This fusion approach combines multiple turbines, thereby providing predictions at the wind farm level. This study shows that such a fusion approach improves prediction accuracy and decreases fluctuations across different prediction horizons when compared with single-turbine prediction. Two approaches are considered to conduct farm-level icing prediction: decision fusion and feature fusion. In decision fusion, icing prediction decisions from individual turbines are combined by majority voting. In feature fusion, features of individual turbines are averaged before prediction is conducted. The results indicate that both the decision fusion and feature fusion approaches generate farm-level icing prediction accuracies that are 7% higher, with lower standard deviations or fluctuations across different prediction horizons, when compared with predictions for a single turbine.
-
Four statistical selection methods for inferring transcription factor (TF)–target gene (TG) pairs were developed by coupling a mean squared error (MSE) or Huber loss function with an elastic net (ENET) or least absolute shrinkage and selection operator (Lasso) penalty. Two methods were also developed for inferring pathway gene regulatory networks (GRNs) by combining a Huber or MSE loss function with a network (Net)-based penalty. To solve these regressions, we improved an accelerated proximal gradient descent (APGD) algorithm to optimize the parameter selection process, resulting in an equally effective but much faster algorithm than the commonly used convex optimization solver. Synthetic data generated in a general setting were used to test the four TF–TG identification methods; ENET-based methods performed better than Lasso-based methods. Synthetic data generated from two network settings were used to test Huber-Net and MSE-Net, which outperformed all other methods. The TF–TG identification methods were also tested with SND1 and gl3 overexpression transcriptomic data; Huber-ENET and MSE-ENET outperformed all other methods when genome-wide predictions were performed. The TF–TG identification methods fill the need for a method for genome-wide TG prediction of a TF and have potential for validating ChIP/DAP-seq results, while the two Net-based methods are instrumental for predicting pathway GRNs.

