Title: Dimension Reduction for Integrative Survival Analysis
Abstract: We propose a constrained maximum partial likelihood estimator for dimension reduction in integrative (e.g., pan-cancer) survival analysis with high-dimensional predictors. We assume that for each population in the study, the hazard function follows a distinct Cox proportional hazards model. To borrow information across populations, we assume that each hazard function depends only on a small number of linear combinations of the predictors (i.e., “factors”). We estimate these linear combinations using an algorithm based on “distance-to-set” penalties, which allows us to impose both low-rankness and sparsity on the regression coefficient matrix estimator. We derive asymptotic results showing that our estimator is more efficient than fitting a separate proportional hazards model for each population. Numerical experiments suggest that our method outperforms competitors under various data-generating models. We use our method to perform a pan-cancer survival analysis relating protein expression to survival across 18 distinct cancer types. Our approach identifies six linear combinations, depending on only 20 proteins, which explain survival across the cancer types. Finally, to validate our fitted model, we show that our estimated factors can lead to better prediction than competitors on four external datasets.
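To make the structural assumption concrete, here is a minimal sketch of the kind of model described, with notation (number of populations K, number of predictors p, rank r, and factor matrices A and Γ) assumed for illustration rather than taken from the paper:

\[
\lambda_k(t \mid x) \;=\; \lambda_{0k}(t)\,\exp\!\bigl(x^\top \beta_k\bigr), \qquad k = 1, \dots, K,
\qquad
\beta_k \;=\; A\,\gamma_k, \quad A \in \mathbb{R}^{p \times r}, \; r \ll p,
\]

so the coefficient matrix \(B = [\beta_1, \dots, \beta_K] = A\Gamma\) has rank at most \(r\) (low-rankness) and, if \(A\) has only a few nonzero rows, only a few predictors enter the shared factors \(A^\top x\) (sparsity).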
Award ID(s):
2113589 2415067
PAR ID:
10486005
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
79
Issue:
3
ISSN:
0006-341X
Format(s):
Medium: X
Size(s):
p. 1610-1623
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract: Survival models are used to analyze time-to-event data in a variety of disciplines. Proportional hazards models provide interpretable parameter estimates, but the proportional hazards assumption is not always appropriate. Non-parametric models are more flexible but often lack a clear inferential framework. We propose a Bayesian treed hazards partition model that is both flexible and inferential. Inference is obtained through the posterior tree structure, and flexibility is preserved by modeling the log-hazard function in each partition using a latent Gaussian process. An efficient reversible jump Markov chain Monte Carlo algorithm is obtained by marginalizing the parameters in each partition element via a Laplace approximation. Consistency properties for the estimator are established. The method can be used to help determine subgroups as well as prognostic and/or predictive biomarkers in time-to-event data. The method is compared with some existing methods on simulated data and a liver cirrhosis dataset.
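As a rough schematic of this construction (the exact partition structure and notation are assumed here for illustration, not taken from the paper): a tree induces a partition \(\{\mathcal{P}_j\}_{j=1}^{J}\) of the time-covariate space, and within each element the log-hazard is given a latent Gaussian process prior,

\[
\log \lambda(t \mid x) \;=\; \sum_{j=1}^{J} g_j(t, x)\,\mathbf{1}\{(t, x) \in \mathcal{P}_j\}, \qquad g_j \sim \mathrm{GP}(m_j, K_j),
\]

so the posterior over trees carries the inference while the Gaussian processes carry the flexibility; reversible jump MCMC moves between trees of different sizes, with the within-partition parameters integrated out via a Laplace approximation.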
  2. Unlike standard prediction tasks, survival analysis requires modeling right-censored data, which must be treated with care. While deep neural networks excel in traditional supervised learning, it remains unclear how to best utilize these models in survival analysis. A key question is which data-generating assumptions of traditional survival models should be retained and which should be made more flexible via the function-approximating capabilities of neural networks. Rather than estimating the survival function targeted by most existing methods, we introduce a Deep Extended Hazard (DeepEH) model to provide a flexible and general framework for deep survival analysis. The extended hazard model includes the conventional Cox proportional hazards and accelerated failure time models as special cases, so DeepEH subsumes the popular Deep Cox proportional hazards (DeepSurv) and Deep Accelerated Failure Time (DeepAFT) models. We additionally provide theoretical support for the proposed DeepEH model by establishing the consistency and convergence rate of the survival function estimator, which underscores the attractive feature that deep learning can detect low-dimensional structure in high-dimensional data. Numerical experiments also provide evidence that the proposed methods outperform existing statistical and deep learning approaches to survival analysis.
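For concreteness, the extended hazard family referred to here can be written with an unspecified baseline hazard \(\lambda_0\) and two covariate effects \(g_1, g_2\) (in DeepEH these would be parameterized by neural networks; the notation is illustrative):

\[
\lambda(t \mid x) \;=\; \lambda_0\!\bigl(t\,e^{g_1(x)}\bigr)\,e^{g_2(x)},
\]

which reduces to the Cox proportional hazards model when \(g_1 \equiv 0\) and to the accelerated failure time model when \(g_1 \equiv g_2\).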
  3. The ratio of the hazard functions of two populations or two strata of a single population plays an important role in time-to-event analysis. Cox regression is commonly used to estimate the hazard ratio under the assumption that it is constant in time, which is known as the proportional hazards assumption. However, this assumption is often violated in practice, and when it is violated, the parameter estimated by Cox regression is difficult to interpret. The hazard ratio can be estimated in a nonparametric manner using smoothing, but smoothing-based estimators are sensitive to the selection of tuning parameters, and it is often difficult to perform valid inference with such estimators. In some cases, it is known that the hazard ratio function is monotone. In this article, we demonstrate that monotonicity of the hazard ratio function defines an invariant stochastic order, and we study the properties of this order. Furthermore, we introduce an estimator of the hazard ratio function under a monotonicity constraint. We demonstrate that our estimator converges in distribution to a mean-zero limit, and we use this result to construct asymptotically valid confidence intervals. Finally, we conduct numerical studies to assess the finite-sample behavior of our estimator, and we use our methods to estimate the hazard ratio of progression-free survival in pulmonary adenocarcinoma patients treated with gefitinib or carboplatin-paclitaxel. 
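In symbols, writing \(\lambda_1\) and \(\lambda_0\) for the hazard functions of the two groups (notation assumed for illustration), the target is the hazard ratio function

\[
\theta(t) \;=\; \frac{\lambda_1(t)}{\lambda_0(t)},
\]

estimated under the constraint that \(\theta\) is monotone in \(t\); the proportional hazards assumption is the special case in which \(\theta(t)\) is constant.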
  4. Abstract: Many popular survival models rely on restrictive parametric, or semiparametric, assumptions that could provide erroneous predictions when the effects of covariates are complex. Modern advances in computational hardware have led to an increasing interest in flexible Bayesian nonparametric methods for time-to-event data such as Bayesian additive regression trees (BART). We propose a novel approach that we call nonparametric failure time (NFT) BART in order to increase the flexibility beyond accelerated failure time (AFT) and proportional hazard models. NFT BART has three key features: (1) a BART prior for the mean function of the event time logarithm; (2) a heteroskedastic BART prior to deduce a covariate-dependent variance function; and (3) a flexible nonparametric error distribution using Dirichlet process mixtures (DPM). Our proposed approach widens the scope of hazard shapes including nonproportional hazards, can be scaled up to large sample sizes, naturally provides estimates of uncertainty via the posterior and can be seamlessly employed for variable selection. We provide convenient, user-friendly, computer software that is freely available as a reference implementation. Simulations demonstrate that NFT BART maintains excellent performance for survival prediction especially when AFT assumptions are violated by heteroskedasticity. We illustrate the proposed approach on a study examining predictors for mortality risk in patients undergoing hematopoietic stem cell transplant (HSCT) for blood-borne cancer, where heteroskedasticity and nonproportional hazards are likely present.
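A compact way to read the three components (notation assumed for illustration): with \(T\) the event time and \(x\) the covariates,

\[
\log T \;=\; f(x) + s(x)\,\varepsilon, \qquad f \sim \text{BART}, \quad s \sim \text{heteroskedastic BART}, \quad \varepsilon \sim \text{DPM},
\]

so the classical AFT model is recovered when \(s(x)\) is constant and \(\varepsilon\) follows a single parametric distribution.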
  5. Background: Metastatic cancer remains one of the leading causes of cancer-related mortality worldwide. Yet, the prediction of survivability in this population remains limited by heterogeneous clinical presentations and high-dimensional molecular features. Advances in machine learning (ML) provide an opportunity to integrate diverse patient- and tumor-level factors into explainable predictive ML models. Leveraging large real-world datasets and modern ML techniques can enable improved risk stratification and precision oncology. Objective: This study aimed to develop and interpret ML models for predicting overall survival in patients with metastatic cancer using the Memorial Sloan Kettering-Metastatic (MSK-MET) dataset and to identify key prognostic biomarkers through explainable artificial intelligence techniques. Methods: We performed a retrospective analysis of the MSK-MET cohort, comprising 25,775 patients across 27 tumor types. After data cleaning and balancing, 20,338 patients were included. Overall survival was defined as deceased versus living at last follow-up. Five classifiers (extreme gradient boosting [XGBoost], logistic regression, random forest, decision tree, and naive Bayes) were trained using an 80/20 stratified split and optimized via grid search with 5-fold cross-validation. Model performance was assessed using accuracy, area under the curve (AUC), precision, recall, and F1-score. Model explainability was achieved using Shapley additive explanations (SHAP). Survival analyses included Kaplan-Meier estimates, Cox proportional hazards models, and an XGBoost-Cox model for time-to-event prediction. The positive predictive value and negative predictive value were calculated at the Youden index–optimal threshold. Results: XGBoost achieved the highest performance (accuracy=0.74; AUC=0.82), outperforming other classifiers. In survival analyses, the XGBoost-Cox model with a concordance index (C-index) of 0.70 exceeded the traditional Cox model (C-index=0.66). SHAP analysis and Cox models consistently identified metastatic site count, tumor mutational burden, fraction of genome altered, and the presence of distant liver and bone metastases as among the strongest prognostic factors, a pattern that held at both the pan-cancer level and recurrently across cancer-specific models. At the cancer-specific level, performance varied; prostate cancer achieved the highest predictive accuracy (AUC=0.88), while pancreatic cancer was notably more challenging (AUC=0.68). Kaplan-Meier analyses demonstrated marked survival separation between patients with and without metastases (80-month survival: approximately 0.80 vs 0.30). At the Youden-optimal threshold, positive predictive value and negative predictive value were approximately 70% and 80%, respectively, supporting clinical use for risk stratification. Conclusions: Explainable ML models, particularly XGBoost combined with SHAP, can strongly predict survivability in metastatic cancers while highlighting clinically meaningful features. These findings support the use of ML-based tools for patient counseling, treatment planning, and integration into precision oncology workflows. Future work should include external validation on independent cohorts, integration with electronic health records via Fast Healthcare Interoperability Resources–based dashboards, and prospective clinician-in-the-loop evaluation to assess real-world use.
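To illustrate the kind of pipeline described (an XGBoost classifier tuned by cross-validated grid search and explained with SHAP), here is a minimal sketch in Python; the file name, column names, and hyperparameter grid are hypothetical placeholders, not MSK-MET specifics.

```python
# Sketch of a deceased-vs-living classifier with grid-searched XGBoost and SHAP.
# Assumes a cleaned cohort table with numeric features and a binary "deceased" label.
import pandas as pd
import shap
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

df = pd.read_csv("msk_met_cohort.csv")     # hypothetical cleaned cohort file
X = df.drop(columns=["deceased"])          # clinical + genomic features (already encoded)
y = df["deceased"]                         # 1 = deceased, 0 = living at last follow-up

# 80/20 stratified split, as in the study design
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Grid search with 5-fold cross-validation over a small illustrative grid
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 5],
                "n_estimators": [200, 400],
                "learning_rate": [0.05, 0.1]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_tr, y_tr)
model = grid.best_estimator_

print("test AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))

# SHAP values for global feature importance (e.g., metastatic site count, TMB)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
shap.summary_plot(shap_values, X_te)
```

TreeExplainer exploits the tree structure of gradient-boosted models, so exact SHAP values can be computed much faster than with model-agnostic explainers, which is what makes this kind of per-feature attribution practical on cohorts of this size.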