Title: Dimension Reduction for Integrative Survival Analysis
Abstract

We propose a constrained maximum partial likelihood estimator for dimension reduction in integrative (e.g., pan-cancer) survival analysis with high-dimensional predictors. We assume that for each population in the study, the hazard function follows a distinct Cox proportional hazards model. To borrow information across populations, we assume that each of the hazard functions depends only on a small number of linear combinations of the predictors (i.e., “factors”). We estimate these linear combinations using an algorithm based on “distance-to-set” penalties. This allows us to impose both low-rankness and sparsity on the regression coefficient matrix estimator. We derive asymptotic results revealing that our estimator is more efficient than fitting a separate proportional hazards model for each population. Numerical experiments suggest that our method outperforms competitors under various data-generating models. We use our method to perform a pan-cancer survival analysis relating protein expression to survival across 18 distinct cancer types. Our approach identifies six linear combinations, depending on only 20 proteins, which explain survival across the cancer types. Finally, to validate our fitted model, we show that our estimated factors can lead to better prediction than competitors on four external datasets.
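As a rough illustration of the structure described above, the sketch below evaluates a stratified Cox negative log partial likelihood in which the p x K coefficient matrix is parameterized as Theta @ Gamma, the product of a shared loading matrix (whose columns define the "factors") and population-specific effects, so its rank is at most r. The data layout, names, and rank are illustrative assumptions; the paper's estimator additionally imposes sparsity on the loadings through distance-to-set penalties, which is not shown here.

```python
import numpy as np

def neg_log_partial_lik(Theta, Gamma, data):
    """Summed negative (Breslow) Cox partial log-likelihood across populations.

    Theta : (p, r) loading matrix; the shared factors are z = Theta.T @ x
    Gamma : (r, K) population-specific effects, so B = Theta @ Gamma has rank <= r
    data  : list of K tuples (X, time, event), with X of shape (n_k, p)
    """
    B = Theta @ Gamma                                # (p, K) coefficient matrix
    total = 0.0
    for k, (X, time, event) in enumerate(data):
        eta = X @ B[:, k]                            # linear predictors, population k
        order = np.argsort(-time)                    # descending time: risk sets are prefixes
        eta, event = eta[order], np.asarray(event)[order]
        log_risk = np.logaddexp.accumulate(eta)      # log of sum of exp(eta) over each risk set
        total -= np.sum(event * (eta - log_risk))
    return total

# Toy usage with simulated data (3 populations, 10 predictors, rank-2 structure).
rng = np.random.default_rng(0)
data = [(rng.normal(size=(50, 10)), rng.exponential(size=50),
         rng.integers(0, 2, size=50)) for _ in range(3)]
print(neg_log_partial_lik(rng.normal(size=(10, 2)), rng.normal(size=(2, 3)), data))
```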

 
Award ID(s):
2113589
NSF-PAR ID:
10486005
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
79
Issue:
3
ISSN:
0006-341X
Format(s):
Medium: X
Size(s):
p. 1610-1623
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    Survival analysis problems often involve dual timescales, most commonly calendar date and lifetime, the latter being the elapsed time since an initiating event such as a heart transplant. In our main example attention is focused on the hazard rate of ‘death’ as a function of calendar date. Three different estimates are discussed, one each from proportional hazards analyses on the lifetime and the calendar date scales, and one from a symmetric approach called here the ‘two-way proportional hazards model’, a multiplicative hazards model going back to Lexis in the 1870s. The three are connected through a Poisson generalized linear model for the Lexis diagram. The two-way model is shown to combine the information from the two ‘one-way’ proportional hazards analyses efficiently, at the cost of more extensive parametric modelling.

     
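    Below is a minimal sketch of the Poisson generalized linear model on the Lexis diagram referred to in the summary above. The data frame `lexis` is a hypothetical layout, one row per (lifetime band, calendar-period band) cell with an event count and person-time at risk; the two-way proportional hazards model then corresponds to additive lifetime and period effects on the log-hazard scale, with log exposure as an offset.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def fit_two_way(lexis: pd.DataFrame):
    """Two-way multiplicative hazards fit on a Lexis grid.

    Assumed columns: 'deaths' (event count per cell), 'exposure' (person-time),
    'lifetime' (lifetime band label), 'period' (calendar-date band label).
    """
    model = smf.glm(
        "deaths ~ C(lifetime) + C(period)",   # log-hazard additive in the two timescales
        data=lexis,
        family=sm.families.Poisson(),
        offset=np.log(lexis["exposure"]),     # person-time at risk enters as an offset
    )
    return model.fit()

# Dropping C(period) or C(lifetime) from the formula gives the two "one-way"
# proportional hazards analyses discussed above.
```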
  2. Abstract

    Survival models are used to analyze time-to-event data in a variety of disciplines. Proportional hazards models provide interpretable parameter estimates, but the proportional hazards assumption is not always appropriate. Non-parametric models are more flexible but often lack a clear inferential framework. We propose a Bayesian treed hazards partition model that is both flexible and inferential. Inference is obtained through the posterior tree structure, and flexibility is preserved by modeling the log-hazard function in each partition using a latent Gaussian process. An efficient reversible jump Markov chain Monte Carlo algorithm is obtained by marginalizing the parameters in each partition element via a Laplace approximation. Consistency properties for the estimator are established. The method can be used to help determine subgroups as well as prognostic and/or predictive biomarkers in time-to-event data. The method is compared with some existing methods on simulated data and a liver cirrhosis dataset.

     
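    The sketch below illustrates a single ingredient of the model described above: a Laplace approximation to the marginal likelihood of a Gaussian-process log-hazard within one partition element, using a piecewise-exponential (Poisson) likelihood on a time grid and an assumed squared-exponential kernel. It is not the tree prior or the reversible jump sampler, and all names and kernel choices are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf_kernel(t, ell=1.0, sig=1.0):
    d = t[:, None] - t[None, :]
    return sig**2 * np.exp(-0.5 * (d / ell) ** 2)

def laplace_log_marginal(deaths, exposure, t, ell=1.0, sig=1.0, jitter=1e-6):
    """Laplace-approximate log marginal likelihood of a GP prior on the log-hazard,
    given event counts `deaths` and person-time `exposure` on the grid `t`."""
    n = len(t)
    K = rbf_kernel(t, ell, sig) + jitter * np.eye(n)
    Kc = cho_factor(K)
    f = np.zeros(n)                               # log-hazard values on the grid
    for _ in range(50):                           # Newton iterations for the posterior mode
        W = exposure * np.exp(f)                  # negative Hessian of the Poisson log-likelihood
        grad = deaths - W                         # gradient of the Poisson log-likelihood
        step = np.linalg.solve(np.diag(W) + np.linalg.inv(K), grad - cho_solve(Kc, f))
        f = f + step
        if np.max(np.abs(step)) < 1e-8:
            break
    W = exposure * np.exp(f)
    loglik = np.sum(deaths * f - exposure * np.exp(f))    # up to the log-factorial constants
    sw = np.sqrt(W)
    B = np.eye(n) + sw[:, None] * K * sw[None, :]         # I + W^{1/2} K W^{1/2}
    logdet = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(B))))
    return loglik - 0.5 * f @ cho_solve(Kc, f) - 0.5 * logdet
```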
  3. Unlike standard prediction tasks, survival analysis requires modeling right-censored data, which must be treated with care. While deep neural networks excel in traditional supervised learning, it remains unclear how best to use these models in survival analysis. A key question is which data-generating assumptions of traditional survival models should be retained and which should be made more flexible via the function-approximating capabilities of neural networks. Rather than estimating the survival function targeted by most existing methods, we introduce a Deep Extended Hazard (DeepEH) model to provide a flexible and general framework for deep survival analysis. The extended hazard model includes the conventional Cox proportional hazards and accelerated failure time models as special cases, so DeepEH subsumes the popular Deep Cox proportional hazards (DeepSurv) and Deep Accelerated Failure Time (DeepAFT) models. We additionally provide theoretical support for the proposed DeepEH model by establishing the consistency and convergence rate of the survival function estimator, which underscores the attractive feature that deep learning can detect low-dimensional structure in high-dimensional data. Numerical experiments also provide evidence that the proposed methods outperform existing statistical and deep learning approaches to survival analysis.
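    A compact PyTorch sketch of the extended hazard structure described above, lambda(t | x) = lambda0(t * exp(g(x))) * exp(h(x)), with two small multilayer perceptrons standing in for g and h. Network sizes and the user-supplied baseline hazard are illustrative assumptions; constraining g to zero recovers a Cox-type (DeepSurv-like) model, and tying g = h recovers a deep AFT model, matching the special cases noted above.

```python
import torch
import torch.nn as nn

class ExtendedHazardNet(nn.Module):
    """Illustrative extended hazard model: log lambda(t|x) = log lambda0(t*exp(g(x))) + h(x)."""

    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        def mlp():
            return nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))
        self.g = mlp()   # time-rescaling network (g = 0 gives the Cox special case)
        self.h = mlp()   # hazard-rescaling network (h = g gives the AFT special case)

    def log_hazard(self, t, x, log_baseline):
        """Log hazard at times t given covariates x, for a supplied log baseline hazard."""
        g = self.g(x).squeeze(-1)
        h = self.h(x).squeeze(-1)
        return log_baseline(t * torch.exp(g)) + h
```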
  4. The ratio of the hazard functions of two populations or two strata of a single population plays an important role in time-to-event analysis. Cox regression is commonly used to estimate the hazard ratio under the assumption that it is constant in time, which is known as the proportional hazards assumption. However, this assumption is often violated in practice, and when it is violated, the parameter estimated by Cox regression is difficult to interpret. The hazard ratio can be estimated in a nonparametric manner using smoothing, but smoothing-based estimators are sensitive to the selection of tuning parameters, and it is often difficult to perform valid inference with such estimators. In some cases, it is known that the hazard ratio function is monotone. In this article, we demonstrate that monotonicity of the hazard ratio function defines an invariant stochastic order, and we study the properties of this order. Furthermore, we introduce an estimator of the hazard ratio function under a monotonicity constraint. We demonstrate that our estimator converges in distribution to a mean-zero limit, and we use this result to construct asymptotically valid confidence intervals. Finally, we conduct numerical studies to assess the finite-sample behavior of our estimator, and we use our methods to estimate the hazard ratio of progression-free survival in pulmonary adenocarcinoma patients treated with gefitinib or carboplatin-paclitaxel. 
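    The authors' constrained estimator and its limiting distribution are not reproduced here; the sketch below only conveys the idea of a monotone hazard-ratio curve, by taking the ratio of Nelson-Aalen cumulative-hazard increments on a common grid and projecting it onto nondecreasing functions with isotonic regression. Function and variable names are assumptions.

```python
import numpy as np
from lifelines import NelsonAalenFitter
from sklearn.isotonic import IsotonicRegression

def monotone_hazard_ratio(time0, event0, time1, event1, grid):
    """Crude nondecreasing hazard-ratio curve of group 1 versus group 0 on `grid`."""
    na0 = NelsonAalenFitter().fit(time0, event_observed=event0)
    na1 = NelsonAalenFitter().fit(time1, event_observed=event1)
    # Interpolate each cumulative hazard onto the grid; increments over grid cells
    # approximate the integrated hazard in each cell.
    H0 = np.interp(grid, na0.cumulative_hazard_.index.values,
                   na0.cumulative_hazard_.iloc[:, 0].values)
    H1 = np.interp(grid, na1.cumulative_hazard_.index.values,
                   na1.cumulative_hazard_.iloc[:, 0].values)
    raw_ratio = (np.diff(H1) + 1e-8) / (np.diff(H0) + 1e-8)   # pointwise ratio estimate
    mid = 0.5 * (grid[1:] + grid[:-1])
    iso = IsotonicRegression(increasing=True)                  # impose monotonicity
    return mid, iso.fit_transform(mid, raw_ratio)
```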
  5. Abstract

    In precision medicine, both predicting the disease susceptibility of an individual and forecasting their disease-free survival are areas of key research. Besides the classical epidemiological predictor variables, data from multiple (omic) platforms are increasingly available. To integrate this wealth of information, we propose new methodology to combine both cooperative learning, a recent approach to leverage the predictive power of several datasets, and polygenic hazard score models. Polygenic hazard score models provide a practitioner with a more differentiated view of the predicted disease-free survival than the one given by merely a point estimate, for instance computed with a polygenic risk score. Our aim is to leverage the advantages of cooperative learning for the computation of polygenic hazard score models via Cox’s proportional hazards model, thereby improving the prediction of disease-free survival. In our experimental study, we apply our methodology to forecast the disease-free survival for Alzheimer’s disease (AD) using three layers of data. One layer contains epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status and 10 leading principal components. Another layer contains selected genomic loci, and the last layer contains methylation data for selected CpG sites. We demonstrate that the survival curves computed via cooperative learning yield an AUC of around 0.7, above the state-of-the-art performance of its competitors. Importantly, the proposed methodology returns (1) a linear score that can be easily interpreted (in contrast to machine learning approaches), and (2) a weighting of the predictive power of the involved data layers, allowing for an assessment of the importance of each omic (or other) platform. Similarly to polygenic hazard score models, our methodology also allows one to compute individual survival curves for each patient.

     
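    As a toy illustration of the combination described above, the function below evaluates a cooperative-learning-style objective for a Cox-type hazard score built from several data layers: the per-layer linear scores are summed into one risk score entering a Cox partial likelihood, while an agreement penalty (weight rho) encourages the layer scores to align. This is only a sketch of the objective under assumed names, not the authors' fitting procedure.

```python
import numpy as np

def cooperative_cox_objective(betas, layers, time, event, rho=0.5):
    """betas: list of coefficient vectors, one per data layer in `layers` (each n x p_l)."""
    scores = [X @ b for X, b in zip(layers, betas)]        # per-layer linear scores
    eta = np.sum(scores, axis=0)                           # combined polygenic hazard score
    order = np.argsort(-time)                              # descending time: risk sets are prefixes
    eta_s, event_s = eta[order], np.asarray(event)[order]
    log_risk = np.logaddexp.accumulate(eta_s)
    neg_partial_lik = -np.sum(event_s * (eta_s - log_risk))
    agreement = sum(np.sum((scores[i] - scores[j]) ** 2)   # encourage layers to agree
                    for i in range(len(scores))
                    for j in range(i + 1, len(scores)))
    return neg_partial_lik + 0.5 * rho * agreement
```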