

Title: Omnibus Risk Assessment via Accelerated Failure Time Kernel Machine Modeling
Summary

Integrating genomic information with traditional clinical risk factors to improve the prediction of disease outcomes could profoundly change the practice of medicine. However, the large number of potential markers and possible complexity of the relationship between markers and disease make it difficult to construct accurate risk prediction models. Standard approaches for identifying important markers often rely on marginal associations or linearity assumptions and may not capture non-linear or interactive effects. In recent years, much work has been done to group genes into pathways and networks. Integrating such biological knowledge into statistical learning could potentially improve model interpretability and reliability. One effective approach is to employ a kernel machine (KM) framework, which can capture nonlinear effects if nonlinear kernels are used (Scholkopf and Smola, 2002; Liu et al., 2007, 2008). For survival outcomes, KM regression modeling and testing procedures have been derived under a proportional hazards (PH) assumption (Li and Luan, 2003; Cai, Tonini, and Lin, 2011). In this article, we derive testing and prediction methods for KM regression under the accelerated failure time (AFT) model, a useful alternative to the PH model. We approximate the null distribution of our test statistic using resampling procedures. When multiple kernels are of potential interest, it may be unclear in advance which kernel to use for testing and estimation. We propose a robust Omnibus Test that combines information across kernels, and an approach for selecting the best kernel for estimation. The methods are illustrated with an application in breast cancer.
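Since the summary describes the procedure only at a high level, the following sketch illustrates the general omnibus idea on simulated data: form a kernel score-type statistic r'Kr from null-model residuals for each candidate kernel, obtain per-kernel p-values by resampling, and calibrate the minimum p-value across kernels against the same resampling distribution. The sketch deliberately ignores censoring and the AFT estimating equations developed in the paper; the simulated data, the two kernels, and the permutation scheme are illustrative assumptions, not the authors' implementation.

```python
# Simplified omnibus kernel score test on uncensored, simulated data (not the
# paper's censored-AFT procedure): residuals from a null model with clinical
# covariates only, a score-type statistic r'Kr per kernel, and a permutation
# calibration of the per-kernel p-values and their minimum (the omnibus test).
import numpy as np

rng = np.random.default_rng(0)

# toy data: n subjects, clinical covariates X, pathway markers G, log survival times
n, p_clin, p_gene = 120, 2, 10
X = rng.normal(size=(n, p_clin))
G = rng.normal(size=(n, p_gene))
log_t = X @ np.array([0.5, -0.3]) + 0.4 * np.sin(G[:, 0]) + rng.normal(scale=0.5, size=n)

def linear_kernel(G):
    return G @ G.T

def gaussian_kernel(G, bandwidth=None):
    d2 = ((G[:, None, :] - G[None, :, :]) ** 2).sum(-1)
    if bandwidth is None:
        bandwidth = np.median(d2[d2 > 0])          # median heuristic
    return np.exp(-d2 / bandwidth)

kernels = {"linear": linear_kernel(G), "gaussian": gaussian_kernel(G)}

# residuals from the null model (clinical covariates only, no pathway effect)
Xd = np.column_stack([np.ones(n), X])
H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)
r = log_t - H @ log_t

def score_stat(r, K):
    return float(r @ K @ r)

B = 2000
obs = {name: score_stat(r, K) for name, K in kernels.items()}
perm = {name: np.empty(B) for name in kernels}
for b in range(B):
    rp = rng.permutation(r)                        # permutation stand-in for resampling
    for name, K in kernels.items():
        perm[name][b] = score_stat(rp, K)

pvals = {name: (perm[name] >= obs[name]).mean() for name in kernels}

# omnibus: calibrate the minimum p-value against its own permutation distribution
perm_p = {name: 1.0 - np.argsort(np.argsort(perm[name])) / B for name in kernels}
min_p_obs = min(pvals.values())
min_p_perm = np.minimum.reduce([perm_p[name] for name in kernels])
omnibus_p = (min_p_perm <= min_p_obs).mean()

print("per-kernel p-values:", pvals)
print("omnibus p-value:", round(float(omnibus_p), 4))
```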

 
NSF-PAR ID: 10484462
Publisher / Repository: Oxford University Press
Journal Name: Biometrics
Volume: 69
Issue: 4
ISSN: 0006-341X
Pages: p. 861-873
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Background

    Lung cancer is the deadliest and second most common cancer in the United States, partly because early-stage disease often produces no symptoms that would prompt diagnosis. Pulmonary nodules are small abnormal regions that can be correlated with the occurrence of lung cancer. Early detection of these nodules is critical because it can significantly improve patients' survival rates. Thoracic thin‐sliced computed tomography (CT) scanning has emerged as a widely used method for the diagnosis and prognosis of lung abnormalities.

    Purpose

    The standard clinical workflow for detecting pulmonary nodules relies on radiologists analyzing CT images to assess the risk factors of cancerous nodules. However, this approach can be error‐prone because nodules have varied causes, such as pollutants and infections. Deep learning (DL) algorithms have recently demonstrated remarkable success in medical image classification and segmentation. As DL becomes an ever more important assistant to radiologists in nodule detection, it is imperative to ensure that the DL algorithm and the radiologist can understand each other's decisions. This study aims to develop a framework integrating explainable AI methods to achieve accurate pulmonary nodule detection.

    Methods

    A robust and explainable detection (RXD) framework is proposed, focusing on reducing false positives in pulmonary nodule detection. Its implementation is based on an explanation supervision method, which uses radiologists' nodule contours as supervision signals to force the model to learn nodule morphologies, improving its ability to learn from small datasets (a generic sketch of this kind of loss follows this abstract). In addition, two imputation methods are applied to the nodule region annotations to reduce the noise within human annotations and to give the model robust attributions that meet human expectations. Sets of 480, 265, and 265 CT images from the public Lung Image Database Consortium and Image Database Resource Initiative (LIDC‐IDRI) dataset are used for training, validation, and testing.

    Results

    Using only 10, 30, 50, and 100 training samples sequentially, our method consistently improves the classification performance and explanation quality of the baseline in terms of Area Under the Curve (AUC) and Intersection over Union (IoU). In particular, our framework with a learnable imputation kernel improves IoU over the baseline by 24.0% to 80.0%. A pre‐defined Gaussian imputation kernel achieves an even greater improvement of 38.4% to 118.8% over the baseline. Compared to the baseline trained on 100 samples, our method shows less drop in AUC when trained on fewer samples. A comprehensive comparison of interpretability shows that our method aligns better with expert opinions.

    Conclusions

    A pulmonary nodule detection framework was demonstrated using public thoracic CT image datasets. The framework integrates the robust explanation supervision (RES) technique to ensure good performance in both nodule classification and morphology learning. The method can reduce the workload of radiologists and enable them to focus on the early diagnosis and prognosis of potentially cancerous pulmonary nodules, improving outcomes for lung cancer patients.

     
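    The explanation supervision idea described in the Methods above can be illustrated with a generic training loss that combines a classification term with a term aligning a model saliency map to the radiologist-drawn contour mask. The sketch below shows only that generic idea on toy tensors; the tiny network, the input-gradient saliency, the loss weight, and the random data are placeholder assumptions and do not reproduce the paper's RES framework or its imputation kernels.

    ```python
# Generic explanation-supervision loss: classification loss plus a penalty that
# aligns an input-gradient saliency map with the radiologist's nodule mask.
# Everything here (architecture, weights, data) is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNoduleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1)).squeeze(1)

def explanation_supervised_loss(model, images, labels, masks, lam=0.5):
    images = images.clone().requires_grad_(True)
    logits = model(images)
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels)
    # saliency = magnitude of d(logit)/d(pixel), kept in the graph so the
    # alignment term also back-propagates into the model weights
    grads = torch.autograd.grad(logits.sum(), images, create_graph=True)[0]
    saliency = grads.abs()
    saliency = saliency / (saliency.amax(dim=(2, 3), keepdim=True) + 1e-8)
    expl_loss = F.mse_loss(saliency, masks)  # masks: 1 inside the contour, 0 outside
    return cls_loss + lam * expl_loss

# toy usage with random tensors standing in for CT patches and contour masks
model = TinyNoduleNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.rand(4, 1, 64, 64)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])
masks = (torch.rand(4, 1, 64, 64) > 0.9).float()

loss = explanation_supervised_loss(model, images, labels, masks)
opt.zero_grad()
loss.backward()
opt.step()
print("combined loss:", float(loss))
    ```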
  2. Abstract

    With the rapid growth of modern technology, many biomedical studies are being conducted to collect massive datasets with volumes of multi‐modality imaging, genetic, neurocognitive and clinical information from increasingly large cohorts. Simultaneously extracting and integrating rich and diverse heterogeneous information in neuroimaging and/or genomics from these big datasets could transform our understanding of how genetic variants impact brain structure and function, cognitive function and brain‐related disease risk across the lifespan. Such understanding is critical for diagnosis, prevention and treatment of numerous complex brain‐related disorders (e.g., schizophrenia and Alzheimer's disease). However, the development of analytical methods for the joint analysis of both high‐dimensional imaging phenotypes and high‐dimensional genetic data, a big data squared (BD2) problem, presents major computational and theoretical challenges for existing analytical methods. Besides the high‐dimensional nature of BD2, various neuroimaging measures often exhibit strong spatial smoothness and dependence, and genetic markers may have a natural dependence structure arising from linkage disequilibrium. We review some recent developments of various statistical techniques for imaging genetics, including massive univariate and voxel‐wise approaches, reduced rank regression, mixture models and group sparse multi‐task regression. By doing so, we hope that this review may encourage others in the statistical community to enter into this new and exciting field of research. The Canadian Journal of Statistics 47: 108–131; 2019 © 2019 Statistical Society of Canada

     
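    As a concrete illustration of the massive univariate, voxel-wise strategy named in the review above, the sketch below tests every (imaging measure, genetic marker) pair separately on simulated data and applies a Bonferroni correction over all pairs. Real imaging-genetics analyses adjust for covariates and model the spatial and linkage-disequilibrium structure discussed above; everything here is a simplified assumption for illustration.

    ```python
# Massive univariate, voxel-wise imaging-genetics scan on simulated data:
# test each (voxel, SNP) pair marginally and Bonferroni-correct over all pairs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_voxels, n_snps = 200, 200, 100

snps = rng.binomial(2, 0.3, size=(n, n_snps)).astype(float)  # additive 0/1/2 coding
voxels = rng.normal(size=(n, n_voxels))
voxels[:, 0] += 0.4 * snps[:, 0]                             # one planted association

pvals = np.empty((n_voxels, n_snps))
for j in range(n_snps):
    for v in range(n_voxels):
        pvals[v, j] = stats.pearsonr(voxels[:, v], snps[:, j])[1]

alpha = 0.05 / (n_voxels * n_snps)                           # Bonferroni over all pairs
hits = np.argwhere(pvals < alpha)
print("significant (voxel, SNP) pairs:", hits.tolist())
    ```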
  3. Summary

    We consider a functional linear Cox regression model for characterizing the association between time-to-event data and a set of functional and scalar predictors. The functional linear Cox regression model incorporates a functional principal component analysis for modeling the functional predictors and a high-dimensional Cox regression model to characterize the joint effects of both functional and scalar predictors on the time-to-event data. We develop an algorithm to calculate the maximum approximate partial likelihood estimates of unknown finite and infinite dimensional parameters. We also systematically investigate the rate of convergence of the maximum approximate partial likelihood estimates and a score test statistic for testing the nullity of the slope function associated with the functional predictors. We demonstrate our estimation and testing procedures by using simulations and the analysis of the Alzheimer's Disease Neuroimaging Initiative (ADNI) data. Our real data analyses show that high-dimensional hippocampus surface data may be an important marker for predicting time to conversion to Alzheimer's disease. Data used in the preparation of this article were obtained from the ADNI database (adni.loni.usc.edu).

     
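    A rough sketch of the two-step strategy described above, on simulated data: approximate functional principal component analysis by ordinary PCA on densely observed curves, then enter the leading scores together with a scalar covariate in a Cox model (fitted here with the lifelines package). The paper's maximum approximate partial likelihood estimator, its asymptotic theory, and the score test are not reproduced; the simulated curves, covariate, and censoring mechanism are assumptions made purely for illustration.

    ```python
# Two-step sketch: PCA on densely observed curves as a stand-in for FPCA,
# then a Cox model on the leading FPC scores plus a scalar covariate.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n, n_grid = 300, 50
grid = np.linspace(0, 1, n_grid)

# functional predictor X_i(t) = a_i sin(2*pi*t) + b_i cos(2*pi*t) + noise
a, b = rng.normal(size=n), rng.normal(size=n)
curves = a[:, None] * np.sin(2 * np.pi * grid) + b[:, None] * np.cos(2 * np.pi * grid)
curves += rng.normal(scale=0.2, size=(n, n_grid))
age = rng.normal(70, 8, size=n)                       # scalar covariate

scores = PCA(n_components=3).fit_transform(curves)    # step 1: approximate FPCA

# step 2: Cox regression on FPC scores + scalar covariate with censored times
risk = 0.6 * scores[:, 0] + 0.02 * (age - 70)
event_time = rng.exponential(np.exp(-risk))
censor_time = rng.exponential(1.5, size=n)
df = pd.DataFrame({
    "time": np.minimum(event_time, censor_time),
    "event": (event_time <= censor_time).astype(int),
    "fpc1": scores[:, 0], "fpc2": scores[:, 1], "fpc3": scores[:, 2],
    "age": age,
})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()
    ```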
  4. Background

    Heart failure is a leading cause of mortality and morbidity worldwide. Acute heart failure, broadly defined as rapid onset of new or worsening signs and symptoms of heart failure, often requires hospitalization and admission to the intensive care unit (ICU). This acute condition is highly heterogeneous and less well understood than chronic heart failure. The ICU, through detailed and continuously monitored patient data, provides an opportunity to retrospectively analyze decompensation and heart failure to evaluate physiological states and patient outcomes.

    Objective

    The goal of this study is to examine the prevalence of cardiovascular risk factors among those admitted to ICUs and to evaluate combinations of clinical features that are predictive of decompensation events, such as the onset of acute heart failure, using machine learning techniques. To accomplish this objective, we leveraged tele-ICU data from over 200 hospitals across the United States.

    Methods

    We evaluated the feasibility of predicting decompensation soon after ICU admission for 26,534 patients admitted without a history of heart failure but with specific heart failure risk factors (i.e., coronary artery disease, hypertension, and myocardial infarction) and 96,350 patients admitted without risk factors, using remotely monitored laboratory values, vital signs, and discrete physiological measurements. Multivariate logistic regression and random forest models were applied to predict decompensation and highlight important features from combinations of model inputs from dissimilar data (a schematic version of this setup is sketched below).

    Results

    The most prevalent risk factor in our data set was hypertension, although most patients diagnosed with heart failure were admitted to the ICU without a risk factor. The highest heart failure prediction accuracy was 0.951, and the highest area under the receiver operating characteristic curve was 0.9503, achieved with random forest and combined vital signs, laboratory values, and discrete physiological measurements. Random forest feature importance also highlighted combinations of several discrete physiological features and laboratory measures as most indicative of decompensation. Timeline analysis of aggregate vital signs revealed a point of diminishing returns where additional vital signs data did not continue to improve results.

    Conclusions

    Heart failure risk factors are common in tele-ICU data, although most patients who are diagnosed with heart failure later in an ICU stay presented without risk factors, making prediction of decompensation critical. Decompensation was predicted with reasonable accuracy using tele-ICU data, and optimal data extraction for time-series vital signs was identified near a 200-minute window size. Overall, the results suggest that combinations of laboratory measurements and vital signs are viable for early and continuous prediction of patient decompensation.
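    The modelling setup in the Methods above can be sketched schematically as follows: stack vital-sign, laboratory, and discrete physiological features into one design matrix, fit logistic regression and a random forest, compare them by AUC, and inspect random-forest feature importances. The feature blocks, sample sizes, and simulated labels below are placeholders and do not represent the tele-ICU data used in the study.

    ```python
# Schematic decompensation-prediction setup: combined feature blocks, two models,
# AUC comparison, and random-forest feature importances. Simulated placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 5000
vitals = rng.normal(size=(n, 6))        # e.g., summaries of HR, BP, RR, SpO2
labs = rng.normal(size=(n, 8))          # e.g., creatinine, lactate, ...
discrete = rng.integers(0, 2, size=(n, 4)).astype(float)
X = np.hstack([vitals, labs, discrete])
logit = 0.8 * vitals[:, 0] + 0.6 * labs[:, 1] + 0.5 * discrete[:, 0] - 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # simulated decompensation labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=300, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")

# feature importances indicate which combined inputs drive the forest's predictions
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top feature indices:", top.tolist())
    ```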
  5. This paper studies the Cox model with time-varying coefficients for cause-specific hazard functions when the causes of failure are subject to missingness. Inverse probability weighted and augmented inverse probability weighted estimators are investigated. The latter is a two-stage estimator that builds directly on the inverse probability weighted estimator and models available auxiliary variables to improve efficiency. The asymptotic properties of the two estimators are established. Hypothesis testing procedures are developed to test the null hypotheses that the covariate effects are zero and that the covariate effects are constant. We conduct simulation studies to examine the finite-sample properties of the proposed estimation and hypothesis testing procedures under various settings of the auxiliary variables and of the percentage of failure causes that are missing. These simulations demonstrate that the augmented inverse probability weighted estimators are more efficient than the inverse probability weighted estimators and that the proposed testing procedures have satisfactory size and power. The proposed methods are illustrated using data from the Mashi clinical trial, investigating the effect of randomization to formula feeding versus breastfeeding plus extended infant zidovudine prophylaxis on death due to mother-to-child HIV transmission in Botswana.
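    The inverse probability weighting idea underlying both estimators can be illustrated with a bare-bones example: model the probability that the failure cause is observed given auxiliary variables, then weight the complete cases by the inverse of that probability. The sketch below applies such weights only to a simple proportion on simulated data; the paper applies them inside a time-varying-coefficient Cox partial likelihood and adds an augmentation term, neither of which is reproduced here.

    ```python
# Bare-bones inverse probability weighting for a missing failure-cause indicator:
# the complete cases are reweighted by 1 / P(cause observed | auxiliary variables).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 4000
aux = rng.normal(size=(n, 2))                                 # auxiliary covariates
cause1 = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 0.8 * aux[:, 0]))))  # true cause label

# the cause is observed (R = 1) with probability depending on the auxiliaries,
# so the complete cases are not a random subsample
p_obs = 1 / (1 + np.exp(-(0.2 + 1.2 * aux[:, 1] - 0.7 * aux[:, 0])))
R = rng.binomial(1, p_obs)

naive = cause1[R == 1].mean()                                 # complete-case estimate

# IPW: estimate P(R = 1 | aux) and weight complete cases by its inverse
pi_hat = LogisticRegression().fit(aux, R).predict_proba(aux)[:, 1]
w = R / pi_hat
ipw = (w * cause1).sum() / w.sum()

print(f"true proportion of cause-1 failures: {cause1.mean():.3f}")
print(f"complete-case estimate:              {naive:.3f}")
print(f"IPW estimate:                        {ipw:.3f}")
    ```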