Matching on an estimated propensity score is frequently used to estimate the effects of treatments from observational data. Since the 1970s, different authors have proposed methods to combine matching at the design stage with regression adjustment at the analysis stage when estimating treatment effects for continuous outcomes. Previous work has consistently shown that the combination has generally superior statistical properties than either method by itself. In biomedical and epidemiological research, survival or time-to-event outcomes are common. We propose a method to combine regression adjustment and propensity score matching to estimate survival curves and hazard ratios based on estimating an imputed potential outcome under control for each successfully matched treated subject, which is accomplished using either an accelerated failure time parametric survival model or a Cox proportional hazard model that is fit to the matched control subjects. That is, a fitted model is then applied to the matched treated subjects to allow simulation of the missing potential outcome under control for each treated subject. Conventional survival analyses (e.g., estimation of survival curves and hazard ratios) can then be conducted using the observed outcome under treatment and the imputed outcome under control. We evaluated the repeated-sampling bias of the proposed methods using simulations. When using nearest neighbor matching, the proposed method resulted in decreased bias compared to crude analyses in the matched sample. We illustrate the method in an example prescribing beta-blockers at hospital discharge to patients hospitalized with heart failure.
In precision medicine, both predicting the disease susceptibility of an individual and forecasting its disease-free survival are areas of key research. Besides the classical epidemiological predictor variables, data from multiple (omic) platforms are increasingly available. To integrate this wealth of information, we propose new methodology to combine both cooperative learning, a recent approach to leverage the predictive power of several datasets, and polygenic hazard score models. Polygenic hazard score models provide a practitioner with a more differentiated view of the predicted disease-free survival than the one given by merely a point estimate, for instance computed with a polygenic risk score. Our aim is to leverage the advantages of cooperative learning for the computation of polygenic hazard score models via Cox’s proportional hazard model, thereby improving the prediction of the disease-free survival. In our experimental study, we apply our methodology to forecast the disease-free survival for Alzheimer’s disease (AD) using three layers of data. One layer contains epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status and 10 leading principal components. Another layer contains selected genomic loci, and the last layer contains methylation data for selected CpG sites. We demonstrate that the survival curves computed via cooperative learning yield an AUC of around $0.7$, above the state-of-the-art performance of its competitors. Importantly, the proposed methodology returns (1) a linear score that can be easily interpreted (in contrast to machine learning approaches), and (2) a weighting of the predictive power of the involved data layers, allowing for an assessment of the importance of each omic (or other) platform. Similarly to polygenic hazard score models, our methodology also allows one to compute individual survival curves for each patient.
more » « less- PAR ID:
- 10512414
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Briefings in Bioinformatics
- Volume:
- 25
- Issue:
- 4
- ISSN:
- 1467-5463
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract Atopic dermatitis (AD) is a common skin disease in childhood whose diagnosis requires expertise in dermatology. Recent studies have indicated that host genes–microbial interactions in the gut contribute to human diseases including AD. We sought to develop an accurate and automated pipeline for AD diagnosis based on transcriptome and microbiota data. Using these data of 161 subjects including AD patients and healthy controls, we trained a machine learning classifier to predict the risk of AD. We found that the classifier could accurately differentiate subjects with AD and healthy individuals based on the omics data with an average F1-score of 0.84. With this classifier, we also identified a set of 35 genes and 50 microbiota features that are predictive for AD. Among the selected features, we discovered at least three genes and three microorganisms directly or indirectly associated with AD. Although further replications in other cohorts are needed, our findings suggest that these genes and microbiota features may provide novel biological insights and may be developed into useful biomarkers of AD prediction.
-
Beiko, Robert G (Ed.)
ABSTRACT Inflammatory bowel disease (IBD) is characterized by complex etiology and a disrupted colonic ecosystem. We provide a framework for the analysis of multi-omic data, which we apply to study the gut ecosystem in IBD. Specifically, we train and validate models using data on the metagenome, metatranscriptome, virome, and metabolome from the Human Microbiome Project 2 IBD multi-omic database, with 1,785 repeated samples from 130 individuals (103 cases and 27 controls). After splitting the participants into training and testing groups, we used mixed-effects least absolute shrinkage and selection operator regression to select features for each omic. These features, with demographic covariates, were used to generate separate single-omic prediction scores. All four single-omic scores were then combined into a final regression to assess the relative importance of the individual omics and the predictive benefits when considered together. We identified several species, pathways, and metabolites known to be associated with IBD risk, and we explored the connections between data sets. Individually, metabolomic and viromic scores were more predictive than metagenomics or metatranscriptomics, and when all four scores were combined, we predicted disease diagnosis with a Nagelkerke’s
R 2of 0.46 and an area under the curve of 0.80 (95% confidence interval: 0.63, 0.98). Our work supports that some single-omic models for complex traits are more predictive than others, that incorporating multiple omic data sets may improve prediction, and that each omic data type provides a combination of unique and redundant information. This modeling framework can be extended to other complex traits and multi-omic data sets.IMPORTANCE Complex traits are characterized by many biological and environmental factors, such that multi-omic data sets are well-positioned to help us understand their underlying etiologies. We applied a prediction framework across multiple omics (metagenomics, metatranscriptomics, metabolomics, and viromics) from the gut ecosystem to predict inflammatory bowel disease (IBD) diagnosis. The predicted scores from our models highlighted key features and allowed us to compare the relative utility of each omic data set in single-omic versus multi-omic models. Our results emphasized the importance of metabolomics and viromics over metagenomics and metatranscriptomics for predicting IBD status. The greater predictive capability of metabolomics and viromics is likely because these omics serve as markers of lifestyle factors such as diet. This study provides a modeling framework for multi-omic data, and our results show the utility of combining multiple omic data types to disentangle complex disease etiologies and biological signatures.
-
Abstract In the Alzheimer’s disease (AD) continuum, the prodromal state of mild cognitive impairment (MCI) precedes AD dementia and identifying MCI individuals at risk of progression is important for clinical management. Our goal was to develop generalizable multivariate models that integrate high-dimensional data (multimodal neuroimaging and cerebrospinal fluid biomarkers, genetic factors, and measures of cognitive resilience) for identification of MCI individuals who progress to AD within 3 years. Our main findings were i) we were able to build generalizable models with clinically relevant accuracy (~93%) for identifying MCI individuals who progress to AD within 3 years; ii) markers of AD pathophysiology (amyloid, tau, neuronal injury) accounted for large shares of the variance in predicting progression; iii) our methodology allowed us to discover that expression of
CR1 (complement receptor 1), an AD susceptibility gene involved in immune pathways, uniquely added independent predictive value. This work highlights the value of optimized machine learning approaches for analyzing multimodal patient information for making predictive assessments. -
OBJECTIVE To establish a polyexposure score (PXS) for type 2 diabetes (T2D) incorporating 12 nongenetic exposures and examine whether a PXS and/or a polygenic risk score (PGS) improves diabetes prediction beyond traditional clinical risk factors.
RESEARCH DESIGN AND METHODS We identified 356,621 unrelated individuals from the UK Biobank of White British ancestry with no prior diagnosis of T2D and normal HbA1c levels. Using self-reported and hospital admission information, we deployed a machine learning procedure to select the most predictive and robust factors out of 111 nongenetically ascertained exposure and lifestyle variables for the PXS in prospective T2D. We computed the clinical risk score (CRS) and PGS by taking a weighted sum of eight established clinical risk factors and >6 million single nucleotide polymorphisms, respectively.
RESULTS In the study population, 7,513 had incident T2D. The C-statistics for the PGS, PXS, and CRS models were 0.709, 0.762, and 0.839, respectively. Individuals in the top 10% of PGS, PXS, and CRS had 2.00-, 5.90-, and 9.97-fold greater risk, respectively, compared to the remaining population. Addition of PGS and PXS to CRS improved T2D classification accuracy, with a continuous net reclassification index of 15.2% and 30.1% for cases, respectively, and 7.3% and 16.9% for controls, respectively.
CONCLUSIONS For T2D, the PXS provides modest incremental predictive value over established clinical risk factors. However, the concept of PXS merits further consideration in T2D risk stratification and is likely to be useful in other chronic disease risk prediction models.