

Title: Bayesian Classification of Tumours by Using Gene Expression Data
Summary

Precise classification of tumours is critical for the diagnosis and treatment of cancer. Diagnostic pathology has traditionally relied on macroscopic and microscopic histology and tumour morphology as the basis for the classification of tumours. Current classification frameworks, however, cannot discriminate between tumours that have similar histopathologic features but differ in clinical course and in response to treatment. In recent years, there has been a move towards the use of complementary deoxyribonucleic acid microarrays for the classification of tumours. These high-throughput assays provide relative messenger ribonucleic acid expression measurements simultaneously for thousands of genes. A key statistical task is to perform classification on the basis of these expression patterns. Gene expression profiles may offer more information than classical morphology and may provide an alternative to classical tumour diagnosis schemes. The paper considers several Bayesian classification methods based on reproducing kernel Hilbert spaces for the analysis of microarray data. We consider the logistic likelihood as well as likelihoods related to support vector machine models. It is shown through simulation and examples that support vector machine models with multiple shrinkage parameters produce fewer misclassification errors than several existing classical methods as well as Bayesian methods based on the logistic likelihood or those involving only one shrinkage parameter.
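
As a concrete point of reference for the kernel methods discussed above, the following sketch fits a reproducing-kernel (RBF) support vector machine to a simulated gene-expression matrix and tunes its regularization parameters by cross-validation. The data dimensions, signal structure, and parameter grid are hypothetical, and this is a classical (non-Bayesian) baseline rather than the paper's Bayesian RKHS procedure.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Simulated expression data: 100 tumours x 2000 genes (hypothetical sizes),
    # with class signal carried by the first 20 genes.
    n_samples, n_genes = 100, 2000
    y = rng.integers(0, 2, size=n_samples)
    X = rng.standard_normal((n_samples, n_genes))
    X[:, :20] += 1.5 * y[:, None]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # RBF-kernel SVM; C and gamma act as shrinkage/smoothing parameters and are
    # tuned here by cross-validation rather than being given a posterior distribution.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 1e-4, 1e-3]}
    search = GridSearchCV(model, grid, cv=5).fit(X_tr, y_tr)

    print("best parameters:", search.best_params_)
    print("held-out accuracy:", search.score(X_te, y_te))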

 
NSF-PAR ID: 10405658
Publisher / Repository: Oxford University Press
Journal Name: Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume: 67
Issue: 2
ISSN: 1369-7412
Pages: 219-234
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    STUDY QUESTION

    Can we derive adequate models to predict the probability of conception among couples actively trying to conceive?

    SUMMARY ANSWER

    Leveraging data collected from female participants in a North American preconception cohort study, we developed models that predict pregnancy with an area under the receiver operating characteristic curve (AUC) of ∼70%.

    WHAT IS KNOWN ALREADY

    Earlier work has focused primarily on identifying individual risk factors for infertility. Several predictive models have been developed in subfertile populations, with relatively low discrimination (AUC: 59–64%).

    STUDY DESIGN, SIZE, DURATION

    Study participants were female, aged 21–45 years, residents of the USA or Canada, not using fertility treatment, and actively trying to conceive at enrollment (2013–2019). Participants completed a baseline questionnaire at enrollment and follow-up questionnaires every 2 months for up to 12 months or until conception. We used data from 4133 participants with no more than one menstrual cycle of pregnancy attempt at study entry.

    PARTICIPANTS/MATERIALS, SETTING, METHODS

    On the baseline questionnaire, participants reported data on sociodemographic factors, lifestyle and behavioral factors, diet quality, medical history and selected male partner characteristics. A total of 163 predictors were considered in this study. We implemented regularized logistic regression, support vector machines, neural networks and gradient boosted decision trees to derive models predicting the probability of pregnancy: (i) within fewer than 12 menstrual cycles of pregnancy attempt time (Model I), and (ii) within 6 menstrual cycles of pregnancy attempt time (Model II). Cox models were used to predict the probability of pregnancy within each menstrual cycle for up to 12 cycles of follow-up (Model III). We assessed model performance using the AUC and the weighted-F1 score for Models I and II, and the concordance index for Model III (a minimal code sketch of a Model I-style pipeline is given at the end of this abstract).

    MAIN RESULTS AND THE ROLE OF CHANCE

    Model I and II AUCs were 70% and 66%, respectively, in parsimonious models, and the concordance index for Model III was 63%. The predictors that were positively associated with pregnancy in all models were: having previously breastfed an infant and using multivitamins or folic acid supplements. The predictors that were inversely associated with pregnancy in all models were: female age, female BMI and history of infertility. Among nulligravid women with no history of infertility, the most important predictors were: female age, female BMI, male BMI, use of a fertility app, attempt time at study entry and perceived stress.

    LIMITATIONS, REASONS FOR CAUTION

    Reliance on self-reported predictor data could have introduced misclassification, which would likely be non-differential with respect to the pregnancy outcome given the prospective design. In addition, we cannot be certain that all relevant predictor variables were considered. Finally, though we validated the models using split-sample replication techniques, we did not conduct an external validation study.

    WIDER IMPLICATIONS OF THE FINDINGS

    Given a wide range of predictor data, machine learning algorithms can be leveraged to analyze epidemiologic data and predict the probability of conception with discrimination that exceeds earlier work.

    STUDY FUNDING/COMPETING INTEREST(S)

    The research was partially supported by the U.S. National Science Foundation (under grants DMS-1664644, CNS-1645681 and IIS-1914792) and the National Institutes of Health (under grants R01 GM135930 and UL54 TR004130). In the last 3 years, L.A.W. has received in-kind donations for primary data collection in PRESTO from FertilityFriend.com, Kindara.com, Sandstone Diagnostics and Swiss Precision Diagnostics. L.A.W. also serves as a fibroid consultant to AbbVie, Inc. The other authors declare no competing interests.

    TRIAL REGISTRATION NUMBER

    N/A.
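
    The sketch below illustrates, in rough outline, the kind of Model I/II pipeline described in the methods above: an L2-regularized logistic regression for a binary conception outcome, evaluated by the AUC and the weighted F1 score. The simulated feature matrix, outcome, and hyperparameter grid are hypothetical placeholders, not the study's actual 163 questionnaire predictors.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import f1_score, roc_auc_score
        from sklearn.model_selection import GridSearchCV, train_test_split
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        rng = np.random.default_rng(1)

        # Hypothetical stand-in for the baseline-questionnaire predictors.
        n, p = 4133, 163
        X = rng.standard_normal((n, p))
        y = rng.binomial(1, 0.7, size=n)   # 1 = conceived within the attempt window

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

        # L2-regularized logistic regression; the penalty strength C is chosen by
        # cross-validation on the training split.
        pipe = make_pipeline(StandardScaler(),
                             LogisticRegression(penalty="l2", max_iter=2000))
        grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
        clf = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc").fit(X_tr, y_tr)

        auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
        wf1 = f1_score(y_te, clf.predict(X_te), average="weighted")
        print(f"AUC = {auc:.2f}, weighted F1 = {wf1:.2f}")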

     
  2. Introduction

    Predictive models have been used to aid early diagnosis of polycystic ovary syndrome (PCOS), though existing models are based on small sample sizes and limited to fertility clinic populations. We built a predictive model using machine learning algorithms based on an outpatient population at risk for PCOS to predict risk and facilitate earlier diagnosis, particularly among those who meet diagnostic criteria but have not received a diagnosis.

    Methods

    This is a retrospective cohort study of a safety-net hospital's electronic health records (EHR) from 2003-2016. The study population included 30,601 women aged 18-45 years without concurrent endocrinopathy who had any visit to Boston Medical Center for primary care, obstetrics and gynecology, endocrinology, family medicine, or general internal medicine. Four prediction outcomes were assessed for PCOS: the first was a PCOS ICD-9 diagnosis, and the remaining model outcomes were algorithm-defined PCOS. The latter were based on the Rotterdam criteria, merging laboratory values, radiographic imaging, and ICD data from the EHR to define irregular menstruation, hyperandrogenism, and polycystic ovarian morphology on ultrasound.

    Results

    We developed predictive models using four machine learning methods: logistic regression, support vector machines, gradient boosted trees, and random forests. Hormone values (follicle-stimulating hormone, luteinizing hormone, estradiol, and sex hormone binding globulin) were combined to create a multilayer perceptron score using a neural network classifier. Prediction of PCOS prior to clinical diagnosis in an out-of-sample test set of patients achieved average AUCs of 85%, 81%, 80%, and 82% in Models I, II, III and IV, respectively. Significant positive predictors of PCOS diagnosis across models included hormone levels and obesity; negative predictors included gravidity and a positive bHCG test. A simplified sketch of this model comparison appears at the end of this abstract.

    Conclusion

    Machine learning algorithms were used to predict PCOS based on a large at-risk population. This approach may guide early detection of PCOS within EHR-interfaced populations to facilitate counseling and interventions that may reduce long-term health consequences. Our model illustrates the potential benefits of an artificial intelligence-enabled provider assistance tool that can be integrated into the EHR to reduce delays in diagnosis. However, model validation in other hospital-based populations is necessary.
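
    The following sketch mirrors, in simplified form, the model comparison described in the Results above: logistic regression, random forest, and gradient boosted tree classifiers fitted to a hypothetical EHR-style feature matrix and compared by out-of-sample AUC. Feature content, sample sizes, and signal structure are invented for illustration.

        import numpy as np
        from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(2)

        # Hypothetical EHR-style features (hormone levels, BMI, gravidity, ...)
        # and a binary label standing in for algorithm-defined PCOS.
        n, p = 5000, 40
        X = rng.standard_normal((n, p))
        p_true = 1 / (1 + np.exp(-(1.5 * X[:, 0] - X[:, 1])))
        y = rng.binomial(1, p_true)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

        models = {
            "logistic regression": LogisticRegression(max_iter=1000),
            "random forest": RandomForestClassifier(n_estimators=300, random_state=2),
            "gradient boosted trees": GradientBoostingClassifier(random_state=2),
        }

        # Out-of-sample AUC for each model, mirroring the comparison described above.
        for name, model in models.items():
            model.fit(X_tr, y_tr)
            auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
            print(f"{name}: AUC = {auc:.2f}")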

     
  3. Machine learning (ML) methods, such as artificial neural networks (ANN), k-nearest neighbors (kNN), random forests (RF), support vector machines (SVM), and boosted decision trees (DTs), may offer stronger predictive performance than more traditional, parametric methods, such as linear regression, multiple linear regression, and logistic regression (LR), for specific mapping and modeling tasks. However, this increased performance is often accompanied by increased model complexity and decreased interpretability, resulting in critiques of their “black box” nature, which highlights the need for algorithms that can offer both strong predictive performance and interpretability. This is especially true when the global model and predictions for specific data points need to be explainable in order for the model to be of use. Explainable boosting machines (EBMs), an augmentation and refinement of generalized additive models (GAMs), have been proposed as an empirical modeling method that offers both interpretable results and strong predictive performance. The trained model can be graphically summarized as a set of functions relating each predictor variable to the dependent variable, along with heat maps representing interactions between selected pairs of predictor variables. In this study, we assess EBMs for predicting the likelihood or probability of slope failure occurrence based on digital terrain characteristics in four separate Major Land Resource Areas (MLRAs) in the state of West Virginia, USA, and compare the results to those obtained with LR, kNN, RF, and SVM. EBM provided predictive accuracies comparable to RF and SVM and better than LR and kNN. Interpretability is added by the generated functions and visualizations for each predictor variable and each included interaction between pairs of predictor variables, by the estimation of variable importance based on average mean absolute scores, and by the scores provided for each predictor variable for new predictions; however, additional work is needed to quantify how these outputs may be impacted by variable correlation, inclusion of interaction terms, and large feature spaces. Further exploration of EBM is merited for geohazard mapping and modeling in particular and spatial predictive mapping and modeling in general, especially when the value or use of the resulting predictions would be greatly enhanced by improved interpretability globally and availability of prediction explanations at each cell or aggregating unit within the mapped or modeled extent.
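
    As an illustration of the workflow described above, the sketch below fits an EBM with a small number of pairwise interactions to a hypothetical terrain-attribute dataset, assuming the open-source interpret package; predictor content, sample sizes, and the outcome signal are invented, and the explain_global() call stands in for the shape-function and heat-map summaries discussed in the text.

        import numpy as np
        from interpret.glassbox import ExplainableBoostingClassifier
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(3)

        # Hypothetical terrain attributes (slope, curvature, roughness, ...) and a
        # binary slope-failure label with a built-in pairwise interaction.
        n, p = 2000, 12
        X = rng.standard_normal((n, p))
        p_true = 1 / (1 + np.exp(-(X[:, 0] + 0.8 * X[:, 1] * X[:, 2])))
        y = rng.binomial(1, p_true)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

        # EBM: a generalized additive model fitted by cyclic gradient boosting, with a
        # limited number of pairwise interaction terms retained for interpretability.
        ebm = ExplainableBoostingClassifier(interactions=5, random_state=3)
        ebm.fit(X_tr, y_tr)

        print("held-out AUC:", roc_auc_score(y_te, ebm.predict_proba(X_te)[:, 1]))

        # Per-feature shape functions, interaction heat maps, and term importances;
        # in a notebook, interpret's show(ebm_global) renders them interactively.
        ebm_global = ebm.explain_global()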
  4. Abstract

    In this paper, we propose a sparse Bayesian procedure with global and local (GL) shrinkage priors for the problems of variable selection and classification in high‐dimensional logistic regression models. In particular, we consider two types of GL shrinkage priors for the regression coefficients, the horseshoe (HS) prior and the normal‐gamma (NG) prior, and then specify a correlated prior for the binary vector to distinguish models with the same size. The GL priors are then combined with mixture representations of the logistic distribution to construct a hierarchical Bayes model that allows efficient implementation of a Markov chain Monte Carlo (MCMC) algorithm to generate samples from the posterior distribution. We carry out simulations to compare the finite sample performance of the proposed Bayesian method with existing Bayesian methods in terms of the accuracy of variable selection and prediction. Finally, two real‐data applications are provided for illustrative purposes.
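
    A minimal sketch of a global-local shrinkage model in this spirit, assuming PyMC: a logistic regression with a horseshoe (half-Cauchy global and local scales) prior on the coefficients, sampled with PyMC's default NUTS sampler rather than the mixture-representation MCMC developed in the paper, and without the correlated prior on the binary model-indicator vector. Data and dimensions are hypothetical.

        import numpy as np
        import pymc as pm

        rng = np.random.default_rng(4)

        # Hypothetical high-dimensional design with a sparse true coefficient vector.
        n, p = 100, 50
        X = rng.standard_normal((n, p))
        true_beta = np.zeros(p)
        true_beta[:3] = [2.0, -1.5, 1.0]
        y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

        with pm.Model() as hs_logit:
            tau = pm.HalfCauchy("tau", beta=1.0)            # global shrinkage scale
            lam = pm.HalfCauchy("lam", beta=1.0, shape=p)   # local (per-coefficient) scales
            beta = pm.Normal("beta", mu=0.0, sigma=tau * lam, shape=p)
            intercept = pm.Normal("intercept", mu=0.0, sigma=5.0)
            eta = intercept + pm.math.dot(X, beta)
            pm.Bernoulli("y_obs", p=pm.math.sigmoid(eta), observed=y)
            idata = pm.sample(1000, tune=1000, target_accept=0.9)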

     
  5. Abstract

    IRF family genes have been shown to be crucial in tumorigenesis and tumour immunity. However, the role of IRFs in systematic pan‐cancer assessment and in predicting the efficacy of tumour therapy remains unknown. In this work, we performed a systematic analysis of IRF family genes in 33 tumour types, including expression profiles, genomics and clinical characteristics. We then applied Single‐Sample Gene‐Set Enrichment Analysis (ssGSEA) to calculate IRF‐scores and analysed the impact of IRF‐scores on tumour progression, immune infiltration and treatment efficacy. Our results showed that genomic alterations, including SNPs, CNVs and DNA methylation, can lead to dysregulation of IRF expression in tumours and participate in the regulation of tumorigenesis in multiple cancers. IRF‐scores differed significantly between normal and tumour samples in 12 cancer types, and their impact on tumour prognosis and immune infiltration depended on tumour type. IRF expression was correlated with drug sensitivity and with the expression of immune checkpoints and immune cell infiltration, suggesting that dysregulation of IRF family expression may be a critical factor affecting tumour drug response. Our study comprehensively characterizes the genomic and clinical profile of IRFs in pan‐cancer and highlights their reliability and potential value as predictive markers of oncology drug efficacy. This may provide new ideas for future personalized oncology treatment.
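
    ssGSEA proper scores each sample with a weighted Kolmogorov-Smirnov-like running sum over the ranked gene list; as a rough, hypothetical stand-in, the sketch below computes a simple rank-based single-sample score for the IRF gene family on a randomly generated expression matrix.

        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(5)

        # Hypothetical expression matrix: 500 genes x 50 tumour samples.
        genes = [f"GENE{i}" for i in range(491)] + [f"IRF{i}" for i in range(1, 10)]
        expr = pd.DataFrame(rng.lognormal(size=(500, 50)),
                            index=genes,
                            columns=[f"sample_{j}" for j in range(50)])

        irf_set = [f"IRF{i}" for i in range(1, 10)]   # IRF1-IRF9

        # Simplified single-sample score: mean within-sample expression rank of the
        # IRF genes, rescaled to [0, 1]; a crude stand-in for the ssGSEA enrichment score.
        ranks = expr.rank(axis=0)
        irf_score = ranks.loc[irf_set].mean(axis=0) / len(expr)

        print(irf_score.sort_values(ascending=False).head())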

     