Title: Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm
Abstract Background

Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict the presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods.

Methods

Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance by receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction.
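The cohort split and ROC-based assessment described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the sample IDs, the random seed, the toy labels and scores, and the helper name `roc_auc` are all hypothetical, and the AUC is computed with the standard pairwise-ranking formulation rather than any specific software the study may have used.

```python
import random

# Hypothetical sample IDs for the 134 patients; the seed is an arbitrary choice.
rng = random.Random(0)
sample_ids = list(range(134))
rng.shuffle(sample_ids)
train_ids, test_ids = sample_ids[:94], sample_ids[94:]  # n = 94 and n = 40


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative sample,
    with ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): 1 = IA, 0 = control.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)  # 11 of 12 positive/negative pairs ranked correctly
```

The pairwise form is quadratic in sample count, which is acceptable for a 40-sample testing cohort; larger datasets would normally use a sort-based implementation.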

Results

Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts had a maximum accuracy of 90% in the testing cohort. Across all methods, testing performance had an average area under the ROC curve (AUC) of 0.97, an improvement over our previous models. The Random Forest model performed best across both training and testing cohorts. RT-qPCR confirmed expression differences in 7 of 9 genes tested. Gene ontology and IPA network analyses performed on the 37 model genes reflected dysregulated inflammation, cell signaling, and apoptosis processes. In our data, demographics and comorbidities did not affect model performance.

Conclusions

We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and to further investigate the effect of covariates.

 
NSF-PAR ID:
10306894
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Journal of Translational Medicine
Volume:
18
Issue:
1
ISSN:
1479-5876
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. BACKGROUND:

    Classification of perioperative risk is important for patient care, resource allocation, and guiding shared decision-making. Using discriminative features from the electronic health record (EHR), machine-learning algorithms can create digital phenotypes among heterogeneous populations, representing distinct patient subpopulations grouped by shared characteristics, from which we can personalize care, anticipate clinical care trajectories, and explore therapies. We hypothesized that digital phenotypes in preoperative settings are associated with postoperative adverse events including in-hospital and 30-day mortality, 30-day surgical redo, intensive care unit (ICU) admission, and hospital length of stay (LOS).

    METHODS:

    We identified all laminectomies, colectomies, and thoracic surgeries performed over a 9-year period from a large hospital system. Seventy-seven readily extractable preoperative features were first selected from clinical consensus, including demographics, medical history, and lab results. Three surgery-specific datasets were built and split into derivation and validation cohorts using chronological occurrence. Consensus k-means clustering was performed independently on each derivation cohort, from which phenotypes’ characteristics were explored. Cluster assignments were used to train a random forest model to assign patient phenotypes in validation cohorts. We reconducted descriptive analyses on validation cohorts to confirm the similarity of patient characteristics with derivation cohorts, and quantified the association of each phenotype with postoperative adverse events by using the area under receiver operating characteristic curve (AUROC). We compared our approach to the American Society of Anesthesiologists (ASA) score alone and investigated a combination of our phenotypes with the ASA score.

    RESULTS:

    A total of 7251 patients met inclusion criteria, of whom 2770 were held out in a validation dataset based on chronological occurrence. Using segmentation metrics and clinical consensus, 3 distinct phenotypes were created for each surgery. The main features used for segmentation included urgency of the procedure, preoperative LOS, age, and comorbidities. The most relevant characteristics varied for each of the 3 surgeries. Low-risk phenotype alpha was the most common (2039 of 2770, 74%), while high-risk phenotype gamma was the rarest (302 of 2770, 11%). Adverse outcomes progressively increased from phenotypes alpha to gamma, including 30-day mortality (0.3%, 2.1%, and 6.0%, respectively), in-hospital mortality (0.2%, 2.3%, and 7.3%), and prolonged hospital LOS (3.4%, 22.1%, and 25.8%). When combined with the ASA score, digital phenotypes achieved higher AUROC than the ASA score alone (hospital mortality: 0.91 vs 0.84; prolonged hospitalization: 0.80 vs 0.71).

    CONCLUSIONS:

    For 3 frequently performed surgeries, we identified 3 digital phenotypes. The typical profiles of each phenotype were described and could be used to anticipate adverse postoperative events.

     
  2. Abstract Motivation

    Accurate estimation of transcript isoform abundance is critical for downstream transcriptome analyses and can lead to precise molecular mechanisms for understanding complex human diseases, like cancer. Simplex mRNA Sequencing (RNA-Seq) based isoform quantification approaches are facing the challenges of inherent sampling bias and unidentifiable read origins. A large-scale experiment shows that the consistency between RNA-Seq and other mRNA quantification platforms is relatively low at the isoform level compared to the gene level. In this project, we developed a platform-integrated model for transcript quantification (IntMTQ) to improve the performance of RNA-Seq on isoform expression estimation. IntMTQ, which benefits from the mRNA expressions reported by the other platforms, provides more precise RNA-Seq-based isoform quantification and leads to more accurate molecular signatures for disease phenotype prediction.

    Results

    In the experiments to assess the quality of isoform expression estimated by IntMTQ, we designed three tasks for clustering and classification of 46 cancer cell lines with four different mRNA quantification platforms, including the newly developed NanoString nCounter technology. The results demonstrate that the isoform expressions learned by IntMTQ consistently provide more and better molecular features for downstream analyses compared with five baseline algorithms which consider RNA-Seq data only. An independent RT-qPCR experiment on seven genes in twelve cancer cell lines showed that IntMTQ improved overall transcript quantification. The platform-integrated algorithms could be applied to large-scale cancer studies, such as The Cancer Genome Atlas (TCGA), with both RNA-Seq and array-based platforms available.

    Availability and implementation

    Source code is available at: https://github.com/CompbioLabUcf/IntMTQ.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Background:

    Remote patient monitoring (RPM) programs augment type 1 diabetes (T1D) care based on retrospective continuous glucose monitoring (CGM) data. Few methods are available to estimate the likelihood of a patient experiencing clinically significant hypoglycemia within one week.

    Methods:

    We developed a machine learning model to estimate the probability that a patient will experience a clinically significant hypoglycemic event, defined as CGM readings below 54 mg/dL for at least 15 consecutive minutes, within one week. The model takes as input the patient’s CGM time series over a given week, and outputs the predicted probability of a clinically significant hypoglycemic event the following week. We used 10-fold cross-validation and external validation (testing on cohorts different from the training cohort) to evaluate performance. We used CGM data from three different cohorts of patients with T1D: REPLACE-BG (226 patients), Juvenile Diabetes Research Foundation (JDRF; 355 patients) and Tidepool (120 patients).

    Results:

    In 10-fold cross-validation, the average area under the receiver operating characteristic curve (ROC-AUC) was 0.77 (standard deviation [SD]: 0.0233) on the REPLACE-BG cohort, 0.74 (SD: 0.0188) on the JDRF cohort, and 0.76 (SD: 0.02) on the Tidepool cohort. In external validation, the average ROC-AUC across the three cohorts was 0.74 (SD: 0.0262).

    Conclusions:

    We developed a machine learning algorithm to estimate the probability of a clinically significant hypoglycemic event within one week. Predictive algorithms may provide diabetes care providers using RPM with additional context when prioritizing T1D patients for review.

     
  4. Abstract Background

    The pond snail, Lymnaea stagnalis (L. stagnalis), has served as a valuable model organism for neurobiology studies due to its simple and easily accessible central nervous system (CNS). L. stagnalis has been widely used to study neuronal networks and has recently gained popularity for the study of aging and neurodegenerative diseases. However, previous transcriptome studies of L. stagnalis CNS have been carried out exclusively on adult L. stagnalis. As part of our ongoing effort studying L. stagnalis neuronal growth and connectivity at various developmental stages, we provide the first age-specific transcriptome analysis and gene annotation of young (3 months), adult (6 months), and old (18 months) L. stagnalis CNS.

    Results

    Using the above three age cohorts, our study generated 55–69 million 150 bp paired-end RNA sequencing reads using the Illumina NovaSeq 6000 platform. Of these reads, ~74% were successfully mapped to the reference genome of L. stagnalis. Our reference-based transcriptome assembly predicted 42,478 gene loci, of which 37,661 genes encode coding sequences (CDS) of at least 100 codons. In addition, we provide gene annotations using Blast2GO and functional annotations using Pfam for ~95% of these sequences, contributing to the largest number of annotated genes in L. stagnalis CNS so far. Moreover, among 242 previously cloned L. stagnalis genes, we were able to match ~87% of them in our transcriptome assembly, indicating a high percentage of gene coverage. The expressional differences for innexins, FMRFamide, and molluscan insulin peptide genes were validated by real-time qPCR. Lastly, our transcriptomic analyses revealed distinct, age-specific gene clusters, differentially expressed genes, and enriched pathways in young, adult, and old CNS. More specifically, our data show significant changes in expression of critical genes involved in transcription factors, metabolism (e.g. cytochrome P450), extracellular matrix constituents, and signaling receptors and transduction (e.g. receptors for acetylcholine, N-Methyl-D-aspartic acid, and serotonin), as well as stress- and disease-related genes in young compared to either adult or old snails.

    Conclusions

    Together, these datasets are the largest and most up-to-date L. stagnalis CNS transcriptomes, which will serve as a resource for future molecular studies and functional annotation of transcripts and genes in L. stagnalis.
  5. Abstract

    The strain on healthcare resources brought forth by the recent COVID-19 pandemic has highlighted the need for efficient resource planning and allocation through the prediction of future consumption. Machine learning can predict resource utilization such as the need for hospitalization based on past medical data stored in electronic medical records (EMR). We conducted this study on 3194 patients (46% male with mean age 56.7 (±16.8), 56% African American, 7% Hispanic) flagged as COVID-19 positive cases in 12 centers under Emory Healthcare network from February 2020 to September 2020, to assess whether a COVID-19 positive patient’s need for hospitalization can be predicted at the time of RT-PCR test using the EMR data prior to the test. Five main modalities of EMR, i.e., demographics, medication, past medical procedures, comorbidities, and laboratory results, were used as features for predictive modeling, both individually and fused together using late, middle, and early fusion. Models were evaluated in terms of precision, recall, and F1-score (within 95% confidence interval). The early fusion model is the most effective predictor with 84% overall F1-score [CI 82.1–86.1]. The predictive performance of the model drops by 6% when using recent clinical data while omitting the long-term medical history. Feature importance analysis indicates that history of cardiovascular disease, emergency room visits in the year prior to testing, and demographic factors are predictive of the disease trajectory. We conclude that fusion modeling using medical history and current treatment data can forecast the need for hospitalization for patients infected with COVID-19 at the time of the RT-PCR test.

     
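The event definition used in item 3 above (CGM readings below 54 mg/dL for at least 15 consecutive minutes) can be sketched as a labeling function. This is an illustration only, not the authors' code: the function name, the `(minute, mg/dL)` input format, and the choice to measure duration from the first to the latest sub-threshold reading in a run are assumptions, and the toy readings are made up.

```python
from typing import Iterable, Tuple

THRESHOLD_MG_DL = 54.0   # threshold stated in the abstract
MIN_DURATION_MIN = 15.0  # minimum duration stated in the abstract


def has_significant_hypo(readings: Iterable[Tuple[float, float]]) -> bool:
    """Return True if the series contains a clinically significant event.

    `readings` is assumed to be time-ordered (minute, glucose mg/dL) pairs.
    Assumption: a run's duration is the time between its first and its
    latest consecutive sub-threshold reading; sensor gaps are not handled.
    """
    run_start = None
    for minute, glucose in readings:
        if glucose < THRESHOLD_MG_DL:
            if run_start is None:
                run_start = minute  # a new sub-threshold run begins here
            if minute - run_start >= MIN_DURATION_MIN:
                return True
        else:
            run_start = None  # any reading at or above threshold ends the run
    return False


# Toy 5-minute series (hypothetical values).
week_with_event = [(0, 80), (5, 53), (10, 50), (15, 48), (20, 52), (25, 70)]
week_without_event = [(0, 80), (5, 53), (10, 50), (15, 60), (20, 52), (25, 70)]

label_a = has_significant_hypo(week_with_event)
label_b = has_significant_hypo(week_without_event)
```

A binary label of this kind, computed for the following week, is the sort of target a probability model could be trained against; how the study handled missing readings or sensor gaps is not stated in the abstract.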