- Publication Date:
- NSF-PAR ID:
- Journal Name: 2019 IEEE International Conference on Bioinformatics and Biomedicine (IEEE BIBM)
- Page Range or eLocation-ID: 2825 to 2831
- Sponsoring Org: National Science Foundation
More Like this
Background: Long non-coding RNAs (lncRNAs) play a vital role in changing the expression profiles of various target genes that lead to cancer development. Thus, identifying prognostic lncRNAs related to different cancers might help in developing cancer therapy. Method: To discover the critical lncRNAs that can identify the origin of different cancers, we propose using the concrete autoencoder (CAE), a state-of-the-art deep learning algorithm, in an unsupervised setting, where it efficiently identifies a subset of the most informative features. However, because CAE is stochastic, it does not identify the same features from run to run. To address this issue, we propose a multi-run CAE (mrCAE) that identifies a stable set of features. The assumption is that a feature appearing in multiple runs carries more meaningful information about the data under consideration. The genome-wide lncRNA expression profiles of 12 different types of cancers, with a total of 4768 samples available in The Cancer Genome Atlas (TCGA), were analyzed to discover the key lncRNAs. The lncRNAs identified by multiple runs of CAE were added to a final list of key lncRNAs that are capable of identifying the 12 different cancers. Results: Our results showed that mrCAE performs better in feature selection than single-run CAE, standard …
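The multi-run idea described above can be illustrated with a small sketch. It assumes that "appearing in multiple runs" is implemented as a simple frequency threshold, which the abstract does not confirm; the function name `stable_features`, the `min_runs` parameter, and the lncRNA identifiers are hypothetical, and the per-run feature lists are made up rather than produced by a real CAE.

```python
from collections import Counter


def stable_features(runs, min_runs):
    """Return features selected in at least `min_runs` of the runs.

    `runs` is a list of feature lists, one per stochastic selector run.
    A feature is counted at most once per run. The result is ordered by
    how many runs selected the feature (most frequent first), with ties
    broken alphabetically.
    """
    counts = Counter(f for selected in runs for f in set(selected))
    kept = [f for f, c in counts.items() if c >= min_runs]
    return sorted(kept, key=lambda f: (-counts[f], f))


# Made-up output of three hypothetical selector runs (not real CAE results).
example_runs = [
    ["lnc1", "lnc2", "lnc3"],
    ["lnc1", "lnc2", "lnc4"],
    ["lnc1", "lnc3", "lnc5"],
]

# Keep features chosen by at least two of the three runs.
print(stable_features(example_runs, min_runs=2))  # ['lnc1', 'lnc2', 'lnc3']
```

Raising `min_runs` trades recall for stability: with `min_runs=3` only `lnc1` would survive in this toy example.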
Obeid, I. (Ed.) The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples, as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning”. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high quality annotations of breast tissue. It is well known that state of the art algorithms in machine learning require vast amounts of data. Fields such as speech recognition, image recognition and text processing are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not …
Abstract: Detection of prognostic factors associated with patients’ survival outcomes helps gain insights into a disease and guide treatment decisions. The rapid advancement of high-throughput technologies has yielded plentiful genomic biomarkers as candidate prognostic factors, but most are of limited use in clinical application. As the price of the technology drops over time, many genomic studies are conducted to explore a common scientific question in different cohorts to identify more reproducible and credible biomarkers. However, new challenges arise from heterogeneity in study populations and designs when jointly analyzing multiple studies. For example, patients from different cohorts show different demographic characteristics and risk profiles. Existing high-dimensional variable selection methods for survival analysis, however, are restricted to single-study analysis. We propose a novel Cox-model-based two-stage variable selection method called “Cox-TOTEM” to detect survival-associated biomarkers common to multiple genomic studies. Simulations showed that our method greatly improved the sensitivity of variable selection compared with separate applications of existing methods to each study, especially when the signals are weak or when the studies are heterogeneous. An application of our method to TCGA transcriptomic data identified essential survival-associated genes related to the common disease mechanism of five Pan-Gynecologic cancers.
Classification models using circulating neutrophil transcripts can detect unruptured intracranial aneurysm
Intracranial aneurysms (IAs) are dangerous because of their potential to rupture. We previously found significant RNA expression differences in circulating neutrophils between patients with and without unruptured IAs and trained machine learning models to predict the presence of IA using 40 neutrophil transcriptomes. Here, we aim to develop a predictive model for unruptured IA using neutrophil transcriptomes from a larger population and more robust machine learning methods.
Neutrophil RNA extracted from the blood of 134 patients (55 with IA, 79 IA-free controls) was subjected to next-generation RNA sequencing. In a randomly selected training cohort (n = 94), the Least Absolute Shrinkage and Selection Operator (LASSO) selected transcripts, from which we constructed prediction models via 4 well-established supervised machine-learning algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels). We tested the models in the remaining samples (n = 40) and assessed model performance using receiver-operating-characteristic (ROC) curves. Real-time quantitative polymerase chain reaction (RT-qPCR) of 9 IA-associated genes was used to verify gene expression in a subset of 49 neutrophil RNA samples. We also examined the potential influence of demographics and comorbidities on model prediction.
Feature selection using LASSO in the training cohort identified 37 IA-associated transcripts. Models trained using these transcripts …
We improved upon our previous IA prediction models based on circulating neutrophil transcriptomes by increasing sample size and by implementing LASSO and more robust machine learning methods. Future studies are needed to validate these models in larger cohorts and to further investigate the effect of covariates.
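The abstract above evaluates its classifiers on held-out samples with ROC curves. As a rough illustration of what that evaluation measures, the sketch below computes the area under the ROC curve from its rank-based definition: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the labels and scores are made up for illustration and are not the study's data or code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for positives (e.g. IA) and 0 for negatives (controls);
    `scores` holds the classifier's score for each sample, higher meaning
    more likely positive. Runs in O(n_pos * n_neg), fine for small cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up held-out labels and classifier scores.
y_true = [1, 1, 0, 0]
y_score = [0.9, 0.3, 0.4, 0.1]
print(roc_auc(y_true, y_score))  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.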
Finding the network biomarkers of cancers and analyzing the cancer driver genes involved in these biomarkers are essential for understanding the dynamics of cancer. Clusters of genes in co-expression networks are commonly known as functional units. This work is based on the hypothesis that the dense clusters or communities in the gene co-expression networks of cancer patients may represent functional units regarding cancer initiation and progression. In this study, RNA-seq gene expression data of three cancers - Breast Invasive Carcinoma (BRCA), Colorectal Adenocarcinoma (COAD) and Glioblastoma Multiforme (GBM) - from The Cancer Genome Atlas (TCGA) are used to construct gene co-expression networks using Pearson correlation. Six well-known community detection algorithms are applied to these networks to identify communities with five or more genes. A permutation test is then performed to find the communities that are conserved in other cancers, which we call conserved communities. Survival analysis is then performed on the clinical data of the three cancers using the conserved community genes as prognostic covariates. The communities that can separate cancer patients into high- and low-risk groups are considered cancer biomarkers. In the present study, 16 such network biomarkers are discovered.
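The network-construction step described above can be sketched roughly as follows. This is an illustration under several assumptions, not the authors' pipeline: the correlation cutoff, the use of absolute correlation, and all function and gene names are hypothetical, and plain connected components stand in for the six community detection algorithms, which the abstract does not name.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant vector: treat as uncorrelated
    return cov / math.sqrt(vx * vy)


def coexpression_edges(expr, threshold):
    """Link gene pairs whose absolute correlation reaches `threshold`.

    `expr` maps gene name -> list of expression values across samples.
    The threshold is an assumed parameter, not a value from the study.
    """
    genes = sorted(expr)
    return [
        (g, h)
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
        if abs(pearson(expr[g], expr[h])) >= threshold
    ]


def gene_groups(genes, edges, min_size):
    """Connected components with at least `min_size` genes.

    A simple stand-in for community detection, used only for illustration.
    """
    adj = {g: set() for g in genes}
    for g, h in edges:
        adj[g].add(h)
        adj[h].add(g)
    seen, groups = set(), []
    for start in sorted(genes):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adj[node] - group)
        seen |= group
        if len(group) >= min_size:
            groups.append(sorted(group))
    return groups


# Toy expression matrix: three co-varying genes and one unrelated gene.
toy_expr = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "g4": [1, -1, 1, -1],
}
toy_edges = coexpression_edges(toy_expr, threshold=0.9)
print(gene_groups(toy_expr, toy_edges, min_size=3))  # [['g1', 'g2', 'g3']]
```

The study keeps communities with five or more genes, which would correspond to `min_size=5` here; the toy example uses 3 only because it has four genes.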