
Title: Graph Theoretic and Pearson Correlation-Based Discovery of Network Biomarkers for Cancer
Two graph theoretic concepts, cliques and bipartite graphs, are explored to identify network biomarkers for cancer at the gene network level. The rationale is that a group of genes works together by forming a cluster or clique-like structure to initiate a cancer. After initiation, the disease signal passes to the next group of genes, related to the second stage of the cancer, which can be represented as a bipartite graph. In other words, bipartite graphs represent the cross-talk among the genes between two disease stages. To test this hypothesis, gene expression values for three cancers, breast invasive carcinoma (BRCA), colorectal adenocarcinoma (COAD), and glioblastoma multiforme (GBM), are analyzed. First, a co-expression gene network is generated from highly correlated gene pairs, those with a Pearson correlation coefficient ≥ 0.9. Second, clique structures of all sizes are isolated from the co-expression network. Then, by combining these cliques, three different biomarker modules are developed: maximal clique-like modules, 2-clique-1-bipartite modules, and 3-clique-2-bipartite modules. The biomarker genes discovered from these network modules are validated as essential genes for causing a cancer in terms of network properties and survival analysis. This list of biomarker genes will help biologists design wet lab experiments to further elucidate the complex mechanism of cancer.
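The first two steps of the abstract (thresholding Pearson correlations at 0.9, then isolating cliques) can be illustrated with a minimal sketch. The gene names and expression values below are hypothetical toy data, and the basic Bron-Kerbosch recursion is only one common way to enumerate maximal cliques; the record does not state which clique-finding algorithm the authors used. The bipartite module construction is not covered here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def coexpression_network(expression, threshold=0.9):
    """Adjacency sets linking gene pairs whose correlation is >= threshold."""
    adjacency = {gene: set() for gene in expression}
    for g1, g2 in combinations(expression, 2):
        if pearson(expression[g1], expression[g2]) >= threshold:
            adjacency[g1].add(g2)
            adjacency[g2].add(g1)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        # No gene can extend the clique and none was skipped: it is maximal.
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for gene in list(candidates):
            expand(current | {gene},
                   candidates & adjacency[gene],
                   excluded & adjacency[gene])
            candidates = candidates - {gene}
            excluded = excluded | {gene}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical expression profiles (one value per sample), not real data.
toy_expression = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [2, 4, 6, 8, 10],
    "G3": [1, 2, 3, 4, 6],
    "G4": [5, 1, 4, 2, 3],
    "G5": [5, 1, 4, 2, 3.5],
}

network = coexpression_network(toy_expression, threshold=0.9)
cliques = sorted(maximal_cliques(network))
print(cliques)
```

On this toy input the network splits into two tightly co-expressed groups, one of three genes and one of two. For networks of realistic size, an optimized library routine would be preferable to this plain recursion.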
Award ID(s):
1901628
NSF-PAR ID:
10141527
Journal Name:
Data
Volume:
4
Issue:
2
Page Range or eLocation-ID:
81
ISSN:
2306-5729
Sponsoring Org:
National Science Foundation
More Like this
  1. Alzheimer’s disease (AD) and Parkinson’s disease (PD) are the most common neurodegenerative disorders related to aging. Though several risk factors are shared between these two diseases, the exact relationship between them is still unknown. In this paper, we analyzed how these two diseases relate to each other from the genomic, epigenomic, and transcriptomic viewpoints. Using extensive literature mining, we first compiled a list of genes from major genome-wide association studies (GWAS). Based on these studies, we observed that only one gene (HLA-DRB5) was shared between AD and PD. A subsequent literature search identified a few other genes involved in these two diseases, among which SIRT1 seemed to be the most prominent. While we listed all the miRNAs that have been previously reported for AD and PD separately, we found only 15 different miRNAs that were reported in both diseases. To gain better insight, we predicted the gene co-expression network for both AD and PD using network analysis algorithms applied to two GEO datasets. The network analysis revealed six clusters of genes related to AD and four clusters of genes related to PD; however, there was very low functional similarity between these clusters, pointing to insignificant similarity between AD and PD even at the level of affected biological processes. Finally, we postulated the putative epigenetic regulator modules that are common to AD and PD.
  2. Abstract. Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a super-graph that is frequent. In bioinformatics, methodologies for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification. Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To overcome the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs. Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that frequent subnetworks are highly enriched with known biological ontologies.
  3. We consider how to generate graphs of arbitrary size whose chromatic numbers can be chosen (or are well-bounded) for testing graph coloring algorithms on parallel computers. For the distance-1 graph coloring problem, we identify three classes of graphs with this property. The first is the Erdős-Rényi random graph with prescribed expected degree, where the chromatic number is known with high probability. It is also known that the Greedy algorithm colors this graph using at most twice as many colors as the chromatic number. The second is a random geometric graph embedded in hyperbolic space, where the size of the maximum clique provides a tight lower bound on the chromatic number. The third is a deterministic graph described by Mycielski, where the graph is recursively constructed such that its chromatic number is known and increases with graph size, although the size of the maximum clique remains two. For Jacobian estimation, we bound the distance-2 chromatic number of random bipartite graphs by considering its equivalence to distance-1 coloring of an intersection graph. We use a “balls and bins” probabilistic analysis to establish a lower bound and an upper bound on the distance-2 chromatic number. The regimes of graph sizes and probabilities that we consider are chosen to suit the Jacobian estimation problem, where the numbers of columns and rows are asymptotically nearly equal and the number of nonzeros is linearly related to the number of columns. Computationally, we verify the theoretical predictions and show that the graphs can often be colored optimally by the serial and parallel Greedy algorithms.
  4. Abstract. Background: Drug sensitivity prediction and drug-responsive biomarker selection on high-throughput genomic data is a critical step in drug discovery. Many computational methods have been developed to serve this purpose, including several deep neural network models. However, the modular relations among genomic features have been largely ignored in these methods. To overcome this limitation, the role of the gene co-expression network in drug sensitivity prediction is investigated in this study. Methods: In this paper, we first introduce a network-based method to identify representative features for drug response prediction by using the gene co-expression network. Then, two graph-based neural network models are proposed, both of which integrate gene network information directly into the neural network for outcome prediction. Next, we present a large-scale comparative study among the proposed network-based methods, canonical prediction algorithms (i.e., Elastic Net, Random Forest, Partial Least Squares Regression, and Support Vector Regression), and deep neural network models for drug sensitivity prediction. All the source code and processed datasets in this study are available at https://github.com/compbiolabucf/drug-sensitivity-prediction. Results: In the comparison of different feature selection methods and prediction methods on a non-small cell lung cancer (NSCLC) cell line RNA-seq gene expression dataset with 50 different drug treatments, we found that (1) the network-based feature selection method improves the prediction performance compared to Pearson correlation coefficients; (2) Random Forest outperforms all the other canonical prediction algorithms and deep neural network models; (3) the proposed graph-based neural network models show better prediction performance than the deep neural network model; (4) the prediction performance is drug dependent and may relate to the drug’s mechanism of action. Conclusions: The network-based feature selection method and prediction models improve the performance of drug response prediction. The relations between the genomic features are more robust and stable than the correlation between each individual genomic feature and the drug response in high-dimension, low-sample-size genomic datasets.
  5. Kelso, Janet (Ed.)
    Abstract. Motivation: Genetic or epigenetic events can rewire molecular networks to induce extraordinary phenotypical divergences. Among the many network rewiring approaches, no model-free statistical methods can differentiate gene-gene pattern changes not attributed to marginal changes. This may obscure fundamental rewiring from superficial changes. Results: Here we introduce a model-free Sharma-Song test to determine if patterns differ in the second order, meaning that the deviation of the joint distribution from the product of marginal distributions is unequal across conditions. We prove an asymptotic chi-squared null distribution for the test statistic. Simulation studies demonstrate its advantage over alternative methods in detecting second-order differential patterns. Applying the test to three independent mammalian developmental transcriptome datasets, we report a lower frequency of co-expression network rewiring between human and mouse for the same tissue group than the frequency of rewiring between tissue groups within the same species. We also find second-order differential patterns between microRNA promoters and genes contrasting cerebellum and liver development in mice. These patterns are enriched in the spliceosome pathway regulating tissue specificity. Complementary to previous mammalian comparative studies mostly driven by first-order effects, our findings contribute an understanding of system-wide second-order gene network rewiring within and across mammalian systems. Second-order differential patterns constitute evidence for fundamentally rewired biological circuitry due to evolution, environment, or disease. Availability: The generic Sharma-Song test is available from the R package ‘DiffXTables’ at https://cran.r-project.org/package=DiffXTables. Other code and data are described in Methods. Supplementary information: Supplementary data are available at Bioinformatics online.