Title: Exploration into biomarker potential of region-specific brain gene co-expression networks
Abstract

The human brain is a complex organ that consists of several regions, each with a unique gene expression pattern. Our intent in this study was to construct a gene co-expression network (GCN) for the normal brain using RNA expression profiles from the Genotype-Tissue Expression (GTEx) project. The brain GCN contains gene correlation relationships that are broadly present in the brain or specific to thirteen brain regions, which we later combined into six overarching brain mini-GCNs based on the brain’s structure. Using the expression profiles of brain region-specific GCN edges, we determined how well the brain region samples could be discriminated from each other, visually with t-SNE plots or quantitatively with the Gene Oracle deep learning classifier. Next, we tested these gene sets for their relevance to human tumors of brain and non-brain origin. Interestingly, we found that genes in the six brain mini-GCNs showed markedly higher mutation rates in tumors relative to matched sets of random genes. Further, we found that cortex genes subdivided Head and Neck Squamous Cell Carcinoma (HNSC) tumors and Pheochromocytoma and Paraganglioma (PCPG) tumors into distinct groups. The brain GCN and mini-GCNs are useful resources for the classification of brain regions and identification of biomarker genes for brain-related phenotypes.
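
To make the idea of a GCN concrete, here is a minimal sketch of how edges can be derived from an expression matrix by thresholding pairwise correlation. It is an illustration only: the abstract does not state which correlation test or cutoff the study used, so Pearson correlation, the `threshold` value, and the function name `build_gcn` are all assumptions.

```python
import numpy as np

def build_gcn(expr, gene_names, threshold=0.8):
    """Return edges (gene_a, gene_b, r) whose |Pearson r| >= threshold.

    expr: 2-D array, rows = genes, columns = samples.
    The Pearson test and the 0.8 cutoff are illustrative assumptions,
    not the settings of the study described above.
    """
    corr = np.corrcoef(expr)                      # gene-by-gene correlation matrix
    rows, cols = np.triu_indices_from(corr, k=1)  # visit each unordered pair once
    keep = np.abs(corr[rows, cols]) >= threshold  # retain strongly correlated pairs
    return [(gene_names[i], gene_names[j], float(corr[i, j]))
            for i, j in zip(rows[keep], cols[keep])]

# Toy expression matrix: genes A and B rise together, gene C does not follow.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 1.0, 4.0, 2.0, 3.0],
])
edges = build_gcn(expr, ["A", "B", "C"])
```

On this toy matrix only the A-B pair passes the cutoff. A region-specific network could be sketched the same way by running the function on the samples of a single brain region.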

Award ID(s):
1725573 1659300
Publication Date:
NSF-PAR ID:
10198017
Journal Name:
Scientific Reports
Volume:
10
Issue:
1
ISSN:
2045-2322
Publisher:
Nature Publishing Group
Sponsoring Org:
National Science Foundation
More Like this
  1. Gene co-expression networks (GCNs) are constructed from Gene Expression Matrices (GEMs) in a bottom-up approach where all gene pairs are tested for correlation within the context of the input sample set. This approach is computationally intensive for many current GEMs and may not be scalable to millions of samples. Further, traditional GCNs do not detect non-linear relationships missed by correlation tests and do not place genetic relationships in a gene expression intensity context. In this report, we propose EdgeScaping, which constructs and analyzes the pairwise gene intensity network in a holistic, top-down approach where no edges are filtered. EdgeScaping uses a novel technique to convert traditional pairwise gene expression data to an image-based format. This conversion not only performs feature compression, making our algorithm highly scalable, but it also allows for exploring non-linear relationships between genes by leveraging deep learning image analysis algorithms. Using the learned embedded feature space we implement a fast, efficient algorithm to cluster the entire space of gene expression relationships while retaining gene expression intensity. Since EdgeScaping does not eliminate conventionally noisy edges, it extends the identification of co-expression relationships beyond classically correlated edges to facilitate the discovery of novel or unusual expression patterns within the network. We applied EdgeScaping to a human tumor GEM to identify sets of genes that exhibit conventional and non-conventional interdependent non-linear behavior associated with brain-specific tumor sub-types that would be eliminated in conventional bottom-up construction of GCNs. EdgeScaping source code is available at https://github.com/bhusain/EdgeScaping under the MIT license.
  2. INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. 
We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. 
CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat–Alu; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.
  3. Abstract Gene co-expression networks (GCNs) provide multiple benefits to molecular research including hypothesis generation and biomarker discovery. Transcriptome profiles serve as input for GCN construction and are derived from increasingly large studies with samples across multiple experimental conditions, treatments, time points, genotypes, etc. Such experiments with larger numbers of variables confound discovery of true network edges, exclude edges and inhibit discovery of context (or condition) specific network edges. To demonstrate this problem, a 475-sample dataset is used to show that up to 97% of GCN edges can be misleading because correlations are false or incorrect. False and incorrect correlations can occur when tests are applied without ensuring assumptions are met, and pairwise gene expression may not meet test assumptions if the expression of at least one gene in the pairwise comparison is a function of multiple confounding variables. The ‘one-size-fits-all’ approach to GCN construction is therefore problematic for large, multivariable datasets. Recently, the Knowledge Independent Network Construction toolkit has been used in multiple studies to provide a dynamic approach to GCN construction that ensures statistical tests meet assumptions and confounding variables are addressed. Additionally, it can associate experimental context for each edge of the network resulting in context-specific GCNs (csGCNs). To help researchers recognize such challenges in GCN construction, and the creation of csGCNs, we provide a review of the workflow.
  4. Goldman, Gustavo H. (Ed.)
    ABSTRACT Fungal secondary metabolites are widely used as therapeutics and are vital components of drug discovery programs. A major challenge hindering discovery of novel secondary metabolites is that the underlying pathways involved in their biosynthesis are transcriptionally silent under typical laboratory growth conditions, making it difficult to identify the transcriptional networks that they are embedded in. Furthermore, while the genes participating in secondary metabolic pathways are typically found in contiguous clusters on the genome, known as biosynthetic gene clusters (BGCs), this is not always the case, especially for global and pathway-specific regulators of pathways’ activities. To address these challenges, we used 283 genome-wide gene expression data sets of the ascomycete cell factory Aspergillus niger generated during growth under 155 different conditions to construct two gene coexpression networks based on Spearman’s correlation coefficients (SCCs) and on mutual rank-transformed Pearson’s correlation coefficients (MR-PCCs). By mining these networks, we predicted six transcription factors, named MjkA to MjkF, to regulate secondary metabolism in A. niger. Overexpression of each transcription factor using the Tet-On cassette modulated the production of multiple secondary metabolites. We found that the SCC and MR-PCC approaches complemented each other, enabling the delineation of putative global (SCC) and pathway-specific (MR-PCC) transcription factors. These results highlight the potential of coexpression network approaches to identify and activate fungal secondary metabolic pathways and their products. More broadly, we argue that drug discovery programs in fungi should move beyond the BGC paradigm and focus on understanding the global regulatory networks in which secondary metabolic pathways are embedded. IMPORTANCE There is an urgent need for novel bioactive molecules in both agriculture and medicine. 
The genomes of fungi are thought to contain vast numbers of metabolic pathways involved in the biosynthesis of secondary metabolites with diverse bioactivities. Because these metabolites are biosynthesized only under specific conditions, the vast majority of the fungal pharmacopeia awaits discovery. To discover the genetic networks that regulate the activity of secondary metabolites, we examined the genome-wide profiles of gene activity of the cell factory Aspergillus niger across hundreds of conditions. By constructing global networks that link genes with similar activities across conditions, we identified six putative global and pathway-specific regulators of secondary metabolite biosynthesis. Our study shows that elucidating the behavior of the genetic networks of fungi under diverse conditions harbors enormous promise for understanding fungal secondary metabolism, which ultimately may lead to novel drug candidates.
  5. Gojobori, Takashi (Ed.)
    Identifying the molecular underpinnings of the neural specializations that underlie human cognitive and behavioral traits has long been of considerable interest. Much research on human-specific changes in gene expression and epigenetic marks has focused on the prefrontal cortex, a brain structure distinguished by its role in executive functions. The cerebellum shows expansion in great apes and is gaining increasing attention for its role in motor skills and cognitive processing, including language. However, relatively few molecular studies of the cerebellum in a comparative evolutionary context have been conducted. Here, we identify human-specific methylation in the lateral cerebellum relative to the dorsolateral prefrontal cortex, in a comparative study with chimpanzees (Pan troglodytes) and rhesus macaques (Macaca mulatta). Specifically, we profiled genome-wide methylation levels in the three species for each of the two brain structures and identified human-specific differentially methylated genomic regions unique to each structure. We further identified which differentially methylated regions (DMRs) overlap likely regulatory elements and determined whether associated genes show corresponding species differences in gene expression. We found greater human-specific methylation in the cerebellum than in the dorsolateral prefrontal cortex, with differentially methylated regions overlapping genes involved in several conditions or processes relevant to human neurobiology, including synaptic plasticity, lipid metabolism, neuroinflammation and neurodegeneration, and neurodevelopment, including developmental disorders. Moreover, our results show some overlap with those of previous studies focused on the neocortex, indicating that such results may be common to multiple brain structures. These findings further our understanding of the cerebellum in human brain evolution.
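
Item 4 in the list above builds one of its networks from mutual rank-transformed Pearson correlations. A minimal sketch of that transformation follows, assuming the common definition of mutual rank as the geometric mean of the two directional ranks (gene B's rank in gene A's correlation list and gene A's rank in gene B's list). The function name `mutual_rank` and the toy data are hypothetical, and the cited study's exact procedure may differ.

```python
import numpy as np

def mutual_rank(expr):
    """Mutual rank matrix from Pearson correlations (rows = genes).

    MR(a, b) = sqrt(rank_a(b) * rank_b(a)), where rank_a(b) is the position
    of gene b when all other genes are sorted by descending correlation
    with gene a (1 = most correlated). This definition is an assumption.
    """
    corr = np.corrcoef(expr)
    np.fill_diagonal(corr, -np.inf)          # a gene never ranks itself first
    order = np.argsort(-corr, axis=1)        # partners by descending r, per gene
    n = corr.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    mr = np.sqrt(ranks * ranks.T)            # geometric mean of directional ranks
    np.fill_diagonal(mr, np.nan)             # self-pairs are undefined
    return mr

# Toy data: genes 0 and 1 are each other's top partner; gene 2 is an outlier.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 1.0, 4.0, 2.0, 3.0],
])
mr = mutual_rank(expr)
```

Lower mutual rank means a stronger reciprocal association. Because ranks are computed per gene, the measure is less sensitive than a raw correlation cutoff to genes that correlate moderately with everything.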