Title: Dr.Nod: computational framework for discovery of regulatory non-coding drivers in tissue-matched distal regulatory elements
Abstract
The discovery of cancer driver mutations is a fundamental goal in cancer research. While many cancer driver mutations have been discovered in the protein-coding genome, research into potential cancer drivers in the non-coding regions has shown limited success so far. Here, we present a novel comprehensive framework, Dr.Nod, for detection of non-coding cis-regulatory candidate driver mutations that are associated with dysregulated gene expression using tissue-matched enhancer-gene annotations. Applying the framework to data from over 1500 tumours across eight tissues revealed a 4.4-fold enrichment of candidate driver mutations in regulatory regions of known cancer driver genes. An overarching conclusion that emerges is that the non-coding driver mutations contribute to cancer by significantly altering transcription factor binding sites, leading to upregulation of tissue-matched oncogenes and down-regulation of tumour-suppressor genes. Interestingly, more than half of the detected cancer-promoting non-coding regulatory driver mutations are over 20 kb distant from the cancer-associated genes they regulate. Our results show the importance of tissue-matched enhancer-gene maps, the functional impact of mutations, and a complex background mutagenesis model for the prediction of non-coding regulatory drivers. In conclusion, our study demonstrates that non-coding mutations in enhancers play a previously underappreciated role in cancer and dysregulation of clinically relevant target genes.
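The two quantities quoted in the abstract above, a fold enrichment and a distance cut-off of 20 kb, can be illustrated with a minimal sketch. This is not the Dr.Nod implementation: the function names, the choice of a simple rate ratio, and all counts and coordinates below are assumptions made purely for illustration.

```python
def fold_enrichment(hits_in_set, total_in_set, hits_in_background, total_in_background):
    """Ratio of the hit rate inside a gene set to the hit rate in a background set.

    Hypothetical helper: "hits" would be genes with a candidate driver mutation
    in their regulatory regions, "set" the known cancer driver genes.
    """
    if total_in_set <= 0 or total_in_background <= 0:
        raise ValueError("totals must be positive")
    if hits_in_background == 0:
        raise ValueError("background hit rate is zero; enrichment is undefined")
    rate_in_set = hits_in_set / total_in_set
    rate_in_background = hits_in_background / total_in_background
    return rate_in_set / rate_in_background


def is_distal(mutation_pos, tss_pos, threshold_bp=20_000):
    """True if a mutation lies more than threshold_bp from the gene's start site.

    Assumes both positions are on the same chromosome and in the same coordinates.
    """
    return abs(mutation_pos - tss_pos) > threshold_bp


# Made-up counts, not taken from the paper.
print(fold_enrichment(hits_in_set=12, total_in_set=300,
                      hits_in_background=40, total_in_background=4000))
print(is_distal(mutation_pos=150_000, tss_pos=100_000))
```

A real analysis would also need a significance test and a background model of mutagenesis, which the abstract names as important but does not specify.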
Synonymous mutations, which change the DNA sequence but not the encoded protein sequence, can affect protein structure and function, mRNA maturation, and mRNA half-lives. The possibility that synonymous mutations might be enriched in cancer has been explored in several recent studies. However, none of these studies control for all three types of mutational heterogeneity (patient, histology, and gene) that are known to affect the accurate identification of non-synonymous cancer-associated genes. Our goal is to adopt the current standard for non-synonymous mutations in an investigation of synonymous mutations.
Results
Here, we create an algorithm, MutSigCVsyn, an adaptation of MutSigCV, to identify cancer-associated genes that are enriched for synonymous mutations based on a non-coding background model that takes into account the mutational heterogeneity across these levels. Using MutSigCVsyn, we first analyzed 2572 cancer whole-genome samples from the Pan-cancer Analysis of Whole Genomes (PCAWG) to identify non-synonymous cancer drivers as a quality control. Indicative of the algorithm's accuracy, we find that 58.6% of these candidate genes were also found in the Cancer Gene Census (CGC) list, and 66.2% were found within the PCAWG cancer driver list. We then applied it to identify 30 putative cancer-associated genes that are enriched for synonymous mutations within the same samples. One of the promising gene candidates is the B cell lymphoma 2 (BCL-2) gene. BCL-2 regulates apoptosis by antagonizing the action of proapoptotic BCL-2 family member proteins. The synonymous mutations in BCL2 are enriched in its anti-apoptotic domain and likely play a role in cancer cell proliferation.
Conclusion
Our study introduces MutSigCVsyn, an algorithm that accounts for mutational heterogeneity at patient, histology, and gene levels, to identify cancer-associated genes that are enriched for synonymous mutations using whole genome sequencing data. We identified 30 putative candidate genes that will benefit from future experimental studies on the role of synonymous mutations in cancer biology.
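The general idea of testing a gene for an excess of synonymous mutations over a background rate can be sketched as a one-sided binomial test. This is a deliberately simplified illustration and not the MutSigCVsyn model: it uses a single background rate per gene and ignores the patient-level and histology-level heterogeneity that the abstract emphasises. The function names and every number in the example are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if n < 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n >= 0 and 0 <= p <= 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    if p == 0.0:
        return 0.0  # no mutations can occur, so X >= k (with k >= 1) is impossible
    if p == 1.0:
        return 1.0  # X equals n with certainty, and n >= k here
    log_n_factorial = lgamma(n + 1)
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (log_n_factorial - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log1p(-p))
        total += exp(log_pmf)
    return min(total, 1.0)


def synonymous_enrichment_pvalue(observed, n_sites, n_samples, background_rate):
    """One-sided p-value for seeing at least `observed` synonymous mutations.

    Assumes every synonymous site in every sample mutates independently at
    `background_rate` (a strong simplification).
    """
    opportunities = n_sites * n_samples
    return binomial_upper_tail(observed, opportunities, background_rate)


# Made-up inputs: 8 mutations over 200 sites in 100 samples, background 1e-4.
print(synonymous_enrichment_pvalue(8, 200, 100, 1e-4))
```

In practice the background rate would be estimated per gene and per patient from covariates and non-coding flanks, and the resulting p-values would be corrected for multiple testing across genes.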
Kaplow, Irene M.; Lawler, Alyssa J.; Schäffer, Daniel E.; Srinivasan, Chaitanya; Sestili, Heather H.; Wirthlin, Morgan E.; Phan, BaDoi N.; Prasad, Kavya; Brown, Ashley R.; Zhang, Xiaomeng; et al. (Science)
INTRODUCTION
Diverse phenotypes, including large brains relative to body size, group living, and vocal learning ability, have evolved multiple times throughout mammalian history. These shared phenotypes may have arisen repeatedly by means of common mechanisms discernible through genome comparisons.
RATIONALE
Protein-coding sequence differences have failed to fully explain the evolution of multiple mammalian phenotypes. This suggests that these phenotypes have evolved at least in part through changes in gene expression, meaning that their differences across species may be caused by differences in genome sequence at enhancer regions that control gene expression in specific tissues and cell types. Yet the enhancers involved in phenotype evolution are largely unknown. Sequence conservation–based approaches for identifying such enhancers are limited because enhancer activity can be conserved even when the individual nucleotides within the sequence are poorly conserved. This is due to an overwhelming number of cases where nucleotides turn over at a high rate, but a similar combination of transcription factor binding sites and other sequence features can be maintained across millions of years of evolution, allowing the function of the enhancer to be conserved in a particular cell type or tissue. Experimentally measuring the function of orthologous enhancers across dozens of species is currently infeasible, but new machine learning methods make it possible to make reliable sequence-based predictions of enhancer function across species in specific tissues and cell types.
RESULTS
To overcome the limits of studying individual nucleotides, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT). Rather than measuring the extent to which individual nucleotides are conserved across a region, TACIT uses machine learning to test whether the function of a given part of the genome is likely to be conserved. More specifically, convolutional neural networks learn the tissue- or cell type–specific regulatory code connecting genome sequence to enhancer activity using candidate enhancers identified from only a few species. This approach allows us to accurately associate differences between species in tissue or cell type–specific enhancer activity with genome sequence differences at enhancer orthologs. We then connect these predictions of enhancer function to phenotypes across hundreds of mammals in a way that accounts for species’ phylogenetic relatedness. We applied TACIT to identify candidate enhancers from motor cortex and parvalbumin neuron open chromatin data that are associated with brain size relative to body size, solitary living, and vocal learning across 222 mammals. Our results include the identification of multiple candidate enhancers associated with brain size relative to body size, several of which are located in linear or three-dimensional proximity to genes whose protein-coding mutations have been implicated in microcephaly or macrocephaly in humans. We also identified candidate enhancers associated with the evolution of solitary living near a gene implicated in separation anxiety and other enhancers associated with the evolution of vocal learning ability. We obtained distinct results for bulk motor cortex and parvalbumin neurons, demonstrating the value in applying TACIT to both bulk tissue and specific minority cell type populations. To facilitate future analyses of our results and applications of TACIT, we released predicted enhancer activity of >400,000 candidate enhancers in each of 222 mammals and their associations with the phenotypes we investigated.
CONCLUSION
TACIT leverages predicted enhancer activity conservation rather than nucleotide-level conservation to connect genetic sequence differences between species to phenotypes across large numbers of mammals. TACIT can be applied to any phenotype with enhancer activity data available from at least a few species in a relevant tissue or cell type and a whole-genome alignment available across dozens of species with substantial phenotypic variation. Although we developed TACIT for transcriptional enhancers, it could also be applied to genomic regions involved in other components of gene regulation, such as promoters and splicing enhancers and silencers. As the number of sequenced genomes grows, machine learning approaches such as TACIT have the potential to help make sense of how conservation of, or changes in, subtle genome patterns can help explain phenotype evolution.
Tissue-Aware Conservation Inference Toolkit (TACIT) associates genetic differences between species with phenotypes. TACIT works by generating open chromatin data from a few species in a tissue related to a phenotype, using the sequences underlying open and closed chromatin regions to train a machine learning model for predicting tissue-specific open chromatin and associating open chromatin predictions across dozens of mammals with the phenotype. [Species silhouettes are from PhyloPic]
Selvam, Kathiresan; Sivapragasam, Smitha; Poon, Gregory M.; Wyrick, John J. (Nature Communications)
Sequencing of melanomas has identified hundreds of recurrent mutations in both coding and non-coding DNA. These include a number of well-characterized oncogenic driver mutations, such as coding mutations in the BRAF and NRAS oncogenes, and non-coding mutations in the promoter of telomerase reverse transcriptase (TERT). However, the molecular etiology and significance of most of these mutations is unknown. Here, we use a new method known as CPD-capture-seq to map UV-induced cyclobutane pyrimidine dimers (CPDs) with high sequencing depth and single nucleotide resolution at sites of recurrent mutations in melanoma. Our data reveal that many previously identified drivers and other recurrent mutations in melanoma occur at CPD hotspots in UV-irradiated melanocytes, often associated with an overlapping binding site of an E26 transformation-specific (ETS) transcription factor. In contrast, recurrent mutations in the promoters of a number of known or suspected cancer genes are not associated with elevated CPD levels. Our data indicate that a subset of recurrent protein-coding mutations are also likely caused by ETS-induced CPD hotspots. This analysis indicates that ETS proteins profoundly shape the mutation landscape of melanoma and reveals a method for distinguishing potential driver mutations from passenger mutations whose recurrence is due to elevated UV damage.
Changes in cis-regulatory elements play important roles in adaptation and phenotypic evolution. However, their contribution to metabolic adaptation of organisms is less understood. Here we have utilized a unique vertebrate model, Astyanax mexicanus, different morphotypes of which survive in nutrient-rich surface and nutrient-deprived cave water to uncover gene regulatory networks in metabolic adaptation. We performed genome-wide epigenetic profiling in the liver tissue of one surface and two independently derived cave populations. We find that many cis-regulatory elements differ in their epigenetic status/chromatin accessibility between surface fish and cavefish, while the two independently derived cave populations have evolved remarkably similar regulatory signatures. These differentially accessible regions are associated with genes of key pathways related to lipid metabolism, circadian rhythm and immune system that are known to be altered in cavefish. Using in vitro and in vivo functional testing of the candidate cis-regulatory elements, we find that genetic changes within them cause quantitative expression differences. We characterized one cis-regulatory element in the hpdb gene and found a genomic deletion in cavefish that abolishes binding of the transcriptional repressor IRF2 in vitro and derepresses enhancer activity in reporter assays. Genetic experiments further validated a cis-mediated role of the enhancer and suggest a role of this deletion in the upregulation of hpdb in wild cavefish populations. Selection of this mutation in multiple independent cave populations supports its importance in the adaptation to the cave environment, providing novel molecular insights into the evolutionary trade-off between loss of pigmentation and adaptation to a food-deprived cave environment.
Holm, Inge; Nardini, Luisa; Pain, Adrien; Bischoff, Emmanuel; Anderson, Cameron E.; Zongo, Soumanaba; Guelbeogo, Wamdaogo M.; Sagnon, N’Fale; Gohl, Daryl M.; Nowling, Ronald J.; et al. (Frontiers in Genetics)
Almost all regulation of gene expression in eukaryotic genomes is mediated by the action of distant non-coding transcriptional enhancers upon proximal gene promoters. Enhancer locations cannot be accurately predicted bioinformatically because of the absence of a defined sequence code, and thus functional assays are required for their direct detection. Here we used a massively parallel reporter assay, Self-Transcribing Active Regulatory Region sequencing (STARR-seq), to generate the first comprehensive genome-wide map of enhancers in Anopheles coluzzii, a major African malaria vector in the Gambiae species complex. The screen was carried out by transfecting reporter libraries created from the genomic DNA of 60 wild A. coluzzii from Burkina Faso into A. coluzzii 4a3A cells, in order to functionally query enhancer activity of the natural population within the homologous cellular context. We report a catalog of 3,288 active genomic enhancers that were significant across three biological replicates, 74% of them located in intergenic and intronic regions. The STARR-seq enhancer screen is chromatin-free and thus detects inherent activity of a comprehensive catalog of enhancers that may be restricted in vivo to specific cell types or developmental stages. Testing of a validation panel of enhancer candidates using manual luciferase assays confirmed enhancer function in 26 of 28 (93%) of the candidates over a wide dynamic range of activity from two to at least 16-fold activity above baseline. The enhancers occupy only 0.7% of the genome, and display distinct composition features. The enhancer compartment is significantly enriched for 15 transcription factor binding site signatures, and displays divergence for specific dinucleotide repeats, as compared to matched non-enhancer genomic controls. The genome-wide catalog of A. coluzzii enhancers is publicly available in a simple searchable graphic format.
This enhancer catalogue will be valuable in linking genetic and phenotypic variation, in identifying regulatory elements that could be employed in vector manipulation, and in better targeting of chromosome editing to minimize extraneous regulation influences on the introduced sequences. Importance: Understanding the role of the non-coding regulatory genome in complex disease phenotypes is essential, but even in well-characterized model organisms, identification of regulatory regions within the vast non-coding genome remains a challenge. We used a large-scale assay to generate a genome-wide map of transcriptional enhancers. Such a catalogue for the important malaria vector, Anopheles coluzzii, will be an important research tool as the role of non-coding regulatory variation in differential susceptibility to malaria infection is explored and as a public resource for research on this important insect vector of disease.
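In self-transcribing reporter screens of this kind, a fragment's enhancer activity is generally read out as the abundance of its transcripts relative to its abundance in the input library. The sketch below shows that ratio on a log2 scale with library-size normalisation. It is an assumption-laden illustration, not the scoring used in the study above: the function name, the counts-per-million normalisation, the pseudocount, and all read counts are hypothetical.

```python
from math import log2


def starr_log2_activity(rna_count, dna_count, rna_total, dna_total, pseudocount=1.0):
    """log2 ratio of normalised RNA reads to normalised input DNA reads.

    rna_total and dna_total are the library sizes used for counts-per-million
    scaling; the pseudocount avoids division by zero for unobserved fragments.
    """
    if rna_total <= 0 or dna_total <= 0:
        raise ValueError("library sizes must be positive")
    rna_cpm = (rna_count + pseudocount) / rna_total * 1e6
    dna_cpm = (dna_count + pseudocount) / dna_total * 1e6
    return log2(rna_cpm / dna_cpm)


# Made-up read counts for a single candidate fragment.
print(starr_log2_activity(rna_count=15, dna_count=3,
                          rna_total=1_000_000, dna_total=1_000_000))
```

A positive value means the fragment is over-represented among transcripts relative to the input; calling an enhancer active would additionally require a statistical test across replicates, as the abstract notes for its three biological replicates.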
Tomkova, Marketa, Tomek, Jakub, Chow, Julie, McPherson, John D., Segal, David J., and Hormozdiari, Fereydoun. Dr.Nod: computational framework for discovery of regulatory non-coding drivers in tissue-matched distal regulatory elements. Nucleic Acids Research 51.4 Web. doi:10.1093/nar/gkac1251.
Tomkova, Marketa, Tomek, Jakub, Chow, Julie, McPherson, John D., Segal, David J., and Hormozdiari, Fereydoun. "Dr.Nod: computational framework for discovery of regulatory non-coding drivers in tissue-matched distal regulatory elements". Nucleic Acids Research 51 (4). Oxford University Press. https://doi.org/10.1093/nar/gkac1251. https://par.nsf.gov/biblio/10390548.
@article{osti_10390548,
place = {Country unknown/Code not available},
title = {Dr.Nod: computational framework for discovery of regulatory non-coding drivers in tissue-matched distal regulatory elements},
url = {https://par.nsf.gov/biblio/10390548},
DOI = {10.1093/nar/gkac1251},
journal = {Nucleic Acids Research},
volume = {51},
number = {4},
publisher = {Oxford University Press},
author = {Tomkova, Marketa and Tomek, Jakub and Chow, Julie and McPherson, John D. and Segal, David J. and Hormozdiari, Fereydoun},
}