Title: Learning genotype–phenotype associations from gaps in multi-species sequence alignments
Abstract Understanding the genetic basis of phenotypic variation is fundamental to biology. Here we introduce GAP, a novel machine learning framework for predicting binary phenotypes from gaps in multi-species sequence alignments. GAP employs a neural network to predict the presence or absence of phenotypes solely from alignment gaps, contrasting with existing tools that require additional and often inaccessible input data. GAP can be applied to three distinct problems: predicting phenotypes in species from known associated genomic regions, pinpointing positions within such regions that are important for predicting phenotypes, and extracting sets of candidate regions associated with phenotypes. We showcase the utility of GAP by exploiting the well-known association between the L-gulonolactone oxidase (Gulo) gene and vitamin C synthesis, demonstrating its perfect prediction accuracy in 34 vertebrates. This exceptional performance also applies more generally, with GAP achieving high accuracy and power on a large simulated dataset. Moreover, predictions of vitamin C synthesis in species with unknown status mirror their phylogenetic relationships, and positions with high predictive importance are consistent with those identified by previous studies. Last, a genome-wide application of GAP identifies many additional genes that may be associated with vitamin C synthesis, and analysis of these candidates uncovers functional enrichment for immunity, a widely recognized role of vitamin C. Hence, GAP represents a simple yet useful tool for predicting genotype–phenotype associations and addressing diverse evolutionary questions from data available in a broad range of study systems.
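The abstract states that GAP predicts phenotypes solely from alignment gaps. As a rough illustration of what such an input could look like, the sketch below turns a toy multi-species alignment into a species-by-position 0/1 matrix, with 1 marking a gap. The function name `encode_gaps`, the `-` gap character, and the toy sequences are assumptions made for this example; they are not taken from GAP itself, and the actual encoding and network architecture used by the authors may differ.

```python
def encode_gaps(alignment, gap_char="-"):
    """Encode an alignment as a species-by-position 0/1 matrix.

    `alignment` maps a species name to its aligned sequence (hypothetical
    input format). A cell is 1 where the sequence has a gap, 0 otherwise.
    """
    species = sorted(alignment)
    lengths = {len(alignment[name]) for name in species}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    matrix = [
        [1 if ch == gap_char else 0 for ch in alignment[name]]
        for name in species
    ]
    return species, matrix


# Toy alignment (made up for illustration, not real Gulo data).
toy_alignment = {
    "species_a": "ATG-CA",
    "species_b": "ATGGCA",
    "species_c": "A---CA",
}

species, gap_matrix = encode_gaps(toy_alignment)
for name, row in zip(species, gap_matrix):
    print(name, row)
```

Each row of the resulting matrix could then serve as the feature vector for one species, paired with a binary phenotype label, when training a classifier.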
Piya, Antara Anika; DeGiorgio, Michael; Assis, Raquel
(Genome Biology and Evolution)
Yi, Soojin
(Ed.)
Abstract Predicting gene expression divergence is integral to understanding the emergence of new biological functions and associated traits. Whereas several sophisticated methods have been developed for this task, their applications are either limited to duplicate genes or require expression data from more than two species. Thus, here we present PredIcting eXpression dIvergence (PiXi), the first machine learning framework for predicting gene expression divergence between single-copy orthologs in two species. PiXi models gene expression evolution as an Ornstein-Uhlenbeck process, and overlays this model with multi-layer neural network (NN), random forest, and support vector machine architectures for making predictions. It outputs the predicted class “conserved” or “diverged” for each pair of orthologs, as well as their predicted expression optima in the two species. We show that PiXi has high power and accuracy in predicting gene expression divergence between single-copy orthologs, as well as high accuracy and precision in estimating their expression optima in the two species, across a wide range of evolutionary scenarios, with the globally best performance achieved by a multi-layer NN. Moreover, application of our best-performing PiXi predictor to empirical gene expression data from single-copy orthologs residing at different loci in two species of Drosophila reveals that approximately 23% underwent expression divergence after positional relocation. Further analysis shows that several of these “diverged” genes are involved in the electron transport chain of the mitochondrial membrane, suggesting that new chromatin environments may impact energy production in Drosophila. Thus, by providing a toolkit for predicting gene expression divergence between single-copy orthologs in two species, PiXi can shed light on the origins of novel phenotypes across diverse biological processes and study systems.
Abstract Xylella fastidiosa is a bacterium that infects crops like grapevines, coffee, almonds, citrus and olives. There is little understanding of the genes that contribute to plant resistance, the genomic architecture of resistance, and the potential role of climate in shaping resistance, in part because major crops like grapevines (Vitis vinifera) are not resistant to the bacterium. Here we study a wild grapevine species, V. arizonica, that segregates for resistance. Using genome-wide association, we identify candidate resistance genes. Resistance-associated kmers are shared with a sister species of V. arizonica but not with more distant species, suggesting that resistance evolved more than once. Finally, resistance is climate dependent, because individuals from low (< 10 °C) temperature locations in the wettest quarter were typically susceptible to infection, likely reflecting a lack of pathogen pressure in colder climates. In fact, climate is as effective a predictor of resistance phenotypes as some genetic markers. We extend our climate observations to additional crops, predicting that increased pathogen pressure is more likely for grapevines and almonds than some other susceptible crops.
Suzuki, Masaharu; Wu, Shan; Mimura, Manaki; Alseekh, Saleh; Fernie, Alisdair R.; Hanson, Andrew D.; McCarty, Donald R.
(The Plant Journal)
Summary The B vitamins provide essential co‐factors for central metabolism in all organisms. In plants, B vitamins have surprising emerging roles in development, stress tolerance and pathogen resistance. Hence, there is a paramount interest in understanding the regulation of vitamin biosynthesis as well as the consequences of vitamin deficiency in crop species. To facilitate genetic analysis of B vitamin biosynthesis and functions in maize, we have mined the UniformMu transposon resource to identify insertional mutations in vitamin pathway genes. A screen of 190 insertion lines for seed and seedling phenotypes identified mutations in biotin, pyridoxine and niacin biosynthetic pathways. Importantly, isolation of independent insertion alleles enabled genetic confirmation of genotype‐to‐phenotype associations. Because B vitamins are essential for survival, null mutations often have embryo lethal phenotypes that prevent elucidation of subtle, but physiologically important, metabolic consequences of sub‐optimal (functional) vitamin status. To circumvent this barrier, we demonstrate a strategy for refined genetic manipulation of vitamin status based on construction of heterozygotes that combine strong and hypomorphic mutant alleles. Dosage analysis of pdx2 alleles in endosperm revealed that endosperm supplies pyridoxine to the developing embryo. Similarly, a hypomorphic bio1 allele enabled analysis of transcriptome and metabolome responses to incipient biotin deficiency in seedling leaves. We show that systemic pipecolic acid accumulation is an early metabolic response to sub‐optimal biotin status, highlighting an intriguing connection between biotin, lysine metabolism and systemic disease resistance signaling. Seed‐stocks carrying insertions for vitamin pathway genes are available for free, public distribution via the Maize Genetics Cooperation Stock Center.
Morris, J.A.; Lickey, B.S.; Liptak, M.D.
(Vitamins and Hormones)
Litwack, G.
(Ed.)
Vitamin B12 is one of the most complex cofactors known, and this chapter will discuss current understanding with regard to the cobalt insertion step of its syntheses. Two total syntheses of vitamin B12 were reported in the 1970s, which remain two of the most exceptional achievements of natural product synthesis. In subsequent years, two distinct biosynthetic pathways were identified in aerobic and anaerobic organisms. For these biosynthetic pathways, selectivity for Co(II) over other divalent metal ions with similar ionic radii and coordination chemistry remains an open question with three competing hypotheses proposed: metal affinity, tetrapyrrole distortion, and product inhibition. A 20-step biosynthetic route to convert 5-aminolevulinic acid (ALA) to vitamin B12 was elucidated in aerobic organisms in the 1990s, where cobalt is inserted relatively late in the pathway by the CobNST multi-protein complex. This chapter includes a mechanistic proposal for this reaction, but the majority of the proposal is based upon analogy to the ChlDHI magnesium chelatase complex, as critical data for the cobalt chelatase are lacking. Later, in the 2010s, a distinct 21-step pathway from ALA to vitamin B12 was reported in anaerobic organisms, where cobalt is inserted early in the pathway by the enzyme CbiK. A recent study strongly suggests that the cobalt affinity of CbiK is the origin of cobalt selectivity for CbiK, but several important mechanistic questions remain unanswered. In general, it is expected that significant insight into the cobalt insertion mechanisms of CobNST and CbiK could be derived from additional structural, spectroscopic, and computational data.
Kaplow, Irene M.; Lawler, Alyssa J.; Schäffer, Daniel E.; Srinivasan, Chaitanya; Sestili, Heather H.; Wirthlin, Morgan E.; Phan, BaDoi N.; Prasad, Kavya; Brown, Ashley R.; Zhang, Xiaomeng; et al
(Science)
INTRODUCTION Diverse phenotypes, including large brains relative to body size, group living, and vocal learning ability, have evolved multiple times throughout mammalian history. These shared phenotypes may have arisen repeatedly by means of common mechanisms discernible through genome comparisons. RATIONALE Protein-coding sequence differences have failed to fully explain the evolution of multiple mammalian phenotypes. This suggests that these phenotypes have evolved at least in part through changes in gene expression, meaning that their differences across species may be caused by differences in genome sequence at enhancer regions that control gene expression in specific tissues and cell types. Yet the enhancers involved in phenotype evolution are largely unknown. Sequence conservation–based approaches for identifying such enhancers are limited because enhancer activity can be conserved even when the individual nucleotides within the sequence are poorly conserved. This is due to an overwhelming number of cases where nucleotides turn over at a high rate, but a similar combination of transcription factor binding sites and other sequence features can be maintained across millions of years of evolution, allowing the function of the enhancer to be conserved in a particular cell type or tissue. Experimentally measuring the function of orthologous enhancers across dozens of species is currently infeasible, but new machine learning methods make it possible to make reliable sequence-based predictions of enhancer function across species in specific tissues and cell types. RESULTS To overcome the limits of studying individual nucleotides, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT). Rather than measuring the extent to which individual nucleotides are conserved across a region, TACIT uses machine learning to test whether the function of a given part of the genome is likely to be conserved. 
More specifically, convolutional neural networks learn the tissue- or cell type–specific regulatory code connecting genome sequence to enhancer activity using candidate enhancers identified from only a few species. This approach allows us to accurately associate differences between species in tissue or cell type–specific enhancer activity with genome sequence differences at enhancer orthologs. We then connect these predictions of enhancer function to phenotypes across hundreds of mammals in a way that accounts for species’ phylogenetic relatedness. We applied TACIT to identify candidate enhancers from motor cortex and parvalbumin neuron open chromatin data that are associated with brain size relative to body size, solitary living, and vocal learning across 222 mammals. Our results include the identification of multiple candidate enhancers associated with brain size relative to body size, several of which are located in linear or three-dimensional proximity to genes whose protein-coding mutations have been implicated in microcephaly or macrocephaly in humans. We also identified candidate enhancers associated with the evolution of solitary living near a gene implicated in separation anxiety and other enhancers associated with the evolution of vocal learning ability. We obtained distinct results for bulk motor cortex and parvalbumin neurons, demonstrating the value in applying TACIT to both bulk tissue and specific minority cell type populations. To facilitate future analyses of our results and applications of TACIT, we released predicted enhancer activity of >400,000 candidate enhancers in each of 222 mammals and their associations with the phenotypes we investigated. CONCLUSION TACIT leverages predicted enhancer activity conservation rather than nucleotide-level conservation to connect genetic sequence differences between species to phenotypes across large numbers of mammals. 
TACIT can be applied to any phenotype with enhancer activity data available from at least a few species in a relevant tissue or cell type and a whole-genome alignment available across dozens of species with substantial phenotypic variation. Although we developed TACIT for transcriptional enhancers, it could also be applied to genomic regions involved in other components of gene regulation, such as promoters and splicing enhancers and silencers. As the number of sequenced genomes grows, machine learning approaches such as TACIT have the potential to help make sense of how conservation of, or changes in, subtle genome patterns can help explain phenotype evolution. Tissue-Aware Conservation Inference Toolkit (TACIT) associates genetic differences between species with phenotypes. TACIT works by generating open chromatin data from a few species in a tissue related to a phenotype, using the sequences underlying open and closed chromatin regions to train a machine learning model for predicting tissue-specific open chromatin and associating open chromatin predictions across dozens of mammals with the phenotype. [Species silhouettes are from PhyloPic]
Islam, Uwaise Ibna, Campelo dos Santos, Andre Luiz, Kanjilal, Ria, and Assis, Raquel.
"Learning genotype–phenotype associations from gaps in multi-species sequence alignments". Briefings in Bioinformatics 26 (1). Oxford University Press. https://doi.org/10.1093/bib/bbaf022. https://par.nsf.gov/biblio/10572754.
@article{osti_10572754,
place = {Country unknown/Code not available},
title = {Learning genotype–phenotype associations from gaps in multi-species sequence alignments},
url = {https://par.nsf.gov/biblio/10572754},
DOI = {10.1093/bib/bbaf022},
abstractNote = {Abstract Understanding the genetic basis of phenotypic variation is fundamental to biology. Here we introduce GAP, a novel machine learning framework for predicting binary phenotypes from gaps in multi-species sequence alignments. GAP employs a neural network to predict the presence or absence of phenotypes solely from alignment gaps, contrasting with existing tools that require additional and often inaccessible input data. GAP can be applied to three distinct problems: predicting phenotypes in species from known associated genomic regions, pinpointing positions within such regions that are important for predicting phenotypes, and extracting sets of candidate regions associated with phenotypes. We showcase the utility of GAP by exploiting the well-known association between the L-gulonolactone oxidase (Gulo) gene and vitamin C synthesis, demonstrating its perfect prediction accuracy in 34 vertebrates. This exceptional performance also applies more generally, with GAP achieving high accuracy and power on a large simulated dataset. Moreover, predictions of vitamin C synthesis in species with unknown status mirror their phylogenetic relationships, and positions with high predictive importance are consistent with those identified by previous studies. Last, a genome-wide application of GAP identifies many additional genes that may be associated with vitamin C synthesis, and analysis of these candidates uncovers functional enrichment for immunity, a widely recognized role of vitamin C. Hence, GAP represents a simple yet useful tool for predicting genotype–phenotype associations and addressing diverse evolutionary questions from data available in a broad range of study systems.},
journal = {Briefings in Bioinformatics},
volume = {26},
number = {1},
publisher = {Oxford University Press},
author = {Islam, Uwaise Ibna and Campelo dos Santos, Andre Luiz and Kanjilal, Ria and Assis, Raquel},
}