Title: Learning genotype–phenotype associations from gaps in multi-species sequence alignments
Abstract Understanding the genetic basis of phenotypic variation is fundamental to biology. Here we introduce GAP, a novel machine learning framework for predicting binary phenotypes from gaps in multi-species sequence alignments. GAP employs a neural network to predict the presence or absence of phenotypes solely from alignment gaps, contrasting with existing tools that require additional and often inaccessible input data. GAP can be applied to three distinct problems: predicting phenotypes in species from known associated genomic regions, pinpointing positions within such regions that are important for predicting phenotypes, and extracting sets of candidate regions associated with phenotypes. We showcase the utility of GAP by exploiting the well-known association between the L-gulonolactone oxidase (Gulo) gene and vitamin C synthesis, demonstrating its perfect prediction accuracy in 34 vertebrates. This exceptional performance also applies more generally, with GAP achieving high accuracy and power on a large simulated dataset. Moreover, predictions of vitamin C synthesis in species with unknown status mirror their phylogenetic relationships, and positions with high predictive importance are consistent with those identified by previous studies. Last, a genome-wide application of GAP identifies many additional genes that may be associated with vitamin C synthesis, and analysis of these candidates uncovers functional enrichment for immunity, a widely recognized role of vitamin C. Hence, GAP represents a simple yet useful tool for predicting genotype–phenotype associations and addressing diverse evolutionary questions from data available in a broad range of study systems.
Award ID(s):
2130666
PAR ID:
10572754
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Briefings in Bioinformatics
Volume:
26
Issue:
1
ISSN:
1467-5463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Yi, Soojin (Ed.)
    Abstract Predicting gene expression divergence is integral to understanding the emergence of new biological functions and associated traits. Whereas several sophisticated methods have been developed for this task, their applications are either limited to duplicate genes or require expression data from more than two species. Thus, here we present PredIcting eXpression dIvergence (PiXi), the first machine learning framework for predicting gene expression divergence between single-copy orthologs in two species. PiXi models gene expression evolution as an Ornstein-Uhlenbeck process, and overlays this model with multi-layer neural network (NN), random forest, and support vector machine architectures for making predictions. It outputs the predicted class “conserved” or “diverged” for each pair of orthologs, as well as their predicted expression optima in the two species. We show that PiXi has high power and accuracy in predicting gene expression divergence between single-copy orthologs, as well as high accuracy and precision in estimating their expression optima in the two species, across a wide range of evolutionary scenarios, with the globally best performance achieved by a multi-layer NN. Moreover, application of our best-performing PiXi predictor to empirical gene expression data from single-copy orthologs residing at different loci in two species of Drosophila reveals that approximately 23% underwent expression divergence after positional relocation. Further analysis shows that several of these “diverged” genes are involved in the electron transport chain of the mitochondrial membrane, suggesting that new chromatin environments may impact energy production in Drosophila. Thus, by providing a toolkit for predicting gene expression divergence between single-copy orthologs in two species, PiXi can shed light on the origins of novel phenotypes across diverse biological processes and study systems. 
  2. Abstract Xylella fastidiosa is a bacterium that infects crops like grapevines, coffee, almonds, citrus and olives. There is little understanding of the genes that contribute to plant resistance, the genomic architecture of resistance, and the potential role of climate in shaping resistance, in part because major crops like grapevines (Vitis vinifera) are not resistant to the bacterium. Here we study a wild grapevine species, V. arizonica, that segregates for resistance. Using genome-wide association, we identify candidate resistance genes. Resistance-associated kmers are shared with a sister species of V. arizonica but not with more distant species, suggesting that resistance evolved more than once. Finally, resistance is climate dependent, because individuals from low (< 10 °C) temperature locations in the wettest quarter were typically susceptible to infection, likely reflecting a lack of pathogen pressure in colder climates. In fact, climate is as effective a predictor of resistance phenotypes as some genetic markers. We extend our climate observations to additional crops, predicting that increased pathogen pressure is more likely for grapevines and almonds than some other susceptible crops.
  3. Summary The B vitamins provide essential co‐factors for central metabolism in all organisms. In plants, B vitamins have surprising emerging roles in development, stress tolerance and pathogen resistance. Hence, there is a paramount interest in understanding the regulation of vitamin biosynthesis as well as the consequences of vitamin deficiency in crop species. To facilitate genetic analysis of B vitamin biosynthesis and functions in maize, we have mined the UniformMu transposon resource to identify insertional mutations in vitamin pathway genes. A screen of 190 insertion lines for seed and seedling phenotypes identified mutations in biotin, pyridoxine and niacin biosynthetic pathways. Importantly, isolation of independent insertion alleles enabled genetic confirmation of genotype‐to‐phenotype associations. Because B vitamins are essential for survival, null mutations often have embryo lethal phenotypes that prevent elucidation of subtle, but physiologically important, metabolic consequences of sub‐optimal (functional) vitamin status. To circumvent this barrier, we demonstrate a strategy for refined genetic manipulation of vitamin status based on construction of heterozygotes that combine strong and hypomorphic mutant alleles. Dosage analysis of pdx2 alleles in endosperm revealed that endosperm supplies pyridoxine to the developing embryo. Similarly, a hypomorphic bio1 allele enabled analysis of transcriptome and metabolome responses to incipient biotin deficiency in seedling leaves. We show that systemic pipecolic acid accumulation is an early metabolic response to sub‐optimal biotin status highlighting an intriguing connection between biotin, lysine metabolism and systemic disease resistance signaling. Seed‐stocks carrying insertions for vitamin pathway genes are available for free, public distribution via the Maize Genetics Cooperation Stock Center.
  4. Litwack, G. (Ed.)
    Vitamin B12 is one of the most complex cofactors known, and this chapter will discuss current understanding with regard to the cobalt insertion step of its syntheses. Two total syntheses of vitamin B12 were reported in the 1970s, which remain two of the most exceptional achievements of natural product synthesis. In subsequent years, two distinct biosynthetic pathways were identified in aerobic and anaerobic organisms. For these biosynthetic pathways, selectivity for Co(II) over other divalent metal ions with similar ionic radii and coordination chemistry remains an open question with three competing hypotheses proposed: metal affinity, tetrapyrrole distortion, and product inhibition. A 20-step biosynthetic route to convert 5-aminolevulinic acid (ALA) to vitamin B12 was elucidated in aerobic organisms in the 1990s, where cobalt is inserted relatively late in the pathway by the CobNST multi-protein complex. This chapter includes a mechanistic proposal for this reaction, but the majority of the proposal is based upon analogy to the ChlDHI magnesium chelatase complex as critical data for the cobalt chelatase is lacking. Later, in the 2010s, a distinct 21-step pathway from ALA to vitamin B12 was reported in anaerobic organisms, where cobalt is inserted early in the pathway by the enzyme CbiK. A recent study strongly suggests that the cobalt affinity of CbiK is the origin of cobalt selectivity for CbiK, but several important mechanistic questions remain unanswered. In general, it is expected that significant insight into the cobalt insertion mechanisms of CobNST and CbiK could be derived from additional structural, spectroscopic, and computational data.
  5. Lengauer, Thomas (Ed.)
    Abstract Motivation: Microbial signatures in the human microbiome are closely associated with various human diseases, driving the development of machine learning models for microbiome-based disease prediction. Despite progress, challenges remain in enhancing prediction accuracy, generalizability, and interpretability. Confounding factors, such as host’s gender, age, and body mass index, significantly influence the human microbiome, complicating microbiome-based predictions. Results: To address these challenges, we developed MicroKPNN-MT, a unified model for predicting human phenotype based on microbiome data, as well as additional metadata like age and gender. This model builds on our earlier MicroKPNN framework, which incorporates prior knowledge of microbial species into neural networks to enhance prediction accuracy and interpretability. In MicroKPNN-MT, metadata, when available, serves as additional input features for prediction. Otherwise, the model predicts metadata from microbiome data using additional decoders. We applied MicroKPNN-MT to microbiome data collected in mBodyMap, covering healthy individuals and 25 different diseases, and demonstrated its potential as a predictive tool for multiple diseases, which at the same time provided predictions for the missing metadata. Our results showed that incorporating real or predicted metadata helped improve the accuracy of disease predictions, and more importantly, helped improve the generalizability of the predictive models. Availability and implementation: https://github.com/mgtools/MicroKPNN-MT.