Title: Predictive models of genetic redundancy in Arabidopsis thaliana
Abstract: Genetic redundancy refers to a situation where an individual with a loss-of-function mutation in one gene (single mutant) does not show an apparent phenotype until one or more paralogs are also knocked out (double/higher-order mutant). Previous studies have identified some characteristics common among redundant gene pairs, but a predictive model of genetic redundancy incorporating a wide variety of features derived from accumulating omics and mutant phenotype data is yet to be established. In addition, the relative importance of these features for genetic redundancy remains largely unclear. Here, we establish machine learning models for predicting whether a gene pair is likely redundant or not in the model plant Arabidopsis thaliana based on six feature categories: functional annotations, evolutionary conservation including duplication patterns and mechanisms, epigenetic marks, protein properties including post-translational modifications, gene expression, and gene network properties. The definition of redundancy, data transformations, feature subsets, and machine learning algorithms used significantly affected model performance as measured on held-out test phenotype data. Among the most important features in predicting gene pairs as redundant were having a paralog(s) from recent duplication events, annotation as a transcription factor, downregulation during stress conditions, and having similar expression patterns under stress conditions. We also explored the potential reasons underlying mispredictions and the limitations of our study. This genetic redundancy model sheds light on characteristics that may contribute to long-term maintenance of paralogs, and will ultimately allow for more targeted generation of functionally informative double mutants, advancing functional genomic studies.
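The abstract names similar expression patterns under stress and transcription factor annotation among the most important features. As a rough illustration only, and not the authors' pipeline, the sketch below shows how such pair-level features might be derived for one gene pair. The function names, the feature names, and the toy expression values are all hypothetical, and Pearson correlation is assumed here as one possible similarity measure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant vector has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def pair_features(expr_a, expr_b, is_tf_a, is_tf_b):
    """Build a small, hypothetical feature dictionary for one gene pair."""
    return {
        "stress_expr_similarity": pearson(expr_a, expr_b),
        "both_tf": is_tf_a and is_tf_b,
    }


# Toy log fold-change values across four stress conditions (made up).
gene_a = [-1.2, -0.8, 0.1, -1.5]
gene_b = [-1.0, -0.9, 0.3, -1.4]

features = pair_features(gene_a, gene_b, True, True)
print(features)
```

A dictionary like this, computed for every gene pair, could then be passed to any standard classifier; which algorithm and which feature transformations work best is exactly what the abstract reports as affecting performance.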
Award ID(s):
1655630 1655386
NSF-PAR ID:
10226049
Editor(s):
de Meaux, Juliette
Journal Name:
Molecular Biology and Evolution
ISSN:
0737-4038
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Gene deletion is traditionally thought of as a nonadaptive process that removes functional redundancy from genomes, such that it generally receives less attention than duplication in evolutionary turnover studies. Yet, mounting evidence suggests that deletion may promote adaptation via the “less-is-more” evolutionary hypothesis, as it often targets genes harboring unique sequences, expression profiles, and molecular functions. Hence, predicting the relative prevalence of redundant and unique functions among genes targeted by deletion, as well as the parameters underlying their evolution, can shed light on the role of gene deletion in adaptation.

    Results

    Here, we present CLOUDe, a suite of machine learning methods for predicting evolutionary targets of gene deletion events from expression data. Specifically, CLOUDe models expression evolution as an Ornstein–Uhlenbeck process, and uses multi-layer neural network, extreme gradient boosting, random forest, and support vector machine architectures to predict whether deleted genes are “redundant” or “unique”, as well as several parameters underlying their evolution. We show that CLOUDe boasts high power and accuracy in differentiating between classes, and high accuracy and precision in estimating evolutionary parameters, with optimal performance achieved by its neural network architecture. Application of CLOUDe to empirical data from Drosophila suggests that deletion primarily targets genes with unique functions, with further analysis showing these functions to be enriched for protein deubiquitination. Thus, CLOUDe represents a key advance in learning about the role of gene deletion in functional evolution and adaptation.

    Availability and implementation

    CLOUDe is freely available on GitHub (https://github.com/anddssan/CLOUDe).

     
  2. Rogers, Rebekah (Ed.)
    Abstract Whole-genome duplications (WGDs) have shaped the gene repertoire of many eukaryotic lineages. The redundancy created by WGDs typically results in a phase of massive gene loss. However, some WGD-derived paralogs are maintained over long evolutionary periods, and the relative contributions of different selective pressures to their maintenance are still debated. Previous studies have revealed a history of three successive WGDs in the lineage of the ciliate Paramecium tetraurelia and two of its sister species from the Paramecium aurelia complex. Here, we report the genome sequence and analysis of 10 additional P. aurelia species and 1 additional outgroup, revealing aspects of post-WGD evolution in 13 species sharing a common ancestral WGD. Contrary to the morphological radiation of vertebrates that putatively followed two WGD events, members of the cryptic P. aurelia complex have remained morphologically indistinguishable after hundreds of millions of years. Biases in gene retention compatible with dosage constraints appear to play a major role opposing post-WGD gene loss across all 13 species. In addition, post-WGD gene loss has been slower in Paramecium than in other species having experienced genome duplication, suggesting that the selective pressures against post-WGD gene loss are especially strong in Paramecium. A near complete lack of recent single-gene duplications in Paramecium provides additional evidence for strong selective pressures against gene dosage changes. This exceptional data set of 13 species sharing an ancestral WGD and 2 closely related outgroup species will be a useful resource for future studies on Paramecium as a major model organism in evolutionary cell biology.
  3. Synopsis

    Gene duplicates, or paralogs, serve as a major source of new genetic material and comprise seeds for evolutionary innovation. While originally thought to be quickly lost or nonfunctionalized following duplication, now a vast number of paralogs are known to be retained in a functional state. Daughter paralogs can provide robustness through redundancy, specialize via sub-functionalization, or neo-functionalize to play new roles. Indeed, the duplication and divergence of developmental genes have played a monumental role in the evolution of animal forms (e.g., Hox genes). Still, despite their prevalence and evolutionary importance, the precise detection of gene duplicates in newly sequenced genomes remains technically challenging and often overlooked. This presents an especially pertinent problem for evolutionary developmental biology, where hypothesis testing requires accurate detection of changes in gene expression and function, often in nontraditional model species. Frequently, these analyses rely on molecular reagents designed within coding sequences that may be highly similar in recently duplicated paralogs, leading to cross-reactivity and spurious results. Thus, care is needed to avoid erroneously assigning diverged functions of paralogs to a single gene, and potentially misinterpreting evolutionary history. This perspective aims to overview the prevalence and importance of paralogs and to shed light on the difficulty of their detection and analysis while offering potential solutions.

     
  4. Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.

     
  5. Forche, Anja (Ed.)

    The Candida albicans genome contains between ten and fifteen distinct TLO genes that all encode a Med2 subunit of Mediator. In order to investigate the biological role of Med2/Tlo in C. albicans we deleted all fourteen TLO genes using CRISPR-Cas9 mutagenesis. ChIP-seq analysis showed that RNAP II localized to 55% fewer genes in the tloΔ mutant strain compared to the parent, while RNA-seq analysis showed that the tloΔ mutant exhibited differential expression of genes required for carbohydrate metabolism, stress responses, white-opaque switching and filamentous growth. Consequently, the tloΔ mutant grows poorly in glucose- and galactose-containing media, is unable to grow as true hyphae, is more sensitive to oxidative stress and is less virulent in the wax worm infection model. Reintegration of genes representative of the α-, β- and γ-TLO clades resulted in the complementation of the mutant phenotypes, but to different degrees. TLOα1 could restore phenotypes and gene expression patterns similar to wild-type and was the strongest activator of glycolytic and Tye7-regulated gene expression. In contrast, the two γ-TLO genes examined (i.e., TLOγ5 and TLOγ11) had a far lower impact on complementing phenotypic and transcriptomic changes. Uniquely, expression of TLOβ2 in the tloΔ mutant stimulated filamentous growth in YEPD medium and this phenotype was enhanced when Tloβ2 expression was increased to levels far in excess of Med3. In contrast, expression of reintegrated TLO genes in a tloΔ/med3Δ double mutant background failed to restore any of the phenotypes tested, suggesting that complementation of these Tlo-regulated processes requires a functional Mediator tail module. Together, these data confirm the importance of Med2/Tlo in a wide range of C. albicans cellular activities and demonstrate functional diversity within the gene family which may contribute to the success of this yeast as a coloniser and pathogen of humans.

     