Title: OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees
Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species (a phenomenon observed among several important families of genes such as transporters and transcription factors) are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, a software tool that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.
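The splitting and pruning idea described in the abstract can be illustrated with a toy sketch. This is not OrthoSNAP's implementation: the nested-tuple tree encoding, the `species|gene` leaf labels, the function names, and the `min_species` threshold are all assumptions made for illustration, and the tree is assumed to be already rooted. The sketch first collapses clades made up of a single species (species-specific inparalogs) to one representative, then walks down the tree and keeps the largest subtrees in which every species occurs exactly once.

```python
# Toy illustration only; names and data layout are hypothetical.
# A tree is either a leaf string "species|gene" or a tuple of child trees.

def species(leaf):
    """Species part of a 'species|gene' leaf label."""
    return leaf.split("|")[0]

def leaves(tree):
    """All leaf labels under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out

def prune_inparalogs(tree):
    """Collapse any clade whose leaves all share one species to a single leaf."""
    if isinstance(tree, str):
        return tree
    lv = leaves(tree)
    if len({species(l) for l in lv}) == 1:
        # Arbitrary but deterministic choice of representative (assumption);
        # a real tool would need a principled criterion here.
        return min(lv)
    return tuple(prune_inparalogs(child) for child in tree)

def split_single_copy(tree, min_species=2):
    """Return the largest subtrees in which each species appears exactly once."""
    found = []

    def visit(node):
        lv = leaves(node)
        sp = [species(l) for l in lv]
        if len(sp) == len(set(sp)):          # single copy for every species
            if len(sp) >= min_species:
                found.append(sorted(lv))
            return
        for child in node:                   # otherwise split and look deeper
            visit(child)

    visit(prune_inparalogs(tree))
    return found

# A family with an ancient duplication (two A/B/C subtrees) and one
# species-specific duplication in species A (a1, a2).
family = (
    ((("A|a1", "A|a2"), "B|b1"), "C|c1"),
    (("A|a3", "B|b2"), "C|c2"),
)
groups = split_single_copy(family)
print(groups)
```

In this example the family as a whole is not single copy, but after collapsing the `a1`/`a2` pair each of the two subtrees contains one gene per species, so both are returned as separate groups.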
Award ID(s):
2110404
PAR ID:
10424964
Author(s) / Creator(s):
; ; ; ; ;
Editor(s):
Hejnol, Andreas
Date Published:
Journal Name:
PLOS Biology
Volume:
20
Issue:
10
ISSN:
1545-7885
Page Range / eLocation ID:
e3001827
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Comparative genomics and molecular phylogenetics are foundational for understanding biological evolution. Although many studies have been made with the aim of understanding the genomic contents of early life, uncertainty remains. A study by Weiss et al. (Weiss MC, Sousa FL, Mrnjavac N, Neukirchen S, Roettger M, Nelson-Sathi S, Martin WF. 2016. The physiology and habitat of the last universal common ancestor. Nat Microbiol. 1(9):16116.) identified a number of protein families in the last universal common ancestor of archaea and bacteria (LUCA) which were not found in previous works. Here, we report new research that suggests the clustering approaches used in this previous study undersampled protein families, resulting in incomplete phylogenetic trees which do not reflect protein family evolution. Phylogenetic analysis of protein families which include more sequence homologs rejects a simple LUCA hypothesis based on phylogenetic separation of the bacterial and archaeal domains for a majority of the previously identified LUCA proteins (∼82%). To supplement limitations of phylogenetic inference derived from incompletely populated orthologous groups and to test the hypothesis of a period of rapid evolution preceding the separation of the domains, we compared phylogenetic distances both within and between domains, for thousands of orthologous groups. We find a substantial diversity of interdomain versus intradomain branch lengths, even among protein families which exhibit a single domain separating branch and are thought to be associated with the LUCA. Additionally, phylogenetic trees with long interdomain branches relative to intradomain branches are enriched in information categories of protein families in comparison to those associated with metabolic functions. These results provide a new view of protein family evolution and temper claims about the phenotype and habitat of the LUCA. 
  2. Kolodny, Rachel (Ed.)
    Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs) where each MHG is a maximal set of maximum-length sequences which are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We zoomed in on the correlation between each individual marker and their overlapping MHGs, and show that few phylogenetic splits supported by the markers are supported by the MHGs while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers causes incorrect inferences as to the overall relationship between bacterial taxa.
  3. Abstract In the last decade and a half, advances in genetic sequencing technologies have revolutionized systematics, transforming the field from studying morphological characters or a few genetic markers, to genomic datasets in the phylogenomic era. A plethora of molecular phylogenetic studies on many taxonomic groups have come about, converging on, or refuting prevailing morphology or legacy‐marker‐based hypotheses about evolutionary affinities. Spider systematics has been no exception to this transformation and the inter‐relationships of several groups have now been studied using genomic data. About 51 500 extant spider species have been described, all with a conservative body plan, but innumerable morphological and behavioural peculiarities. Inferring the spider tree of life using morphological data has been a challenging task. Molecular data have corroborated many hypotheses of higher‐level relationships, but also resulted in new groups that refute previous hypotheses. In this review, we discuss recent advances in the reconstruction of the spider tree of life and highlight areas where additional effort is needed with potential solutions. We base this review on the most comprehensive spider phylogeny to date, representing 131 of the 132 spider families. To achieve this sampling, we combined six Sanger‐based markers with newly generated and publicly available genome‐scale datasets. We find that some inferred relationships between major lineages of spiders (such as Austrochiloidea, Palpimanoidea and Synspermiata) are robust across different classes of data. However, several new hypotheses have emerged with different classes of molecular data. We identify and discuss the robust and controversial hypotheses and compile this blueprint to design future studies targeting systematic revisions of these problematic groups. We offer an evolutionary framework to explore comparative questions such as evolution of venoms, silk, webs, morphological traits and reproductive strategies.
  4. ABSTRACT Psocodea (booklice and parasitic lice) is an order of insects containing species with extensive mitochondrial genome rearrangements, particularly within the suborder Troctomorpha, in which some species possess an extremely fragmented mitochondrial genome with several small minichromosomes. In the remaining suborders of Psocodea, there are groups with the ancestral pancrustacean arrangement, quite extensive rearrangements (e.g. Trogiomorpha), or in which the small number of species analysed to date have rearrangements of only a few protein‐coding genes and/or tRNAs (e.g. Psocomorpha). Despite the apparent high rate of rearrangements in the order as a whole, a small number of complete mitochondrial genomes are available, especially for suborder Psocomorpha, the largest free‐living suborder. To understand the evolution of the gene arrangement of the mitochondrial genome within Psocomorpha and its phylogenetic implications, we assembled and analysed the mitochondrial genomes of 33 species of bark lice belonging to nine families in two infraorders. Within the infraorder Homilopsocidea, four families were analysed, mainly from Lachesillidae (which included 22 species of this family). Within the infraorder Caeciliusetae, seven species representing five families were analysed. Mitochondrial gene rearrangements were identified in seven of the nine families. Some of these rearrangements were unique to a single species, while some contained phylogenetic signal, being shared by related species. These rearrangements typically corresponded to transpositions and inversions of tRNAs, possibly caused by tandem duplication–random loss (TDRL) and/or recombination events. Phylogenetic analyses of mitochondrial gene sequences provided phylogenetic resolution for several branches of the tree, including monophyly of Lachesillinae. The genus Hemicaecilius Enderlein was found to be embedded within the genus Lachesilla Westwood, rendering the latter paraphyletic. Monophyly was also never recovered for Lachesillidae and Elipsocidae as currently defined. However, instability was observed for some higher level relationships within Psocomorpha, including the relationships among the major clades of Lachesillidae.
  5. Ruane, Sara (Ed.)
    Abstract Genome-scale data have the potential to clarify phylogenetic relationships across the tree of life but have also revealed extensive gene tree conflict. This seeming paradox, whereby larger data sets both increase statistical confidence and uncover significant discordance, suggests that understanding sources of conflict is important for accurate reconstruction of evolutionary history. We explore this paradox in squamate reptiles, the vertebrate clade comprising lizards, snakes, and amphisbaenians. We collected an average of 5103 loci for 91 species of squamates that span higher-level diversity within the clade, which we augmented with publicly available sequences for an additional 17 taxa. Using a locus-by-locus approach, we evaluated support for alternative topologies at 17 contentious nodes in the phylogeny. We identified shared properties of conflicting loci, finding that rate and compositional heterogeneity drives discordance between gene trees and species tree and that conflicting loci rarely overlap across contentious nodes. Finally, by comparing our tests of nodal conflict to previous phylogenomic studies, we confidently resolve 9 of the 17 problematic nodes. We suggest this locus-by-locus and node-by-node approach can build consensus on which topological resolutions remain uncertain in phylogenomic studies of other contentious groups. [Anchored hybrid enrichment (AHE); gene tree conflict; molecular evolution; phylogenomic concordance; target capture; ultraconserved elements (UCE).] 