Title: OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees
Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species (a phenomenon observed among several important families of genes such as transporters and transcription factors) are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, software that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We refer to SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.
Award ID(s):
2110404
NSF-PAR ID:
10424964
Editor(s):
Hejnol, Andreas
Journal Name:
PLOS Biology
Volume:
20
Issue:
10
ISSN:
1545-7885
Page Range / eLocation ID:
e3001827
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    In the age of next-generation sequencing, the number of loci available for phylogenetic analyses has increased by orders of magnitude. But despite this dramatic increase in the amount of data, some phylogenomic studies have revealed rampant gene-tree discordance that can be caused by many historical processes, such as rapid diversification, gene duplication, or reticulate evolution. We used a target enrichment approach to sample 400 single-copy nuclear genes and estimate the phylogenetic relationships of 13 genera in the lichen-forming family Lobariaceae to address the effect of data type (nucleotides and amino acids) and phylogenetic reconstruction method (concatenation and species tree approaches). Furthermore, we examined datasets for evidence of historical processes, such as rapid diversification and reticulate evolution. We found incongruence associated with sequence data types (nucleotide vs. amino acid sequences) and with different methods of phylogenetic reconstruction (species tree vs. concatenation). The resulting phylogenetic trees provided evidence for rapid and reticulate evolution based on extremely short branches in the backbone of the phylogenies. The observed rapid and reticulate diversifications may explain conflicts among gene trees and the challenges to resolving evolutionary relationships. Based on divergence times, the diversification at the backbone occurred near the Cretaceous-Paleogene (K-Pg) boundary (65 Mya) which is consistent with other rapid diversifications in the tree of life. Although some phylogenetic relationships within the Lobariaceae family remain with low support, even with our powerful phylogenomic dataset of up to 376 genes, our use of target-capturing data allowed for the novel exploration of the mechanisms underlying phylogenetic and systematic incongruence.

     
  2. Abstract

    In the last decade and a half, advances in genetic sequencing technologies have revolutionized systematics, transforming the field from studying morphological characters or a few genetic markers, to genomic datasets in the phylogenomic era. A plethora of molecular phylogenetic studies on many taxonomic groups have come about, converging on, or refuting prevailing morphology or legacy‐marker‐based hypotheses about evolutionary affinities. Spider systematics has been no exception to this transformation and the inter‐relationships of several groups have now been studied using genomic data. About 51 500 extant spider species have been described, all with a conservative body plan, but innumerable morphological and behavioural peculiarities. Inferring the spider tree of life using morphological data has been a challenging task. Molecular data have corroborated many hypotheses of higher‐level relationships, but also resulted in new groups that refute previous hypotheses. In this review, we discuss recent advances in the reconstruction of the spider tree of life and highlight areas where additional effort is needed with potential solutions. We base this review on the most comprehensive spider phylogeny to date, representing 131 of the 132 spider families. To achieve this sampling, we combined six Sanger‐based markers with newly generated and publicly available genome‐scale datasets. We find that some inferred relationships between major lineages of spiders (such as Austrochiloidea, Palpimanoidea and Synspermiata) are robust across different classes of data. However, several new hypotheses have emerged with different classes of molecular data. We identify and discuss the robust and controversial hypotheses and compile this blueprint to design future studies targeting systematic revisions of these problematic groups. 
We offer an evolutionary framework to explore comparative questions such as evolution of venoms, silk, webs, morphological traits and reproductive strategies.

     
  3. ABSTRACT

    Psocodea (booklice and parasitic lice) is an order of insects containing species with extensive mitochondrial genome rearrangements, particularly within the suborder Troctomorpha, in which some species possess an extremely fragmented mitochondrial genome with several small minichromosomes. In the remaining suborders of Psocodea, there are groups with the ancestral pancrustacean arrangement, quite extensive rearrangements (e.g. Trogiomorpha), or in which the small number of species analysed to date have rearrangements of only a few protein‐coding genes and/or tRNAs (e.g. Psocomorpha). Despite the apparent high rate of rearrangements in the order as a whole, only a small number of complete mitochondrial genomes are available, especially for suborder Psocomorpha, the largest free‐living suborder. To understand the evolution of the gene arrangement of the mitochondrial genome within Psocomorpha and its phylogenetic implications, we assembled and analysed the mitochondrial genomes of 33 species of bark lice belonging to nine families in two infraorders. Within the infraorder Homilopsocidea, four families were analysed, mainly from Lachesillidae (which included 22 species of this family). Within the infraorder Caeciliusetae, seven species representing five families were analysed. Mitochondrial gene rearrangements were identified in seven of the nine families. Some of these rearrangements were unique to a single species, while some contained phylogenetic signal, being shared by related species. These rearrangements typically corresponded to transpositions and inversions of tRNAs, possibly caused by tandem duplication–random loss (TDRL) and/or recombination events. Phylogenetic analyses of mitochondrial gene sequences provided phylogenetic resolution for several branches of the tree, including monophyly of Lachesillinae. The genus Hemicaecilius Enderlein was found to be embedded within the genus Lachesilla Westwood, rendering the latter paraphyletic.
Monophyly was also never recovered for Lachesillidae and Elipsocidae as currently defined. However, instability was observed for some higher level relationships within Psocomorpha, including the relationships among the major clades of Lachesillidae.

     
  4. Kolodny, Rachel (Ed.)
    Phylogenomic studies of prokaryotic taxa often assume conserved marker genes are homologous across their length. However, processes such as horizontal gene transfer or gene duplication and loss may disrupt this homology by recombining only parts of genes, causing gene fission or fusion. We show using simulation that it is necessary to delineate homology groups in a set of bacterial genomes without relying on gene annotations to define the boundaries of homologous regions. To solve this problem, we have developed a graph-based algorithm to partition a set of bacterial genomes into Maximal Homologous Groups of sequences (MHGs), where each MHG is a maximal set of maximum-length sequences that are homologous across the entire sequence alignment. We applied our algorithm to a dataset of 19 Enterobacteriaceae species and found that MHGs cover much greater proportions of genomes than markers and, relatedly, are less biased in terms of the functions of the genes they cover. We examined the correlation between each individual marker and its overlapping MHGs and show that few phylogenetic splits supported by the markers are supported by the MHGs, while many marker-supported splits are contradicted by the MHGs. A comparison of the species tree inferred from marker genes with the species tree inferred from MHGs suggests that the increased bias and lack of genome coverage by markers cause incorrect inferences as to the overall relationship between bacterial taxa.
  5. Abstract

    Target enrichment (such as Hyb-Seq) is a well-established high throughput sequencing method that has been increasingly used for phylogenomic studies. Unfortunately, current widely used pipelines for analysis of target enrichment data do not have a vigorous procedure to remove paralogs in target enrichment data. In this study, we develop a pipeline we call Putative Paralogs Detection (PPD) to better address putative paralogs from enrichment data. The new pipeline is an add-on to the existing HybPiper pipeline, and the entire pipeline applies criteria in both sequence similarity and heterozygous sites at each locus in the identification of paralogs. Users may adjust the thresholds of sequence identity and heterozygous sites to identify and remove paralogs according to the level of phylogenetic divergence of their group of interest. The new pipeline also removes highly polymorphic sites attributed to errors in sequence assembly and gappy regions in the alignment. We demonstrated the value of the new pipeline using empirical data generated from Hyb-Seq and the Angiosperm 353 kit for two woody genera Castanea (Fagaceae, Fagales) and Hamamelis (Hamamelidaceae, Saxifragales). Comparisons of datasets showed that the PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed evident differences between data from HybPiper and data from our new PPD pipeline. We further evaluated the accuracy and error rates of PPD by BLAST mapping of putative paralogous and orthologous sequences to a reference genome sequence of Castanea mollissima. Compared to HybPiper alone, PPD identified substantially more paralogous gene sequences that mapped to multiple regions of the reference genome (31 genes for PPD compared with 4 genes for HybPiper alone).
In conjunction with HybPiper, paralogous genes identified by both pipelines can be removed resulting in the construction of more robust orthologous gene datasets for phylogenomic and divergence time analyses. Our study demonstrates the value of Hyb-Seq with data derived from the Angiosperm 353 probe set for elucidating species relationships within a genus, and argues for the importance of additional steps to filter paralogous genes and poorly aligned regions (e.g., as occur through assembly errors), such as our new PPD pipeline described in this study. 