
Title: Impact of homologous recombination on core genome phylogenies
Abstract

Background

Core genome phylogenies are widely used to reconstruct the evolutionary history of individual prokaryote species. By using hundreds or thousands of shared genes, these approaches are the gold standard for reconstructing the relationships among large sets of strains. However, there is growing evidence that bacterial strains exchange DNA through homologous recombination at rates that vary widely across prokaryote species, indicating that core genome phylogenies might not recover the true phylogeny when recombination rates are high. Few attempts have been made to evaluate the robustness of core genome phylogenies to recombination, but some analyses suggest that reconstructed trees are not always accurate.


In this study, we tested the robustness of core genome phylogenies to a range of recombination rates. By analyzing simulated and empirical data, we observed that core genome phylogenies are relatively robust to recombination; nevertheless, our results suggest that many reconstructed trees are not completely accurate even when bootstrap supports are high. We found that some core genome phylogenies are highly robust to recombination whereas others are strongly impacted by it, and that this robustness is strongly linked to the level of selective pressure acting on a species. Stronger selective pressures lead to less accurate tree reconstructions, presumably because selective pressures more strongly bias the routes of DNA transfers, thereby causing phylogenetic artifacts.


Overall, these results have important implications for the application of core genome phylogenies in prokaryotes.

Publisher / Repository: Springer Science + Business Media
Journal Name: BMC Genomics
Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    To examine phylogenetic heterogeneity in turtle evolution, we collected thousands of high-confidence single-copy orthologs from 19 genome assemblies representative of extant turtle diversity and estimated a phylogeny with multispecies coalescent and concatenated partitioned methods. We also collected next-generation sequences from 26 turtle species and assembled millions of biallelic markers to reconstruct phylogenies based on annotated regions from the western painted turtle (Chrysemys picta bellii) genome (coding regions, introns, untranslated regions, intergenic, and others). We then measured gene tree-species tree discordance, as well as gene and site heterogeneity at each node in the inferred trees, and tested for temporal patterns in phylogenomic conflict across turtle evolution. We found strong and consistent support for all bifurcations in the inferred turtle species phylogenies. However, a number of genes, sites, and genomic features supported alternate relationships between turtle taxa. Our results suggest that gene tree-species tree discordance in these data sets is likely driven by population-level processes such as incomplete lineage sorting. We found very little effect of substitutional saturation on species tree topologies, and no clear phylogenetic patterns in codon usage bias and compositional heterogeneity. There was no correlation between gene and site concordance, node age, and DNA substitution rate across most annotated genomic regions. Our study demonstrates that heterogeneity is to be expected even in well-resolved clades such as turtles, and that future phylogenomic studies should aim to sample as much of the genome as possible in order to obtain accurate phylogenies for assessing conservation priorities in turtles. [Discordance; genomes; phylogeny; turtles.]

  2. INTRODUCTION

    Resolving the role that different environmental forces may have played in the apparent explosive diversification of modern placental mammals is crucial to understanding the evolutionary context of their living and extinct morphological and genomic diversity.

    RATIONALE

    Limited access to whole-genome sequence alignments that sample living mammalian biodiversity has hampered phylogenomic inference, which until now has been limited to relatively small, highly constrained sequence matrices often representing <2% of a typical mammalian genome. To eliminate this sampling bias, we used an alignment of 241 whole genomes to comprehensively identify and rigorously analyze noncoding, neutrally evolving sequence variation in coalescent and concatenation-based phylogenetic frameworks. These analyses were followed by validation with multiple classes of phylogenetically informative structural variation. This approach enabled the generation of a robust time tree for placental mammals that evaluated age variation across hundreds of genomic loci that are not restricted by protein coding annotations.

    RESULTS

    Coalescent and concatenation phylogenies inferred from multiple treatments of the data were highly congruent, including support for higher-level taxonomic groupings that unite primates+colugos with treeshrews (Euarchonta), bats+cetartiodactyls+perissodactyls+carnivorans+pangolins (Scrotifera), all scrotiferans excluding bats (Fereuungulata), and carnivorans+pangolins with perissodactyls (Zooamata). However, because these approaches infer a single best tree, they mask signatures of phylogenetic conflict that result from incomplete lineage sorting and historical hybridization. Accordingly, we also inferred phylogenies from thousands of noncoding loci distributed across chromosomes with historically contrasting recombination rates. Throughout the radiation of modern orders (such as rodents, primates, bats, and carnivores), we observed notable differences between locus trees inferred from the autosomes and the X chromosome, a pattern typical of speciation with gene flow. We show that in many cases, previously controversial phylogenetic relationships can be reconciled by examining the distribution of conflicting phylogenetic signals along chromosomes with variable historical recombination rates. Lineage divergence time estimates were notably uniform across genomic loci and robust to extensive sensitivity analyses in which the underlying data, fossil constraints, and clock models were varied. The earliest branching events in the placental phylogeny coincide with the breakup of continental landmasses and rising sea levels in the Late Cretaceous. This signature of allopatric speciation is congruent with the low genomic conflict inferred for most superordinal relationships. By contrast, we observed a second pulse of diversification immediately after the Cretaceous-Paleogene (K-Pg) extinction event superimposed on an episode of rapid land emergence. Greater geographic continuity coupled with tumultuous climatic changes and increased ecological landscape at this time provided enhanced opportunities for mammalian diversification, as depicted in the fossil record. These observations dovetail with increased phylogenetic conflict observed within clades that diversified in the Cenozoic.

    CONCLUSION

    Our genome-wide analysis of multiple classes of sequence variation provides the most comprehensive assessment of placental mammal phylogeny, resolves controversial relationships, and clarifies the timing of mammalian diversification. We propose that the combination of Cretaceous continental fragmentation and lineage isolation, followed by the direct and indirect effects of the K-Pg extinction at a time of rapid land emergence, synergistically contributed to the accelerated diversification rate of placental mammals during the early Cenozoic.

    The timing of placental mammal evolution. Superordinal mammalian diversification took place in the Cretaceous during periods of continental fragmentation and sea level rise with little phylogenomic discordance (pie charts: left, autosomes; right, X chromosome), which is consistent with allopatric speciation. By contrast, the Paleogene hosted intraordinal diversification in the aftermath of the K-Pg mass extinction event, when clades exhibited higher phylogenomic discordance consistent with speciation with gene flow and incomplete lineage sorting.
  3. Premise

    Recent advances in generating large‐scale phylogenies enable broad‐scale estimation of species diversification. These now common approaches typically are characterized by (1) incomplete species coverage without explicit sampling methodologies and/or (2) sparse backbone representation, and usually rely on presumed phylogenetic placements to account for species without molecular data. We used empirical examples to examine the effects of incomplete sampling on diversification estimation and provide constructive suggestions to ecologists and evolutionary biologists based on those results.


    We used a supermatrix for rosids and one well‐sampled subclade (Cucurbitaceae) as empirical case studies. We compared results using these large phylogenies with those based on a previously inferred, smaller supermatrix and on a synthetic tree resource with complete taxonomic coverage. Finally, we simulated random and representative taxon sampling and explored the impact of sampling on three commonly used methods, both parametric (RPANDA and BAMM) and semiparametric (DR).


    We found that the impact of sampling on diversification estimates was idiosyncratic and often strong. Compared to full empirical sampling, representative and random sampling schemes either depressed or inflated speciation rates, depending on methods and sampling schemes. No method was entirely robust to poor sampling, but BAMM was least sensitive to moderate levels of missing taxa.


    We suggest caution against uncritical modeling of missing taxa using taxonomic data for poorly sampled trees and in the use of summary backbone trees and other data sets with high representative bias, and we stress the importance of explicit sampling methodologies in macroevolutionary studies.

  4. Abstract Motivation

    The phylogenetic signal of structural variation informs a more comprehensive understanding of evolution. As (near-)complete genome assembly becomes more commonplace, the next methodological challenge for inferring genome rearrangement trees is the identification of syntenic blocks of orthologous sequences. In this article, we studied 94 reference-quality genomes, primarily of Mycobacterium tuberculosis (Mtb) isolates, as a benchmark to evaluate these methods. The clonal nature of Mtb evolution and the manageable genome sizes, along with substantial levels of structural variation, make this an ideal benchmarking dataset.


    We tested several methods for detecting homology and obtaining syntenic blocks and two methods for inferring phylogenies from them, then compared the resulting trees to the standard method’s tree, inferred from nucleotide substitutions. We found that not only the choice of methods but also their parameters can impact results, and that the tree inference method had less impact than the block determination method. Interestingly, a rearrangement tree based on blocks from the Cactus whole-genome aligner was fully compatible with the highly supported branches of the substitution-based tree, enabling the combination of the two into a high-resolution supertree. Overall, our results indicate that accurate trees can be inferred using genome rearrangements, but the choice of the methods for inferring homology requires care.

    Availability and implementation

    Analysis scripts and code written for this study are available at and

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  5. Background

    The páramo ecosystem, located above the timberline in the tropical Andes, has been the setting for some of the most dramatic plant radiations, and it is one of the world’s fastest evolving and most diverse high-altitude ecosystems. Today 144+ species of frailejones (subtribe Espeletiinae Cuatrec., Asteraceae) dominate the páramo. Frailejones have intrigued naturalists and botanists, not just for their appealing beauty and impressive morphological diversity, but also for their remarkable adaptations to the extremely harsh environmental conditions of the páramo. Previous attempts to reconstruct the evolutionary history of this group failed to resolve relationships among genera and species, and there is no agreement regarding the classification of the group. Thus, our goal was to reconstruct the phylogeny of the frailejones and to test the influence of the geography on it as a first step to understanding the patterns of radiation of these plants.


    Field expeditions in 70 páramos of Colombia and Venezuela resulted in 555 collected samples from 110 species. Additional material was obtained from herbarium specimens. Sequence data included nrDNA (ITS and ETS) and cpDNA (rpl16), for an aligned total of 2,954 bp. Fragment analysis was performed with AFLP data using 28 primer combinations and yielding 1,665 fragments. Phylogenies based on sequence data were reconstructed under maximum parsimony, maximum likelihood and Bayesian inference. The AFLP dataset was analyzed using minimum evolution. A Monte Carlo permutation test was used to infer the influence of geography on the phylogeny.


    The reconstructed phylogenies suggest that most genera are paraphyletic, but the phylogenetic signal may be confounded by hybridization and incomplete lineage sorting. A tree with all the available molecular data shows two large clades: one of primarily Venezuelan species that includes a few neighboring Colombian species; and a second clade of only Colombian species. Results from the Monte Carlo permutation test suggest a very strong influence of geography on the phylogenetic relationships. Venezuelan páramos tend to hold taxa that are more distantly related to each other than Colombian páramos, where taxa are more closely related to each other.


    Our data suggest the presence of two independent radiations: one in Venezuela and the other in Colombia. In addition, the current generic classification will need to be deeply revised. Analyses show a strong geographic structure in the phylogeny, with large clades grouped in hotspots of diversity at a regional scale, and in páramo localities at a local scale. Differences in the degrees of relatedness between sympatric species of Venezuelan and Colombian páramos may be explained by the younger age of the latter páramos and the shorter time available for speciation of Espeletiinae in them.
