skip to main content

Title: Species Tree Inference Methods Intended to Deal with Incomplete Lineage Sorting Are Robust to the Presence of Paralogs
Abstract Many recent phylogenetic methods have focused on accurately inferring species trees when there is gene tree discordance due to incomplete lineage sorting (ILS). For almost all of these methods, and for phylogenetic methods in general, the data for each locus are assumed to consist of orthologous, single-copy sequences. Loci that are present in more than a single copy in any of the studied genomes are excluded from the data. These steps greatly reduce the number of loci available for analysis. The question we seek to answer in this study is: what happens if one runs such species tree inference methods on data where paralogy is present, in addition to or without ILS being present? Through simulation studies and analyses of two large biological data sets, we show that running such methods on data with paralogs can still provide accurate results. We use multiple different methods, some of which are based directly on the multispecies coalescent model, and some of which have been proven to be statistically consistent under it. We also treat the paralogous loci in multiple ways: from explicitly denoting them as paralogs, to randomly selecting one copy per species. In all cases, the inferred species trees are as accurate as equivalent analyses using single-copy orthologs. Our results have significant implications for the use of ILS-aware phylogenomic analyses, demonstrating that they do not have to be restricted to single-copy loci. This will greatly increase the amount of data that can be used for phylogenetic inference.[Gene duplication and loss; incomplete lineage sorting; multispecies coalescent; orthology; paralogy.]  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ;
Kubatko, Laura
Date Published:
Journal Name:
Systematic Biology
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ruane, Sara (Ed.)
    Abstract Some phylogenetic problems remain unresolved even when large amounts of sequence data are analyzed and methods that accommodate processes such as incomplete lineage sorting are employed. In addition to investigating biological sources of phylogenetic incongruence, it is also important to reduce noise in the phylogenomic dataset by using appropriate filtering approach that addresses gene tree estimation errors. We present the results of a case study in manakins, focusing on the very difficult clade comprising the genera Antilophia and Chiroxiphia. Previous studies suggest that Antilophia is nested within Chiroxiphia, though relationships among Antilophia+Chiroxiphia species have been highly unstable. We extracted more than 11,000 loci (ultra-conserved elements and introns) from whole genomes and conducted analyses using concatenation and multispecies coalescent methods. Topologies resulting from analyses using all loci differed depending on the data type and analytical method, with 2 clades (Antilophia+Chiroxiphia and Manacus+Pipra+Machaeopterus) in the manakin tree showing incongruent results. We hypothesized that gene trees that conflicted with a long coalescent branch (e.g., the branch uniting Antilophia+Chiroxiphia) might be enriched for cases of gene tree estimation error, so we conducted analyses that either constrained those gene trees to include monophyly of Antilophia+Chiroxiphia or excluded these loci. While constraining trees reduced some incongruence, excluding the trees led to completely congruent species trees, regardless of the data type or model of sequence evolution used. We found that a suite of gene metrics (most importantly the number of informative sites and likelihood of intralocus recombination) collectively explained the loci that resulted in non-monophyly of Antilophia+Chiroxiphia. We also found evidence for introgression that may have contributed to the discordant topologies we observe in Antilophia+Chiroxiphia and led to deviations from expectations given the multispecies coalescent model. Our study highlights the importance of identifying factors that can obscure phylogenetic signal when dealing with recalcitrant phylogenetic problems, such as gene tree estimation error, incomplete lineage sorting, and reticulation events. [Birds; c-gene; data type; gene estimation error; model fit; multispecies coalescent; phylogenomics; reticulation] 
    more » « less
  2. Nielsen, Rasmus (Ed.)

    An important goal of evolutionary genomics is to identify genomic regions whose substitution rates differ among lineages. For example, genomic regions experiencing accelerated molecular evolution in some lineages may provide insight into links between genotype and phenotype. Several comparative genomics methods have been developed to identify genomic accelerations between species, including a Bayesian method called PhyloAcc, which models shifts in substitution rate in multiple target lineages on a phylogeny. However, few methods consider the possibility of discordance between the trees of individual loci and the species tree due to incomplete lineage sorting, which might cause false positives. Here, we present PhyloAcc-GT, which extends PhyloAcc by modeling gene tree heterogeneity. Given a species tree, we adopt the multispecies coalescent model as the prior distribution of gene trees, use Markov chain Monte Carlo (MCMC) for inference, and design novel MCMC moves to sample gene trees efficiently. Through extensive simulations, we show that PhyloAcc-GT outperforms PhyloAcc and other methods in identifying target lineage-specific accelerations and detecting complex patterns of rate shifts, and is robust to specification of population size parameters. PhyloAcc-GT is usually more conservative than PhyloAcc in calling convergent rate shifts because it identifies more accelerations on ancestral than on terminal branches. We apply PhyloAcc-GT to two examples of convergent evolution: flightlessness in ratites and marine mammal adaptations, and show that PhyloAcc-GT is a robust tool to identify shifts in substitution rate associated with specific target lineages while accounting for incomplete lineage sorting.

    more » « less
  3. Abstract

    With the continued adoption of genome‐scale data in evolutionary biology comes the challenge of adequately harnessing the information to make accurate phylogenetic inferences. Coalescent‐based methods of species tree inference have become common, and concatenation has been shown in simulation to perform well, particularly when levels of incomplete lineage sorting are low. However, simulation conditions are often overly simplistic, leaving empiricists with uncertainty regarding analytical tools. We use a large ultraconserved element data set (>3,000 loci) from rattlesnakes of theCrotalus triseriatusgroup to delimit lineages and estimate species trees using concatenation and several coalescent‐based methods. Unpartitioned and partitioned maximum likelihood and Bayesian analysis of the concatenated matrix yield a topology identical to coalescent analysis of a subset of the data inbpp. ASTRAL analysis on a subset of the more variable loci also results in a tree consistent with concatenation andbpp, whereas theSVDquartetsphylogeny differs at additional nodes. The size of the concatenated matrix has a strong effect on species tree inference usingSVDquartets, warranting additional investigation on optimal data characteristics for this method. Species delimitation analyses suggest up to 16 unique lineages may be present within theC. triseriatusgroup, with divergences occurring during the Neogene and Quaternary. Network analyses suggest hybridization within the group is relatively rare. Altogether, our results reaffirm the Mexican highlands as a biodiversity hotspot and suggest that coalescent‐based species tree inference on data subsets can provide a strongly supported species tree consistent with concatenation of all loci with a large amount of missing data.

    more » « less
  4. Abstract

    To examine phylogenetic heterogeneity in turtle evolution, we collected thousands of high-confidence single-copy orthologs from 19 genome assemblies representative of extant turtle diversity and estimated a phylogeny with multispecies coalescent and concatenated partitioned methods. We also collected next-generation sequences from 26 turtle species and assembled millions of biallelic markers to reconstruct phylogenies based on annotated regions from the western painted turtle (Chrysemys picta bellii) genome (coding regions, introns, untranslated regions, intergenic, and others). We then measured gene tree-species tree discordance, as well as gene and site heterogeneity at each node in the inferred trees, and tested for temporal patterns in phylogenomic conflict across turtle evolution. We found strong and consistent support for all bifurcations in the inferred turtle species phylogenies. However, a number of genes, sites, and genomic features supported alternate relationships between turtle taxa. Our results suggest that gene tree-species tree discordance in these data sets is likely driven by population-level processes such as incomplete lineage sorting. We found very little effect of substitutional saturation on species tree topologies, and no clear phylogenetic patterns in codon usage bias and compositional heterogeneity. There was no correlation between gene and site concordance, node age, and DNA substitution rate across most annotated genomic regions. Our study demonstrates that heterogeneity is to be expected even in well-resolved clades such as turtles, and that future phylogenomic studies should aim to sample as much of the genome as possible in order to obtain accurate phylogenies for assessing conservation priorities in turtles. [Discordance; genomes; phylogeny; turtles.]

    more » « less
  5. Abstract

    Abstract.—Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting (ILS), introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call mixtures across sites and trees (MAST). This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of ILS in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of 4 Platyrrhine species for which standard concatenated maximum likelihood (ML) and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e., the largest proportion of sites) to the tree also supported by gene tree approaches. These results suggest that the MAST model is able to analyze a concatenated alignment using ML while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.

    more » « less