It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.
more »
« less
Efficient Inference of Macrophylogenies: Insights from the Avian Tree of Life
Abstract The exponential growth of molecular sequence data over the past decade has enabled the construction of numerous clade-specific phylogenies encompassing hundreds or thousands of taxa. These independent studies often include overlapping data, presenting a unique opportunity to build macrophylogenies (phylogenies sampling >1000 taxa) for entire classes across the Tree of Life. However, the inference of large trees remains constrained by logistical, computational, and methodological challenges. The Avian Tree of Life provides an ideal model for evaluating strategies to robustly infer macrophylogenies from intersecting data sets derived from smaller studies. In this study, we leveraged a comprehensive resource of sequence capture data sets to evaluate the phylogenetic accuracy and computational costs of four methodological approaches: (1) supermatrix approaches using concatenation, including the “fast” maximum likelihood (ML) methods, (2) filtering data sets to reduce heterogeneity, (3) supertree estimation based on published phylogenomic trees, and (4) a “divide-and-conquer” strategy, wherein smaller ML trees were estimated and subsequently combined using a supertree approach. Additionally, we examined the impact of these methods on divergence time estimation using a data set that includes newly vetted fossil calibrations for the Avian Tree of Life. Our findings highlight the advantages of recently developed fast tree search approaches initiated with parsimony starting trees, which offer a reasonable compromise between computational efficiency and phylogenetic accuracy, facilitating inference of macrophylogenies.
more »
« less
- Award ID(s):
- 1655624
- PAR ID:
- 10674298
- Author(s) / Creator(s):
- ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; more »
- Editor(s):
- Thomson, Robert
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Systematic Biology
- ISSN:
- 1063-5157
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Reconstructing the global Tree of Life necessitates computational approaches to integrate numerous molecular phylogenies with limited species overlap into a comprehensive supertree. Our survey of published literature shows that individual phylogenies are frequently restricted to specific taxonomic groups due to investigators’ expertise and molecular evolutionary considerations, resulting in any given species present in a minuscule fraction of phylogenies. We present a novel approach, called the chronological supertree algorithm (Chrono-STA), that can build a supertree of species from such data by using node ages in published molecular phylogenies scaled to time. Chrono-STA builds a supertree by integrating chronological data from molecular timetrees. It fundamentally differs from existing approaches that generate consensus phylogenies from gene trees with missing taxa, as Chrono-STA does not impute nodal distances, use a guide tree as a backbone, or reduce phylogenies to quartets. Analyses of simulated and empirical datasets show that Chrono-STA can combine taxonomically restricted timetrees with extremely limited species overlap. For such data, approaches that impute missing distances or assemble phylogenetic quartets did not perform well. We conclude that integrating phylogenies via temporal dimension enhances the accuracy of reconstructed supertrees that are also scaled to time.more » « less
-
Though large multilocus genomic data sets have led to overall improvements in phylogenetic inference, they have posed the new challenge of addressing conflicting signals across the genome. In particular, ancestral population structure, which has been uncovered in a number of diverse species, can skew gene tree frequencies, thereby hindering the performance of species tree estimators. Here we develop a novel maximum likelihood method, termed TASTI (Taxa with Ancestral structure Species Tree Inference), that can infer phylogenies under such scenarios, and find that it has increasing accuracy with increasing numbers of input gene trees, contrasting with the relatively poor performances of methods not tailored for ancestral structure. Moreover, we propose a supertree approach that allows TASTI to scale computationally with increasing numbers of input taxa. We use genetic simulations to assess TASTI’s performance in the three- and four-taxon settings and demonstrate the application of TASTI on a six-species Afrotropical mosquito data set. Finally, we have implemented TASTI in an open-source software package for ease of use by the scientific community.more » « less
-
Davalos, Liliana (Ed.)Abstract Logical character dependency is a major conceptual and methodological problem in phylogenetic inference of morphological data sets, as it violates the assumption of character independence that is common to all phylogenetic methods. It is more frequently observed in higher-level phylogenies or in data sets characterizing major evolutionary transitions, as these represent parts of the tree of life where (primary) anatomical characters either originate or disappear entirely. As a result, secondary traits related to these primary characters become “inapplicable” across all sampled taxa in which that character is absent. Various solutions have been explored over the last three decades to handle character dependency, such as alternative character coding schemes and, more recently, new algorithmic implementations. However, the accuracy of the proposed solutions, or the impact of character dependency across distinct optimality criteria, has never been directly tested using standard performance measures. Here, we utilize simple and complex simulated morphological data sets analyzed under different maximum parsimony optimization procedures and Bayesian inference to test the accuracy of various coding and algorithmic solutions to character dependency. This is complemented by empirical analyses using a recoded data set on palaeognathid birds. We find that in small, simulated data sets, absent coding performs better than other popular coding strategies available (contingent and multistate), whereas in more complex simulations (larger data sets controlled for different tree structure and character distribution models) contingent coding is favored more frequently. Under contingent coding, a recently proposed weighting algorithm produces the most accurate results for maximum parsimony. However, Bayesian inference outperforms all parsimony-based solutions to handle character dependency due to fundamental differences in their optimization procedures—a simple alternative that has been long overlooked. Yet, we show that the more primary characters bearing secondary (dependent) traits there are in a data set, the harder it is to estimate the true phylogenetic tree, regardless of the optimality criterion, owing to a considerable expansion of the tree parameter space. [Bayesian inference, character dependency, character coding, distance metrics, morphological phylogenetics, maximum parsimony, performance, phylogenetic accuracy.]more » « less
-
Liliana Davalos (Ed.)Logical character dependency is a major conceptual and methodological problem in phylogenetic inference of morphological data sets, as it violates the assumption of character independence that is common to all phylogenetic methods. It is more frequently observed in higher-level phylogenies or in data sets characterizing major evolutionary transitions, as these represent parts of the tree of life where (primary) anatomical characters either originate or disappear entirely. As a result, secondary traits related to these primary characters become “inapplicable” across all sampled taxa in which that character is absent. Various solutions have been explored over the last three decades to handle character dependency, such as alternative character coding schemes and, more recently, new algorithmic implementations. However, the accuracy of the proposed solutions, or the impact of character dependency across distinct optimality criteria, has never been directly tested using standard performance measures. Here, we utilize simple and complex simulated morphological data sets analyzed under different maximum parsimony optimization procedures and Bayesian inference to test the accuracy of various coding and algorithmic solutions to character dependency. This is complemented by empirical analyses using a recoded data set on palaeognathid birds. We find that in small, simulated data sets, absent coding performs better than other popular coding strategies available (contingent and multistate), whereas in more complex simulations (larger data sets controlled for different tree structure and character distribution models) contingent coding is favored more frequently. Under contingent coding, a recently proposed weighting algorithm produces the most accurate results for maximum parsimony. However, Bayesian inference outperforms all parsimony-based solutions to handle character dependency due to fundamental differences in their optimization procedures—a simple alternative that has been long overlooked. Yet, we show that the more primary characters bearing secondary (dependent) traits there are in a data set, the harder it is to estimate the true phylogenetic tree, regardless of the optimality criterion, owing to a considerable expansion of the tree parameter space.more » « less
An official website of the United States government

