Title: Estimating rates and patterns of diversification with incomplete sampling: a case study in the rosids
Premise: Recent advances in generating large‐scale phylogenies enable broad‐scale estimation of species diversification. These now common approaches typically are characterized by (1) incomplete species coverage without explicit sampling methodologies and/or (2) sparse backbone representation, and usually rely on presumed phylogenetic placements to account for species without molecular data. We used empirical examples to examine the effects of incomplete sampling on diversification estimation and provide constructive suggestions to ecologists and evolutionary biologists based on those results.
Methods: We used a supermatrix for rosids and one well‐sampled subclade (Cucurbitaceae) as empirical case studies. We compared results using these large phylogenies with those based on a previously inferred, smaller supermatrix and on a synthetic tree resource with complete taxonomic coverage. Finally, we simulated random and representative taxon sampling and explored the impact of sampling on three commonly used methods, both parametric (RPANDA and BAMM) and semiparametric (DR).
Results: We found that the impact of sampling on diversification estimates was idiosyncratic and often strong. Compared to full empirical sampling, representative and random sampling schemes either depressed or inflated speciation rates, depending on methods and sampling schemes. No method was entirely robust to poor sampling, but BAMM was least sensitive to moderate levels of missing taxa.
Conclusions: We suggest caution against uncritical modeling of missing taxa using taxonomic data for poorly sampled trees and in the use of summary backbone trees and other data sets with high representative bias, and we stress the importance of explicit sampling methodologies in macroevolutionary studies.
Award ID(s):
1916632
PAR ID:
10457434
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
American Journal of Botany
Volume:
107
Issue:
6
ISSN:
0002-9122
Format(s):
Medium: X
Size(s):
p. 895-909
Sponsoring Org:
National Science Foundation
More Like this
  1. O'Connell, Mary (Ed.)
    Abstract The data available for reconstructing molecular phylogenies have become wildly disparate. Phylogenomic studies can generate data for thousands of genetic markers for dozens of species, but for hundreds of other taxa, data may be available from only a few genes. Can these two types of data be integrated to combine the advantages of both, addressing the relationships of hundreds of species with thousands of genes? Here, we show that this is possible, using data from frogs. We generated a phylogenomic data set for 138 ingroup species and 3,784 nuclear markers (ultraconserved elements [UCEs]), including new UCE data from 70 species. We also assembled a supermatrix data set, including data from 97% of frog genera (441 total), with 1–307 genes per taxon. We then produced a combined phylogenomic–supermatrix data set (a “gigamatrix”) containing 441 ingroup taxa and 4,091 markers but with 86% missing data overall. Likelihood analysis of the gigamatrix yielded a generally well-supported tree among families, largely consistent with trees from the phylogenomic data alone. All terminal taxa were placed in the expected families, even though 42.5% of these taxa each had >99.5% missing data and 70.2% had >90% missing data. Our results show that missing data need not be an impediment to successfully combining very large phylogenomic and supermatrix data sets, and they open the door to new studies that simultaneously maximize sampling of genes and taxa. 
  2. Abstract Premise: To date, phylogenetic relationships within the monogeneric Brunelliaceae have been based on morphological evidence, which does not provide sufficient phylogenetic resolution. Here we use target‐enriched nuclear data to improve our understanding of phylogenetic relationships in the family. Methods: We used the Angiosperms353 toolkit for targeted recovery of exonic regions and supercontigs (exons + introns) from low copy nuclear genes from 53 of 70 species in Brunellia, and several outgroup taxa. We removed loci that indicated biased inference of relationships and applied concatenated and coalescent methods to infer Brunellia phylogeny. We identified conflicts among gene trees that may reflect hybridization or incomplete lineage sorting events and assessed their impact on phylogenetic inference. Finally, we performed ancestral‐state reconstructions of morphological traits and assessed the homology of character states used to define sections and subsections in Brunellia. Results: Brunellia comprises two major clades and several subclades. Most of these clades/subclades do not correspond to previous infrageneric taxa. There is high topological incongruence among the subclades across analyses. Conclusions: Phylogenetic reconstructions point to rapid species diversification in Brunelliaceae, reflected in very short branches between successive species splits. The removal of putatively biased loci slightly improves phylogenetic support for individual clades. Reticulate evolution due to hybridization and/or incomplete lineage sorting likely both contribute to gene‐tree discordance. Morphological characters used to define taxa in current classification schemes are homoplastic in the ancestral character‐state reconstructions. While target enrichment data allows us to broaden our understanding of diversification in Brunellia, the relationships among subclades remain incompletely understood.
  3. Abstract Time‐scaled phylogenies underpin the interrogation of evolutionary processes across deep timescales, as well as attempts to link these to Earth's history. By inferring the placement of fossils and using their ages as temporal constraints, tip dating under the fossilized birth–death (FBD) process provides a coherent prior on divergence times. At the same time, it also links topological and temporal accuracy, as incorrectly placed fossil terminals should misinform divergence times. This could pose serious issues for obtaining accurate node ages, yet the interaction between topological and temporal error has not been thoroughly explored. We simulate phylogenies and associated morphological datasets using methodologies that incorporate evolution under selection, and are benchmarked against empirical datasets. We find that datasets of 300 characters and realistic levels of missing data generally succeed in inferring the correct placement of fossils on a constrained extant backbone topology, and that true node ages are usually contained within Bayesian posterior distributions. While increased fossil sampling improves the accuracy of inferred ages, topological and temporal errors do not seem to be linked: analyses in which fossils resolve less accurately do not exhibit elevated errors in node age estimates. At the same time, inferred divergence times are biased, probably due to a mismatch between the FBD prior and the shape of our simulated trees. While these results are encouraging, suggesting that even fossils with uncertain affinities can provide useful temporal information, they also emphasize that palaeontological information cannot overturn discrepancies between model priors and the true diversification history. 
  4. Abstract Estimating speciation and extinction rates is essential for understanding past and present biodiversity, but is challenging given the incompleteness of the rock and fossil records. Interest in this topic has led to a divergent suite of independent methods—paleontological estimates based on sampled stratigraphic ranges and phylogenetic estimates based on the observed branching times in a given phylogeny of living species. The fossilized birth–death (FBD) process is a model that explicitly recognizes that the branching events in a phylogenetic tree and sampled fossils were generated by the same underlying diversification process. A crucial advantage of this model is that it incorporates the possibility that some species may never be sampled. Here, we present an FBD model that estimates tree-wide diversification rates from stratigraphic range data when the underlying phylogeny of the fossil taxa may be unknown. The model can be applied when only occurrence data for taxonomically identified fossils are available, but still accounts for the incomplete phylogenetic structure of the data. We tested this new model using simulations and focused on how inferences are impacted by incomplete fossil recovery. We compared our approach with a phylogenetic model that does not incorporate incomplete species sampling and to three fossil-based alternatives for estimating diversification rates, including the widely implemented boundary-crosser and three-timer methods. The results of our simulations demonstrate that estimates under the FBD model are robust and more accurate than the alternative methods, particularly when fossil data are sparse, as the FBD model incorporates incomplete species sampling explicitly. 
  5. Abstract Motivation: Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. Results: In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy. Availability and implementation: CASTLES is available at https://github.com/ytabatabaee/CASTLES.