skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Assessing the Adequacy of Morphological Models Using Posterior Predictive Simulations
Abstract Reconstructing the evolutionary history of different groups of organisms provides insight into how life originated and diversified on Earth. Phylogenetic trees are commonly used to estimate this evolutionary history. Within Bayesian phylogenetics a major step in estimating a tree is in choosing an appropriate model of character evolution. While the most common character data used is molecular sequence data, morphological data remains a vital source of information. The use of morphological characters allows for the incorporation fossil taxa, and despite advances in molecular sequencing, continues to play a significant role in neontology. Moreover, it is the main data source that allows us to unite extinct and extant taxa directly under the same generating process. We therefore require suitable models of morphological character evolution, the most common being the Mk Lewis model. While it is frequently used in both palaeobiology and neontology, it is not known whether the simple Mk substitution model, or any extensions to it, provide a sufficiently good description of the process of morphological evolution. In this study we investigate the impact of different morphological models on empirical tetrapod datasets. Specifically, we compare unpartitioned Mk models with those where characters are partitioned by the number of observed states, both with and without allowing for rate variation across sites and accounting for ascertainment bias. We show that the choice of substitution model has an impact on both topology and branch lengths, highlighting the importance of model choice. Through simulations, we validate the use of the model adequacy approach, posterior predictive simulations, for choosing an appropriate model. Additionally, we compare the performance of model adequacy with Bayesian model selection. We demonstrate how model selection approaches based on marginal likelihoods are not appropriate for choosing between models with partition schemes that vary in character state space (i.e., that vary in Q-matrix state size). Using posterior predictive simulations, we found that current variations of the Mk model are often performing adequately in capturing the evolutionary dynamics that generated our data. We do not find any preference for a particular model extension across multiple datasets, indicating that there is no “one size fits all” when it comes to morphological data and that careful consideration should be given to choosing models of discrete character evolution. By using suitable models of character evolution, we can increase our confidence in our phylogenetic estimates, which should in turn allow us to gain more accurate insights into the evolutionary history of both extinct and extant taxa.  more » « less
Award ID(s):
1754705 2045842
PAR ID:
10570984
Author(s) / Creator(s):
; ; ; ; ;
Editor(s):
Klopfstein, Seraina
Publisher / Repository:
Society of Systematic Biologists
Date Published:
Journal Name:
Systematic Biology
ISSN:
1063-5157
Subject(s) / Keyword(s):
Bayesian phylogenetic analysis model adequacy model selection morphological data morphological models palaeobiology
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sanmartin, I (Ed.)
    Phylogenetic trees establish a historical context for the study of organismal form and function. Most phylogenetic trees are estimated using a model of evolution. For molecular data, modeling evolution is often based on biochemical observations about changes between character states. For example, there are four nucleotides, and we can make assumptions about the probability of transitions between them. By contrast, for morphological characters, we may not know \textit{a priori} how many characters states there are per character, as both extant sampling and the fossil record may be highly incomplete, which leads to an observer bias. For a given character, the state space may be larger than what has been observed in the sample of taxa collected by the researcher. In this case, how many evolutionary rates are needed to even describe transitions between morphological character states may not be clear, potentially leading to model misspecification. To explore the impact of this model misspecification, we simulated character data with varying numbers of character states per character. We then used the data to estimate phylogenetic trees using models of evolution with the correct number of character states and an incorrect number of character states. The results of this study indicate that this observer bias may lead to phylogenetic error, particularly in the branch lengths of trees. If the state space is wrongly assumed to be too large, then we underestimate the branch lengths, and the opposite occurs when the state space is wrongly assumed to be too small. 
    more » « less
  2. Friedmann, M (Ed.)
    Abstract Phylogenetic trees establish a historical context for the study of organismal form and function. Most phylogenetic trees are estimated using a model of evolution. For molecular data, modeling evolution is often based on biochemical observations about changes between character states. For example, there are four nucleotides, and we can make assumptions about the probability of transitions between them. By contrast, for morphological characters, we may not know a priori how many characters states there are per character, as both extant sampling and the fossil record may be highly incomplete, which leads to an observer bias. For a given character, the state space may be larger than what has been observed in the sample of taxa collected by the researcher. In this case, how many evolutionary rates are needed to even describe transitions between morphological character states may not be clear, potentially leading to model misspecification. To explore the impact of this model misspecification, we simulated character data with varying numbers of character states per character. We then used the data to estimate phylogenetic trees using models of evolution with the correct number of character states and an incorrect number of character states. The results of this study indicate that this observer bias may lead to phylogenetic error, particularly in the branch lengths of trees. If the state space is wrongly assumed to be too large, then we underestimate the branch lengths, and the opposite occurs when the state space is wrongly assumed to be too small. 
    more » « less
  3. Paleontological and neontological systematics seek to answer evolutionary questions with different datasets. Phylogenies inferred for combined extant and extinct taxa provide novel insights into the evolutionary history of life. Primates have an extensive, diverse fossil record and molecular data for living and extinct taxa are rapidly becoming available. We used two models to infer the phylogeny and divergence times for living and fossil primates, the tip-dating (TD) and fossilized birth-death process (FBD). We collected new morphological data, especially on the living and extinct endemic lemurs of Madagascar. We combined the morphological data with published DNA sequences to infer near-complete (88% of lemurs) time-calibrated phylogenies. The results suggest that primates originated around the Cretaceous-Tertiary boundary, slightly earlier than indicated by the fossil record and later than previously inferred from molecular data alone. We infer novel relationships among extinct lemurs, and strong support for relationships that were previously unresolved. Dates inferred with TD were significantly older than those inferred with FBD, most likely related to an assumption of a uniform branching process in the TD compared to a birth-death process assumed in the FBD. This is the first study to combine morphological and DNA sequence data from extinct and extant primates to infer evolutionary relationships and divergence times, and our results shed new light on the tempo of lemur evolution and the efficacy of combined phylogenetic analyses. 
    more » « less
  4. null (Ed.)
    Abstract Background Matrices of morphological characters are frequently used for dating species divergence times in systematics. In some studies, morphological and molecular character data from living taxa are combined, whereas others use morphological characters from extinct taxa as well. We investigated whether morphological data produce time estimates that are concordant with molecular data. If true, it will justify the use of morphological characters alongside molecular data in divergence time inference. Results We systematically analyzed three empirical datasets from different species groups to test the concordance of species divergence dates inferred using molecular and discrete morphological data from extant taxa as test cases. We found a high correlation between their divergence time estimates, despite a poor linear relationship between branch lengths for morphological and molecular data mapped onto the same phylogeny. This was because node-to-tip distances showed a much higher correlation than branch lengths due to an averaging effect over multiple branches. We found that nodes with a large number of taxa often benefit from such averaging. However, considerable discordance between time estimates from molecules and morphology may still occur as  some intermediate nodes may show large time differences between these two types of data. Conclusions Our findings suggest that node- and tip-calibration approaches may be better suited for nodes with many taxa. Nevertheless, we highlight the importance of evaluating the concordance of intrinsic time structure in morphological and molecular data before any dating analysis using combined datasets. 
    more » « less
  5. Liliana Davalos (Ed.)
    Logical character dependency is a major conceptual and methodological problem in phylogenetic inference of morphological data sets, as it violates the assumption of character independence that is common to all phylogenetic methods. It is more frequently observed in higher-level phylogenies or in data sets characterizing major evolutionary transitions, as these represent parts of the tree of life where (primary) anatomical characters either originate or disappear entirely. As a result, secondary traits related to these primary characters become “inapplicable” across all sampled taxa in which that character is absent. Various solutions have been explored over the last three decades to handle character dependency, such as alternative character coding schemes and, more recently, new algorithmic implementations. However, the accuracy of the proposed solutions, or the impact of character dependency across distinct optimality criteria, has never been directly tested using standard performance measures. Here, we utilize simple and complex simulated morphological data sets analyzed under different maximum parsimony optimization procedures and Bayesian inference to test the accuracy of various coding and algorithmic solutions to character dependency. This is complemented by empirical analyses using a recoded data set on palaeognathid birds. We find that in small, simulated data sets, absent coding performs better than other popular coding strategies available (contingent and multistate), whereas in more complex simulations (larger data sets controlled for different tree structure and character distribution models) contingent coding is favored more frequently. Under contingent coding, a recently proposed weighting algorithm produces the most accurate results for maximum parsimony. However, Bayesian inference outperforms all parsimony-based solutions to handle character dependency due to fundamental differences in their optimization procedures—a simple alternative that has been long overlooked. Yet, we show that the more primary characters bearing secondary (dependent) traits there are in a data set, the harder it is to estimate the true phylogenetic tree, regardless of the optimality criterion, owing to a considerable expansion of the tree parameter space. 
    more » « less