skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Relative Efficiencies of Simple and Complex Substitution Models in Estimating Divergence Times in Phylogenomics
Abstract The conventional wisdom in molecular evolution is to apply parameter-rich models of nucleotide and amino acid substitutions for estimating divergence times. However, the actual extent of the difference between time estimates produced by highly complex models compared with those from simple models is yet to be quantified for contemporary data sets that frequently contain sequences from many species and genes. In a reanalysis of many large multispecies alignments from diverse groups of taxa, we found that the use of the simplest models can produce divergence time estimates and credibility intervals similar to those obtained from the complex models applied in the original studies. This result is surprising because the use of simple models underestimates sequence divergence for all the data sets analyzed. We found three fundamental reasons for the observed robustness of time estimates to model complexity in many practical data sets. First, the estimates of branch lengths and node-to-tip distances under the simplest model show an approximately linear relationship with those produced by using the most complex models applied on data sets with many sequences. Second, relaxed clock methods automatically adjust rates on branches that experience considerable underestimation of sequence divergences, resulting in time estimates that are similar to those from complex models. And, third, the inclusion of even a few good calibrations in an analysis can reduce the difference in time estimates from simple and complex models. The robustness of time estimates to model complexity in these empirical data analyses is encouraging, because all phylogenomics studies use statistical models that are oversimplified descriptions of actual evolutionary substitution processes.  more » « less
Award ID(s):
1661218
PAR ID:
10174934
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Molecular Biology and Evolution
Volume:
37
Issue:
6
ISSN:
0737-4038
Page Range / eLocation ID:
1819 to 1831
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ho, Simon (Ed.)
    Abstract Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.] 
    more » « less
  2. Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve under high rates of evolution, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers ‘‘full-length,’’ represents this ‘‘backbone alignment’’ using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses k > 1 HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments. 
    more » « less
  3. Abstract Simultaneous molecular dating of population and species divergences is essential in many biological investigations, including phylogeography, phylodynamics and species delimitation studies. In these investigations, multiple sequence alignments consist of both intra‐ and interspecies samples (mixed samples). As a result, the phylogenetic trees contain interspecies, interpopulation and within‐population divergences. Bayesian relaxed clock methods are often employed in these analyses, but they assume the same tree prior for both inter‐ and intraspecies branching processes and require specification of a clock model for branch rates (independent vs. autocorrelated rates models). We evaluated the impact of a single tree prior on Bayesian divergence time estimates by analysing computer‐simulated data sets. We also examined the effect of the assumption of independence of evolutionary rate variation among branches when the branch rates are autocorrelated. Bayesian approach with coalescent tree priors generally produced excellent molecular dates and highest posterior densities with high coverage probabilities. We also evaluated the performance of a non‐Bayesian method, RelTime, which does not require the specification of a tree prior or a clock model. RelTime's performance was similar to that of the Bayesian approach, suggesting that it is also suitable to analyse data sets containing both populations and species variation when its computational efficiency is needed. 
    more » « less
  4. Abstract An occupancy model makes use of data that are structured as sets of repeated visits to each of many sites, in order to estimate the actual probability of occupancy (i.e. proportion of occupied sites) after correcting for imperfect detection using the information contained in the sets of repeated observations. We explore the conditions under which preexisting, volunteer-collected data from the citizen science project eBird can be used for fitting occupancy models. Because the majority of eBird’s data are not collected in the form of repeated observations at individual locations, we explore 2 ways in which the single-visit records could be used in occupancy models. First, we assess the potential for space-for-time substitution: aggregating single-visit records from different locations within a region into pseudo-repeat visits. On average, eBird’s observers did not make their observations at locations that were representative of the habitat in the surrounding area, which would lead to biased estimates of occupancy probabilities when using space-for-time substitution. Thus, the use of space-for-time substitution is not always appropriate. Second, we explored the utility of including data from single-visit records to supplement sets of repeated-visit data. In a simulation study we found that inclusion of single-visit records increased the precision of occupancy estimates, but only when detection probabilities are high. When detection probability was low, the addition of single-visit records exacerbated biases in estimates of occupancy probability. We conclude that subsets of data from eBird, and likely from similar projects, can be used for occupancy modeling either using space-for-time substitution or supplementing repeated-visit data with data from single-visit records. The appropriateness of either alternative will depend on the goals of a study and on the probabilities of detection and occupancy of the species of interest. 
    more » « less
  5. Slotte, Tanja (Ed.)
    Abstract Intracellular transfers of mitochondrial DNA continue to shape nuclear genomes. Chromosome 2 of the model plant Arabidopsis thaliana contains one of the largest known nuclear insertions of mitochondrial DNA (numts). Estimated at over 600 kb in size, this numt is larger than the entire Arabidopsis mitochondrial genome. The primary Arabidopsis nuclear reference genome contains less than half of the numt because of its structural complexity and repetitiveness. Recent data sets generated with improved long-read sequencing technologies (PacBio HiFi) provide an opportunity to finally determine the accurate sequence and structure of this numt. We performed a de novo assembly using sequencing data from recent initiatives to span the Arabidopsis centromeres, producing a gap-free sequence of the Chromosome 2 numt, which is 641 kb in length and has 99.933% nucleotide sequence identity with the actual mitochondrial genome. The numt assembly is consistent with the repetitive structure previously predicted from fiber-based fluorescent in situ hybridization. Nanopore sequencing data indicate that the numt has high levels of cytosine methylation, helping to explain its biased spectrum of nucleotide sequence divergence and supporting previous inferences that it is transcriptionally inactive. The original numt insertion appears to have involved multiple mitochondrial DNA copies with alternative structures that subsequently underwent an additional duplication event within the nuclear genome. This work provides insights into numt evolution, addresses one of the last unresolved regions of the Arabidopsis reference genome, and represents a resource for distinguishing between highly similar numt and mitochondrial sequences in studies of transcription, epigenetic modifications, and de novo mutations. 
    more » « less