skip to main content

Title: Phase Resolution of Heterozygous Sites in Diploid Genomes is Important to Phylogenomic Analysis under the Multispecies Coalescent Model
Abstract Genome sequencing projects routinely generate haploid consensus sequences from diploid genomes, which are effectively chimeric sequences with the phase at heterozygous sites resolved at random. The impact of phasing errors on phylogenomic analyses under the multispecies coalescent (MSC) model is largely unknown. Here, we conduct a computer simulation to evaluate the performance of four phase-resolution strategies (the true phase resolution, the diploid analytical integration algorithm which averages over all phase resolutions, computational phase resolution using the program PHASE, and random resolution) on estimation of the species tree and evolutionary parameters in analysis of multilocus genomic data under the MSC model. We found that species tree estimation is robust to phasing errors when species divergences were much older than average coalescent times but may be affected by phasing errors when the species tree is shallow. Estimation of parameters under the MSC model with and without introgression is affected by phasing errors. In particular, random phase resolution causes serious overestimation of population sizes for modern species and biased estimation of cross-species introgression probability. In general, the impact of phasing errors is greater when the mutation rate is higher, the data include more samples per species, and the species tree is shallower more » with recent divergences. Use of phased sequences inferred by the PHASE program produced small biases in parameter estimates. We analyze two real data sets, one of East Asian brown frogs and another of Rocky Mountains chipmunks, to demonstrate that heterozygote phase-resolution strategies have similar impacts on practical data analyses. We suggest that genome sequencing projects should produce unphased diploid genotype sequences if fully phased data are too challenging to generate, and avoid haploid consensus sequences, which have heterozygous sites phased at random. In case the analytical integration algorithm is computationally unfeasible, computational phasing prior to population genomic analyses is an acceptable alternative. [BPP; introgression; multispecies coalescent; phase; species tree.] « less
Authors:
; ; ; ;
Editors:
Thomson, Robert
Award ID(s):
2023723
Publication Date:
NSF-PAR ID:
10334630
Journal Name:
Systematic Biology
Volume:
71
Issue:
2
Page Range or eLocation-ID:
334 to 352
ISSN:
1063-5157
Sponsoring Org:
National Science Foundation
More Like this
  1. Baldauf, Sandra (Ed.)
    Abstract The southwestern and central United States serve as an ideal region to test alternative hypotheses regarding biotic diversification. Genomic data can now be combined with sophisticated computational models to quantify the impacts of paleoclimate change, geographic features, and habitat heterogeneity on spatial patterns of genetic diversity. In this study, we combine thousands of genotyping-by-sequencing (GBS) loci with mtDNA sequences (ND1) from the Texas horned lizard (Phrynosoma cornutum) to quantify relative support for different catalysts of diversification. Phylogenetic and clustering analyses of the GBS data indicate support for at least three primary populations. The spatial distribution of populations appears concordant with habitat type, with desert populations in AZ and NM showing the largest genetic divergence from the remaining populations. The mtDNA data also support a divergent desert population, but other relationships differ and suggest mtDNA introgression. Genotype–environment association with bioclimatic variables supports divergence along precipitation gradients more than along temperature gradients. Demographic analyses support a complex history, with introgression and gene flow playing an important role during diversification. Bayesian multispecies coalescent analyses with introgression (MSci) analyses also suggest that gene flow occurred between populations. Paleo-species distribution models support two southern refugia that geographically correspond to contemporary lineages. We find thatmore »divergence times are underestimated and population sizes are overestimated when introgression occurred and is ignored in coalescent analyses, and furthermore, inference of ancient introgression events and demographic history is sensitive to inclusion of a single recently admixed sample. Our analyses cannot refute the riverine barrier or glacial refugia hypotheses. Results also suggest that populations are continuing to diverge along habitat gradients. Finally, the strong evidence of admixture, gene flow, and mtDNA introgression among populations suggests that P. cornutum should be considered a single widespread species under the General Lineage Species Concept.« less
  2. de los Campos, G (Ed.)
    Abstract De novo genome assembly is essential for genomic research. High-quality genomes assembled into phased pseudomolecules are challenging to produce and often contain assembly errors because of repeats, heterozygosity, or the chosen assembly strategy. Although algorithms that produce partially phased assemblies exist, haploid draft assemblies that may lack biological information remain favored because they are easier to generate and use. We developed HaploSync, a suite of tools that produces fully phased, chromosome-scale diploid genome assemblies, and performs extensive quality control to limit assembly artifacts. HaploSync scaffolds sequences from a draft diploid assembly into phased pseudomolecules guided by a genetic map and/or the genome of a closely related species. HaploSync generates a report that visualizes the relationships between current and legacy sequences, for both haplotypes, and displays their gene and marker content. This quality control helps the user identify misassemblies and guides Haplosync’s correction of scaffolding errors. Finally, HaploSync fills assembly gaps with unplaced sequences and resolves collapsed homozygous regions. In a series of plant, fungal, and animal kingdom case studies, we demonstrate that HaploSync efficiently increases the assembly contiguity of phased chromosomes, improves completeness by filling gaps, corrects scaffolding, and correctly phases highly heterozygous, complex regions.
  3. Comprising 501 genera and around 14,000 species, Papilionoideae is not only the largest subfamily of Fabaceae (Leguminosae; legumes), but also one of the most extraordinarily diverse clades among angiosperms. Papilionoids are a major source of food and forage, are ecologically successful in all major biomes, and display dramatic variation in both floral architecture and plastid genome (plastome) structure. Plastid DNA-based phylogenetic analyses have greatly improved our understanding of relationships among the major groups of Papilionoideae, yet the backbone of the subfamily phylogeny remains unresolved. In this study, we sequenced and assembled 39 new plastomes that are covering key genera representing the morphological diversity in the subfamily. From 244 total taxa, we produced eight datasets for maximum likelihood (ML) analyses based on entire plastomes and/or concatenated sequences of 77 protein-coding sequences (CDS) and two datasets for multispecies coalescent (MSC) analyses based on individual gene trees. We additionally produced a combined nucleotide dataset comprising CDS plus matK gene sequences only, in which most papilionoid genera were sampled. A ML tree based on the entire plastome maximally supported all of the deep and most recent divergences of papilionoids (223 out of 236 nodes). The Swartzieae, ADA (Angylocalyceae, Dipterygeae, and Amburaneae), Cladrastis, Andira, andmore »Exostyleae clades formed a grade to the remainder of the Papilionoideae, concordant with nine ML and two MSC trees. Phylogenetic relationships among the remaining five papilionoid lineages (Vataireoid, Dermatophyllum , Genistoid s.l., Dalbergioid s.l., and Baphieae + Non-Protein Amino Acid Accumulating or NPAAA clade) remained uncertain, because of insufficient support and/or conflicting relationships among trees. Our study fully resolved most of the deep nodes of Papilionoideae, however, some relationships require further exploration. More genome-scale data and rigorous analyses are needed to disentangle phylogenetic relationships among the five remaining lineages.« less
  4. Building the Tree of Life (ToL) is a major challenge of modern biology, requiring advances in cyberinfrastructure, data collection, theory, and more. Here, we argue that phylogenomics stands to benefit by embracing the many heterogeneous genomic signals emerging from the first decade of large-scale phylogenetic analysis spawned by high-throughput sequencing (HTS). Such signals include those most commonly encountered in phylogenomic datasets, such as incomplete lineage sorting, but also those reticulate processes emerging with greater frequency, such as recombination and introgression. Here we focus specifically on how phylogenetic methods can accommodate the heterogeneity incurred by such population genetic processes; we do not discuss phylogenetic methods that ignore such processes, such as concatenation or supermatrix approaches or supertrees. We suggest that methods of data acquisition and the types of markers used in phylogenomics will remain restricted until a posteriori methods of marker choice are made possible with routine whole-genome sequencing of taxa of interest. We discuss limitations and potential extensions of a model supporting innovation in phylogenomics today, the multispecies coalescent model (MSC). Macroevolutionary models that use phylogenies, such as character mapping, often ignore the heterogeneity on which building phylogenies increasingly rely and suggest that assimilating such heterogeneity is an important goalmore »moving forward. Finally, we argue that an integrative cyberinfrastructure linking all steps of the process of building the ToL, from specimen acquisition in the field to publication and tracking of phylogenomic data, as well as a culture that values contributors at each step, are essential for progress.

    « less
  5. Abstract

    In the past two decades, genomic data have been widely used to detect historical gene flow between species in a variety of plants and animals. The Tamias quadrivittatus group of North America chipmunks, which originated through a series of rapid speciation events, are known to undergo massive amounts of mitochondrial introgression. Yet in a recent analysis of targeted nuclear loci from the group, no evidence for cross-species introgression was detected, indicating widespread cytonuclear discordance. The study used the heuristic method HYDE to detect gene flow, which may suffer from low power. Here we use the Bayesian method implemented in the program BPP to re-analyze these data. We develop a Bayesian test of introgression, calculating the Bayes factor via the Savage-Dickey density ratio using the Markov chain Monte Carlo (MCMC) sample under the model of introgression. We take a stepwise approach to constructing an introgression model by adding introgression events onto a well-supported binary species tree. The analysis detected robust evidence for multiple ancient introgression events affecting the nuclear genome, with introgression probabilities reaching 63%. We estimate population parameters and highlight the fact that species divergence times may be seriously underestimated if ancient cross-species gene flow is ignored in themore »analysis. We examine the assumptions and performance of HYDE and demonstrate that it lacks power if gene flow occurs between sister lineages or if the mode of gene flow does not match the assumed hybrid-speciation model with symmetrical population sizes. Our analyses highlight the power of likelihood-based inference of cross-species gene flow using genomic sequence data. [Bayesian test; BPP; chipmunks; introgression; MSci; multispecies coalescent; Savage-Dickey density ratio.]

    « less