Title: Comparing Likelihood Ratios to Understand Genome-Wide Variation in Phylogenetic Support
Abstract

Genomic data have only sometimes brought resolution to the tree of life. Large phylogenomic studies can reach conflicting conclusions about important relationships, with mutually exclusive hypotheses receiving strong support. Reconciling such differences requires a detailed understanding of how phylogenetic signal varies among data sets. Two complementary strategies for better understanding phylogenomic conflicts are to examine support on a locus-by-locus basis and use support values that capture a larger range of variation in phylogenetic information, such as likelihood ratios. Likelihood ratios can be calculated using either maximum or marginal likelihoods. Despite being conceptually similar, differences in how these ratios are calculated and interpreted have not been closely examined in phylogenomics. Here, we compare the behavior of maximum and marginal likelihood ratios when evaluating alternate resolutions of recalcitrant relationships among major squamate lineages. We find that these ratios are broadly correlated between loci, but the correlation is driven by extreme values. As a consequence, the proportion of loci that support a hypothesis can change depending on which ratio is used and whether smaller values are discarded. In addition, maximum likelihood ratios frequently exhibit identical support for alternate hypotheses, making conflict resolution a challenge. We find surprising support for a sister relationship between snakes and iguanians across four different phylogenomic data sets in contrast to previous empirical studies. [Bayes factors; likelihood ratios; marginal likelihood; maximum likelihood; phylogenomics; squamates.]
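The locus-by-locus comparison described above can be illustrated with a minimal sketch. It assumes each locus already has a log-likelihood (maximum or marginal) under two competing hypotheses; the function names, the `min_abs` cutoff, and the numbers are hypothetical and are not taken from the study. It shows how the proportion of loci supporting a hypothesis can shift once small ratios are discarded.

```python
from typing import Sequence


def two_ln_ratio(lnl_h1: Sequence[float], lnl_h2: Sequence[float]) -> list[float]:
    """Per-locus 2 * (lnL_H1 - lnL_H2); positive values favor H1."""
    if len(lnl_h1) != len(lnl_h2):
        raise ValueError("both hypotheses need one log-likelihood per locus")
    return [2.0 * (a - b) for a, b in zip(lnl_h1, lnl_h2)]


def proportion_supporting_h1(ratios: Sequence[float], min_abs: float = 0.0) -> float:
    """Fraction of retained loci that favor H1.

    Loci with |ratio| < min_abs are discarded, and exact ties (ratio == 0)
    are always dropped because they favor neither hypothesis.
    """
    kept = [r for r in ratios if r != 0.0 and abs(r) >= min_abs]
    if not kept:
        return float("nan")
    return sum(r > 0 for r in kept) / len(kept)


# Hypothetical log-likelihoods for five loci (illustration only).
lnl_h1 = [-1000.0, -2000.0, -1500.0, -1200.0, -1800.0]
lnl_h2 = [-1001.0, -2000.0, -1499.5, -1210.0, -1800.5]

ratios = two_ln_ratio(lnl_h1, lnl_h2)
all_loci = proportion_supporting_h1(ratios)                 # ties dropped only
strong_only = proportion_supporting_h1(ratios, min_abs=2.0) # small values discarded
```

Dropping exact ties is a choice made for this sketch; it reflects the case where two hypotheses receive identical support at a locus and so cannot be counted for either one.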

 
Award ID(s):
1950759 1950954
NSF-PAR ID:
10402540
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Systematic Biology
Volume:
71
Issue:
4
ISSN:
1063-5157
Format(s):
Medium: X Size: p. 973-985
Sponsoring Org:
National Science Foundation
More Like this
  1. Premise

    Cornales is an order of flowering plants containing ecologically and horticulturally important families, including Cornaceae (dogwoods) and Hydrangeaceae (hydrangeas), among others. While many relationships in Cornales are strongly supported by previous studies, some uncertainty remains with regard to the placement of Hydrostachyaceae and to relationships among families in Cornales and within Cornaceae. Here we analyzed hundreds of nuclear loci to test published phylogenetic hypotheses and estimated a robust species tree for Cornales.

    Methods

    Using the Angiosperms353 probe set and existing data sets, we generated phylogenomic data for 158 samples, representing all families in the Cornales, with intensive sampling in the Cornaceae.

    Results

    We curated an average of 312 genes per sample, constructed maximum likelihood gene trees, and inferred a species tree using the summary approach implemented in ASTRAL‐III, a method statistically consistent with the multispecies coalescent model.

    Conclusions

    The species tree we constructed generally shows high support values and a high degree of concordance among individual nuclear gene trees. Relationships among families are largely congruent with previous molecular studies, except for the placement of the nyssoids and the Grubbiaceae‐Curtisiaceae clades. Furthermore, we were able to place Hydrostachyaceae within Cornales, and within Cornaceae, the monophyly of known morphogroups was well supported. However, patterns of gene tree discordance suggest potential ancient reticulation, gene flow, and/or ILS in the Hydrostachyaceae lineage and the early diversification of Cornus. Our findings reveal new insights into the diversification process across Cornales and demonstrate the utility of the Angiosperms353 probe set.

     
  2. Abstract

    Next‐generation sequencing technologies (NGS) allow systematists to amass a wealth of genomic data from non‐model species for phylogenetic resolution at various temporal scales. However, phylogenetic inference for many lineages dominated by non‐model species has not yet benefited from NGS, which can complement Sanger sequencing studies. One such lineage, whose phylogenetic relationships remain uncertain, is the diverse, agriculturally important and charismatic Coreoidea (Hemiptera: Heteroptera). Given the lack of consensus on higher‐level relationships and the importance of a robust phylogeny for evolutionary hypothesis testing, we use a large data set comprised of hundreds of ultraconserved element (UCE) loci to infer the phylogeny of Coreoidea (excluding Stenocephalidae and Hyocephalidae), with emphasis on the families Coreidae and Alydidae. We generated three data sets by including alignments that contained loci sampled for at least 50%, 60%, or 70% of the total taxa, and inferred phylogeny using maximum likelihood and summary coalescent methods. Twenty‐six external morphological features used in relatively comprehensive phylogenetic analyses of coreoids were also re‐evaluated within our molecular phylogenetic framework. We recovered 439–970 loci per species (16%–36% of loci targeted) and combined this with previously generated UCE data for 12 taxa. All data sets, regardless of analytical approach, yielded topologically similar and strongly supported trees, with the exception of outgroup relationships and the position of Hydarinae. We recovered a monophyletic Coreoidea, with Rhopalidae highly supported as the sister group to Alydidae + Coreidae. Neither Alydidae nor Coreidae were monophyletic; the coreid subfamilies Hydarinae and Pseudophloeinae were recovered as more closely related to Alydidae than to other coreid subfamilies. Coreinae were paraphyletic with respect to Meropachyinae. 
    Most morphological traits were homoplastic with several clades defined by few, if any, synapomorphies. Our results demonstrate the utility of phylogenomic approaches in generating robust hypotheses for taxa with long‐standing phylogenetic problems and highlight that novel insights may come from such approaches.

     
  3. Abstract

    Reconstructing accurate historical relationships within a species poses numerous challenges, not least in many plant groups in which gene flow is high enough to extend well beyond species boundaries. Nonetheless, the extent of tree-like history within a species is an empirical question on which it is now possible to bring large amounts of genome sequence to bear. We assess phylogenetic structure across the geographic range of the saguaro cactus, an emblematic member of Cactaceae, a clade known for extensive hybridization and porous species boundaries. Using 200 Gb of whole genome resequencing data from 20 individuals sampled from 10 localities, we assembled two data sets comprising 150,000 biallelic single nucleotide polymorphisms (SNPs) from protein coding sequences. From these, we inferred within-species trees and evaluated their significance and robustness using five qualitatively different inference methods. Despite the low sequence diversity, large census population sizes, and presence of wide-ranging pollen and seed dispersal agents, phylogenetic trees were well resolved and highly consistent across both data sets and all methods. We inferred that the most likely root, based on marginal likelihood comparisons, is to the east and south of the region of highest genetic diversity, which lies along the coast of the Gulf of California in Sonora, Mexico. Together with striking decreases in marginal likelihood found to the north, this supports hypotheses that saguaro’s current range reflects postglacial expansion from the refugia in the south of its range. We conclude with observations about practical and theoretical issues raised by phylogenomic data sets within species, in which SNP-based methods must be used rather than gene tree methods that are widely used when sequence divergence is higher. These include computational scalability, inference of gene flow, and proper assessment of statistical support in the presence of linkage effects. 
    [Phylogenomics; phylogeography; rooting; Sonoran Desert.]

     
  4. Abstract

    Phylogenomic analysis of large genome-wide sequence data sets can resolve phylogenetic tree topologies for large species groups, help test the accuracy of and improve resolution for earlier multi-locus studies and reveal the level of agreement or concordance within partitions of the genome for various tree topologies. Here we used a target-capture approach to sequence 1088 single-copy exons for more than 200 labrid fishes together with more than 100 outgroup taxa to generate a new data-rich phylogeny for the family Labridae. Our time-calibrated phylogenetic analysis of exon-capture data pushes the root node age of the family Labridae back into the Cretaceous to about 79 Ma. The monotypic Centrogenys vaigiensis and the order Uranoscopiformes (stargazers) are identified as the sister lineages of Labridae. The phylogenetic relationships among major labrid subfamilies and within these clades were largely congruent with prior analyses of select mitochondrial and nuclear datasets. However, the position of the tribe Cirrhilabrini (fairy and flame wrasses) showed discordance, resolving either as the sister to a crown julidine clade or alternatively sister to a group formed by the labrines, cheilines and scarines. Exploration of this pattern using multiple approaches leads to slightly higher support for this latter hypothesis, highlighting the importance of genome-level data sets for resolving short internodes at key phylogenetic positions in a large, economically important group of coral reef fishes. More broadly, we demonstrate how accounting for sources of biological variability from incomplete lineage sorting and exploring systematic error at conflicting nodes can aid in evaluating alternative phylogenetic hypotheses. [coral reefs; divergence time estimation; exon-capture; fossil calibration; incomplete lineage sorting.]

     
  5. Abstract

    Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the cause(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted. One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini are reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]

     