Title: Detecting and Removing Sample Contamination in Phylogenomic Data: An Example and its Implications for Cicadidae Phylogeny (Insecta: Hemiptera)
Abstract

Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the cause(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted.
One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini is reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]
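
The bimodality idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes the patristic distance between one pair of samples has already been extracted from each gene tree, and it substitutes a simple largest-gap split for a formal bimodality test. The function name `flag_contaminated_loci` and the thresholds `near_zero` and `min_separation` are hypothetical and would need tuning for real data.

```python
def flag_contaminated_loci(distances, near_zero=0.01, min_separation=0.1):
    """Flag loci whose patristic distance falls in a near-zero peak.

    distances: dict mapping locus name -> patristic distance between
    two samples in that locus's gene tree (assumed precomputed).
    Returns the set of flagged loci, or an empty set when the
    distribution does not look bimodal with a peak close to zero.
    Both thresholds are illustrative assumptions, not published values.
    """
    if len(distances) < 4:
        return set()  # too few loci to say anything about the shape

    ordered = sorted(distances.items(), key=lambda item: item[1])
    values = [dist for _, dist in ordered]

    # The largest gap between consecutive sorted distances is the
    # candidate boundary between the two modes.
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    split = max(range(len(gaps)), key=gaps.__getitem__)

    lower_max = values[split]       # largest distance in the lower mode
    gap_width = gaps[split]

    # Require the lower mode to sit near zero and to be well separated
    # from the rest; otherwise treat the distribution as unimodal.
    if lower_max <= near_zero and gap_width >= min_separation:
        return {locus for locus, _ in ordered[: split + 1]}
    return set()


if __name__ == "__main__":
    pair = {"L1": 0.001, "L2": 0.002, "L3": 0.30,
            "L4": 0.28, "L5": 0.35, "L6": 0.0}
    print(sorted(flag_contaminated_loci(pair)))
```

Requiring a wide gap in addition to a near-zero lower mode keeps genuinely close relatives, whose distances are small at every locus, from being flagged. In practice the check would be repeated for every pair of samples in the data set.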

 
Award ID(s):
1655891
NSF-PAR ID:
10368842
Publisher / Repository:
Oxford University Press
Journal Name:
Systematic Biology
ISSN:
1063-5157
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The infraorder Mygalomorphae is one of the three main lineages of spiders comprising over 3000 nominal species. This ancient group has a worldwide distribution that includes among its ranks large and charismatic taxa such as tarantulas, trapdoor spiders, and highly venomous funnel-web spiders. Based on past molecular studies using Sanger-sequencing approaches, numerous mygalomorph families (e.g., Hexathelidae, Ctenizidae, Cyrtaucheniidae, Dipluridae, and Nemesiidae) have been identified as non-monophyletic. However, these data were unable to sufficiently resolve the higher-level (intra- and interfamilial) relationships such that the necessary changes in classification could be made with confidence. Here, we present a comprehensive phylogenomic treatment of the spider infraorder Mygalomorphae. We employ 472 loci obtained through anchored hybrid enrichment to reconstruct relationships among all the mygalomorph spider families and estimate the timeframe of their diversification. We sampled nearly all currently recognized families, which has allowed us to assess their status, and as a result, propose a new classification scheme. Our generic-level sampling has also provided an evolutionary framework for revisiting questions regarding silk use in mygalomorph spiders. The first such analysis for the group within a strict phylogenetic framework shows that a sheet web is likely the plesiomorphic condition for mygalomorphs, as well as providing insights to the ancestral foraging behavior for all spiders. Our divergence time estimates, concomitant with detailed biogeographic analysis, suggest that both ancient continental-level vicariance and more recent dispersal events have played an important role in shaping modern day distributional patterns. Based on our results, we relimit the generic composition of the Ctenizidae, Cyrtaucheniidae, Dipluridae, and Nemesiidae. 
We also elevate five subfamilies to family rank: Anamidae (NEW RANK), Euagridae (NEW RANK), Ischnothelidae (NEW RANK), Pycnothelidae (NEW RANK), and Bemmeridae (NEW RANK). Three families Entypesidae (NEW FAMILY), Microhexuridae (NEW FAMILY), and Stasimopidae (NEW FAMILY), and one subfamily Australothelinae (NEW SUBFAMILY) are newly proposed. Such a major rearrangement in classification, recognizing nine newly established family-level rank taxa, is the largest the group has seen in over three decades. [Biogeography; molecular clocks; phylogenomics; spider web foraging; taxonomy.]

     
  2. Wiegmann, Brian (Ed.)
    Abstract

    Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.]
  3. Abstract

    Next‐generation sequencing technologies (NGS) allow systematists to amass a wealth of genomic data from non‐model species for phylogenetic resolution at various temporal scales. However, phylogenetic inference for many lineages dominated by non‐model species has not yet benefited from NGS, which can complement Sanger sequencing studies. One such lineage, whose phylogenetic relationships remain uncertain, is the diverse, agriculturally important and charismatic Coreoidea (Hemiptera: Heteroptera). Given the lack of consensus on higher‐level relationships and the importance of a robust phylogeny for evolutionary hypothesis testing, we use a large data set comprised of hundreds of ultraconserved element (UCE) loci to infer the phylogeny of Coreoidea (excluding Stenocephalidae and Hyocephalidae), with emphasis on the families Coreidae and Alydidae. We generated three data sets by including alignments that contained loci sampled for at least 50%, 60%, or 70% of the total taxa, and inferred phylogeny using maximum likelihood and summary coalescent methods. Twenty‐six external morphological features used in relatively comprehensive phylogenetic analyses of coreoids were also re‐evaluated within our molecular phylogenetic framework. We recovered 439–970 loci per species (16%–36% of loci targeted) and combined this with previously generated UCE data for 12 taxa. All data sets, regardless of analytical approach, yielded topologically similar and strongly supported trees, with the exception of outgroup relationships and the position of Hydarinae. We recovered a monophyletic Coreoidea, with Rhopalidae highly supported as the sister group to Alydidae + Coreidae. Neither Alydidae nor Coreidae were monophyletic; the coreid subfamilies Hydarinae and Pseudophloeinae were recovered as more closely related to Alydidae than to other coreid subfamilies. Coreinae were paraphyletic with respect to Meropachyinae. 
Most morphological traits were homoplastic with several clades defined by few, if any, synapomorphies. Our results demonstrate the utility of phylogenomic approaches in generating robust hypotheses for taxa with long‐standing phylogenetic problems and highlight that novel insights may come from such approaches.

     
  4. Abstract

    The family Mutillidae (Hymenoptera) is a species‐rich group of aculeate wasps that occur worldwide. The higher‐level classification of the family has historically been controversial due, in part, to the extreme sexual dimorphism exhibited by these insects and their morphological similarity to other wasp taxa that also have apterous females. Modern hypotheses on the internal higher classification of Mutillidae have been exclusively based on morphology and, further, they include Myrmosinae as a mutillid subfamily. In contrast, several molecular‐based family‐level studies of Aculeata recovered Myrmosinae as a nonmutillid taxon. To test the validity of these morphology‐based classifications and the phylogenetic placement of the controversial taxon Myrmosinae, a phylogenomic study of Mutillidae was conducted using ultraconserved elements (UCEs). All currently recognized subfamilies and tribes of Mutillidae were represented in this study using 140 ingroup taxa. The maximum likelihood criterion (ML) and the maximum parsimony criterion (MP) were used to infer the phylogenetic relationships within the family and related taxa using an aligned data set of 238,764 characters; the topologies of these respective analyses were largely congruent. The modern higher classification of Mutillidae, based on morphology, is largely congruent with the phylogenomic results of this study at the subfamily level, whereas the tribal classification is poorly supported. 
The subfamily Myrmosinae was recovered as sister to Sapygidae in the ML analysis and sister to Sapygidae + Pompilidae in the MP analysis; it is consequently raised to the family level, Myrmosidae, stat. nov. The two constituent tribes of Myrmosidae are raised to the subfamily level, Kudakrumiinae, stat. nov., and Myrmosinae, stat. nov. All four recognized tribes of Mutillinae were found to be non‐monophyletic; three additional mutilline clades were recovered in addition to Ctenotillini, Mutillini, Smicromyrmini, and Trogaspidiini sensu stricto. Three new tribes are erected for members of these clades: Pristomutillini Waldren, trib. nov., Psammothermini Waldren, trib. nov., and Zeugomutillini Waldren, trib. nov. All three recognized tribes of Sphaeropthalminae were found to be non‐monophyletic; six additional sphaeropthalmine clades were recovered in addition to Dasymutillini, Pseudomethocini, and Sphaeropthalmini sensu stricto. The subtribe Ephutina of Mutillinae: Mutillini was found to be polyphyletic, with the Ephuta genus‐group recovered within Sphaeropthalminae and the Odontomutilla genus‐group recovered as sister to Myrmillinae + Mutillinae. Consequently, the subtribe Ephutina is transferred from Mutillinae: Mutillini and is raised to a tribe within Sphaeropthalminae, Ephutini, stat. nov. Further, the taxon Odontomutillinae, stat. nov., is raised from a synonym of Ephutina to the subfamily level. The sphaeropthalmine tribe Pseudomethocini was found to be polyphyletic, with the subtribe Euspinoliina recovered as a separate clade in Sphaeropthalminae; consequently, Euspinoliina is raised to a tribe, Euspinoliini, stat. nov., in Sphaeropthalminae. The dasylabrine tribe Apteromutillini was recovered within Dasylabrini and is proposed as a new synonym of Dasylabrinae. Finally, dating analyses were conducted to infer the ages of the Pompiloidea families (Mutillidae, Myrmosidae, Pompilidae, and Sapygidae) and the ages of the Mutillidae subfamilies and tribes.

     
  5. Abstract

    Modern genomic techniques have enabled the generation of phylogenetic datasets of unprecedented scale. However, there are also troves of molecular data accumulated from past studies using Sanger sequencing, often at fine taxonomic scales. Combining both sources of data is an obviously appealing possibility, but it can also lead to inconsistency due to high levels of missing data, disparities in the scale of Sanger versus genomic datasets, and little overlap in sequences across terminals. To provide an empirical investigation of the potential of such ‘hybrid’ datasets, we combined data from ultraconserved elements (UCEs) for 183 species of Cryptini (Ichneumonidae, Hymenoptera) with a previously existing dataset of 7 loci and morphological data including 308 species plus outgroup taxa. Bioinformatics pipelines allowed recovery of ‘legacy’ markers from the bycatch of UCE sequencing, reducing the problem of limited character overlap. The resulting tree combining Sanger and UCE data is highly supported and includes dense taxon sampling of the group, allowing for a better understanding of the global radiation of Cryptini. The Neotropical region had the highest phylogenetic diversity but the lowest level of phylogenetic dispersion when corrected for standardized effect size, while the Oriental fauna showed the highest level of phylogenetic dispersion. Our results highlight the potential of hybrid datasets to produce a more complete picture of the Tree of Life combining affordability, robust support and deep taxonomic sampling.

     