
Title: Detecting and Removing Sample Contamination in Phylogenomic Data: An Example and its Implications for Cicadidae Phylogeny (Insecta: Hemiptera)
Abstract

Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the cause(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted.
One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini is reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]
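The bimodality idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `flag_near_zero_loci`, the thresholds `near_zero` and `min_gap_factor`, and the largest-gap heuristic are all assumptions made for illustration. The sketch assumes that the patristic distance between one pair of samples has already been computed from each locus's gene tree.

```python
from typing import Dict, List


def flag_near_zero_loci(distances: Dict[str, float],
                        near_zero: float = 0.01,
                        min_gap_factor: float = 5.0) -> List[str]:
    """Return loci in a near-zero peak if the distances look bimodal.

    `distances` maps a locus name to the patristic distance between two
    samples in that locus's gene tree. Thresholds are illustrative only.
    """
    if len(distances) < 3:
        return []
    ordered = sorted(distances.items(), key=lambda kv: kv[1])
    values = [d for _, d in ordered]
    # Split the sorted distances at the widest gap between neighbours.
    gaps = [values[i + 1] - values[i] for i in range(len(values) - 1)]
    split = max(range(len(gaps)), key=gaps.__getitem__)
    low_peak_max = values[split]
    high_peak_min = values[split + 1]
    # Treat the pair as bimodal only when the lower cluster sits near zero
    # and the upper cluster is clearly separated from it.
    if low_peak_max <= near_zero and high_peak_min >= min_gap_factor * near_zero:
        return [locus for locus, _ in ordered[:split + 1]]
    return []


if __name__ == "__main__":
    example = {"L1": 0.001, "L2": 0.12, "L3": 0.0, "L4": 0.15, "L5": 0.11}
    print(flag_near_zero_loci(example))
```

In this toy input, two loci have distances near zero while the rest cluster around 0.1, so only those two are returned as candidates for exclusion. A pair of samples whose distances are all small (for example, genuinely close relatives) returns nothing, because there is no second, well-separated peak.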

Award ID(s):
1655891
NSF-PAR ID:
10373416
Journal Name:
Systematic Biology
Volume:
71
Issue:
6
Page Range or eLocation-ID:
p. 1504-1523
ISSN:
1063-5157
Publisher:
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The infraorder Mygalomorphae is one of the three main lineages of spiders comprising over 3000 nominal species. This ancient group has a worldwide distribution that includes among its ranks large and charismatic taxa such as tarantulas, trapdoor spiders, and highly venomous funnel-web spiders. Based on past molecular studies using Sanger-sequencing approaches, numerous mygalomorph families (e.g., Hexathelidae, Ctenizidae, Cyrtaucheniidae, Dipluridae, and Nemesiidae) have been identified as non-monophyletic. However, these data were unable to sufficiently resolve the higher-level (intra- and interfamilial) relationships such that the necessary changes in classification could be made with confidence. Here, we present a comprehensive phylogenomic treatment of the spider infraorder Mygalomorphae. We employ 472 loci obtained through anchored hybrid enrichment to reconstruct relationships among all the mygalomorph spider families and estimate the timeframe of their diversification. We sampled nearly all currently recognized families, which has allowed us to assess their status, and as a result, propose a new classification scheme. Our generic-level sampling has also provided an evolutionary framework for revisiting questions regarding silk use in mygalomorph spiders. The first such analysis for the group within a strict phylogenetic framework shows that a sheet web is likely the plesiomorphic condition for mygalomorphs, as well as providing insights into the ancestral foraging behavior for all spiders. Our divergence time estimates, concomitant with detailed biogeographic analysis, suggest that both ancient continental-level vicariance and more recent dispersal events have played an important role in shaping modern day distributional patterns. Based on our results, we relimit the generic composition of the Ctenizidae, Cyrtaucheniidae, Dipluridae, and Nemesiidae.
We also elevate five subfamilies to family rank: Anamidae (NEW RANK), Euagridae (NEW RANK), Ischnothelidae (NEW RANK), Pycnothelidae (NEW RANK), and Bemmeridae (NEW RANK). Three families Entypesidae (NEW FAMILY), Microhexuridae (NEW FAMILY), and Stasimopidae (NEW FAMILY), and one subfamily Australothelinae (NEW SUBFAMILY) are newly proposed. Such a major rearrangement in classification, recognizing nine newly established family-level rank taxa, is the largest the group has seen in over three decades. [Biogeography; molecular clocks; phylogenomics; spider web foraging; taxonomy.]

  2. Wiegmann, Brian (Ed.)
    Abstract Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.]
  3. Abstract Estimating multiple sequence alignments (MSAs) and inferring phylogenies are essential for many aspects of comparative biology. Yet, many bioinformatics tools for such analyses have focused on specific clades, with greatest attention paid to plants, animals, and fungi. The rapid increase in high-throughput sequencing (HTS) data from diverse lineages now provides opportunities to estimate evolutionary relationships and gene family evolution across the eukaryotic tree of life. At the same time, these types of data are known to be error-prone (e.g., substitutions, contamination). To address these opportunities and challenges, we have refined a phylogenomic pipeline, now named PhyloToL, to allow easy incorporation of data from HTS studies, to automate production of both MSAs and gene trees, and to identify and remove contaminants. PhyloToL is designed for phylogenomic analyses of diverse lineages across the tree of life (i.e., at scales of >100 My). We demonstrate the power of PhyloToL by assessing stop codon usage in Ciliophora, identifying contamination in a taxon- and gene-rich database and exploring the evolutionary history of chromosomes in the kinetoplastid parasite Trypanosoma brucei, the causative agent of African sleeping sickness. Benchmarking PhyloToL’s homology assessment against that of OrthoMCL and a published paper on superfamilies of bacterial and eukaryotic organellar outer membrane pore-forming proteins demonstrates the power of our approach for determining gene family membership and inferring gene trees. PhyloToL is highly flexible and allows users to easily explore HTS data, test hypotheses about phylogeny and gene family evolution and combine outputs with third-party tools (e.g., PhyloChromoMap, iGTP).
  4. A molecular phylogeny and a review of family-group classification are presented for 137 species (ca. 125 genera) of the insect family Cicadidae, the true cicadas, plus two species of hairy cicadas (Tettigarctidae) and two outgroup species from Cercopidae. Five genes, two of them mitochondrial, comprise the 4992 base-pair molecular dataset. Maximum-likelihood and Bayesian phylogenetic results are shown, including analyses to address potential base composition bias. Tettigarcta is confirmed as the sister-clade of the Cicadidae and support is found for three subfamilies identified in an earlier morphological cladistic analysis. A set of paraphyletic deep-level clades formed by African genera are together named as Tettigomyiinae n. stat. Taxonomic reassignments of genera and tribes are made where morphological examination confirms incorrect placements suggested by the molecular tree, and 11 new tribes are defined (Arenopsaltriini n. tribe, Durangonini n. tribe, Katoini n. tribe, Lacetasini n. tribe, Macrotristriini n. tribe, Malagasiini n. tribe, Nelcyndanini n. tribe, Pagiphorini n. tribe, Pictilini n. tribe, Psaltodini n. tribe, and Selymbriini n. tribe). Tribe Tacuini n. syn. is synonymized with Cryptotympanini, and Tryellina n. syn. is synonymized with an expanded Tribe Lamotialnini. Tribe Hyantiini n. syn. is synonymized with Fidicinini. Tribe Sinosenini is transferred to Cicadinae from Cicadettinae, Cicadatrini is moved to Cicadettinae from Cicadinae, and Ydiellini and Tettigomyiini are transferred to Tettigomyiinae n. stat. from Cicadettinae. While the subfamily Cicadinae, historically defined by the presence of timbal covers, is weakly supported in the molecular tree, high taxonomic rank is not supported for several earlier clades based on unique morphology associated with sound production.
  5. Abstract Marker selection has emerged as an important component of phylogenomic study design due to rising concerns of the effects of gene tree estimation error, model misspecification, and data-type differences. Researchers must balance various trade-offs associated with locus length and evolutionary rate among other factors. The most commonly used reduced representation data sets for phylogenomics are ultraconserved elements (UCEs) and Anchored Hybrid Enrichment (AHE). Here, we introduce Rapidly Evolving Long Exon Capture (RELEC), a new set of loci that targets single exons that are both rapidly evolving (evolutionary rate faster than RAG1) and relatively long in length (>1,500 bp), while at the same time avoiding paralogy issues across amniotes. We compare the RELEC data set to UCEs and AHE in squamate reptiles by aligning and analyzing orthologous sequences from 17 squamate genomes, composed of 10 snakes and 7 lizards. The RELEC data set (179 loci) outperforms AHE and UCEs by maximizing per-locus genetic variation while maintaining presence and orthology across a range of evolutionary scales. RELEC markers show higher phylogenetic informativeness than UCE and AHE loci, and RELEC gene trees show greater similarity to the species tree than AHE or UCE gene trees. Furthermore, with fewer loci, RELEC remains computationally tractable for full Bayesian coalescent species tree analyses. We contrast RELEC to and discuss important aspects of comparable methods, and demonstrate how RELEC may be the most effective set of loci for resolving difficult nodes and rapid radiations. We provide several resources for capturing or extracting RELEC loci from other amniote groups.