skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Detecting and Removing Sample Contamination in Phylogenomic Data: An Example and its Implications for Cicadidae Phylogeny (Insecta: Hemiptera)
Abstract Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the causes(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted. One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini is reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]  more » « less
Award ID(s):
1655891
PAR ID:
10368842
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Systematic Biology
ISSN:
1063-5157
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Wiegmann, Brian (Ed.)
    Abstract Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.] 
    more » « less
  2. Abstract A phylogenomic analysis of the so far phylogenetically unresolved subfamily Bromelioideae (Bromeliaceae) was performed to infer species relationships as the basis for future taxonomic treatment, stabilization of generic concept, and further analyses of evolution and biogeography of the subfamily. A target‐enrichment approach was chosen, using the Angiosperms353 v.4 kit RNA‐baits and including 86 Bromelioideae species representing previously identified major evolutionary lineages. Phylogenetic analyses were based on 125 target nuclear loci, assembled off‐target plastome as well as mitogenome reads. A Bromelioideae phylogeny with a mostly well‐resolved backbone is provided based on nuclear (194 kbp), plastome (109 kbp), and mitogenome data (34 kbp). For the nuclear markers, a coalescent‐based analysis of single‐locus gene trees was performed as well as a supermatrix analysis of concatenated gene alignments. Nuclear and plastome datasets provide well‐resolved trees, which showed only minor topological incongruences. The mitogenome tree is not sufficiently resolved. A total of 26 well‐supported clades were identified. The generaAechmea,Canistrum,Hohenbergia,Neoregelia, andQuesneliawere revealed polyphyletic. In core Bromelioideae,Acanthostachysis sister to the remainder. Among the 26 recognized clades, 12 correspond with currently employed taxonomic concepts. Hence, the presented phylogenetic framework will serve as an important basis for future taxonomic revisions as well as to better understand the evolutionary drivers and processes in this exciting subfamily. 
    more » « less
  3. Abstract Snake venoms are complex mixtures of toxic proteins that hold significant medical, pharmacological and evolutionary interest. To better understand the genetic diversity underlying snake venoms, we developed VenomCap, a novel exon‐capture probe set targeting toxin‐coding genes from a wide range of elapid snakes, with a particular focus on the ecologically diverse and medically important subfamily Hydrophiinae. We tested the capture success of VenomCap across 24 species, representing all major elapid lineages. We included snake phylogenomic probes in the VenomCap capture set, allowing us to compare capture performance between venom and phylogenomic loci and to infer elapid phylogenetic relationships. We demonstrated VenomCap's ability to recover exons from ~1500 target markers, representing a total of 24 known venom gene families, which includes the dominant gene families found in elapid venoms. We find that VenomCap's capture results are robust across all elapids sampled, and especially among hydrophiines, with respect to measures of target capture success (target loci matched, sensitivity, specificity and missing data). As a cost‐effective and efficient alternative to full genome sequencing, VenomCap can dramatically accelerate the sequencing and analysis of venom gene families. Overall, our tool offers a model for genomic studies on snake venom gene diversity and evolution that can be expanded for comprehensive comparisons across the other families of venomous snakes. 
    more » « less
  4. Springer, Mark (Ed.)
    Abstract Despite the increasing feasibility of sequencing whole genomes from diverse taxa, a persistent problem in phylogenomics is the selection of appropriate genetic markers or loci for a given taxonomic group or research question. In this review, we aim to streamline the decision-making process when selecting specific markers to use in phylogenomic studies by introducing commonly used types of genomic markers, their evolutionary characteristics, and their associated uses in phylogenomics. Specifically, we review the utilities of ultraconserved elements (including flanking regions), anchored hybrid enrichment loci, conserved nonexonic elements, untranslated regions, introns, exons, mitochondrial DNA, single nucleotide polymorphisms, and anonymous regions (nonspecific regions that are evenly or randomly distributed across the genome). These various genomic elements and regions differ in their substitution rates, likelihood of neutrality or of being strongly linked to loci under selection, and mode of inheritance, each of which are important considerations in phylogenomic reconstruction. These features may give each type of marker important advantages and disadvantages depending on the biological question, number of taxa sampled, evolutionary timescale, cost effectiveness, and analytical methods used. We provide a concise outline as a resource to efficiently consider key aspects of each type of genetic marker. There are many factors to consider when designing phylogenomic studies, and this review may serve as a primer when weighing options between multiple potential phylogenomic markers. 
    more » « less
  5. O'Connell, Mary (Ed.)
    Abstract The data available for reconstructing molecular phylogenies have become wildly disparate. Phylogenomic studies can generate data for thousands of genetic markers for dozens of species, but for hundreds of other taxa, data may be available from only a few genes. Can these two types of data be integrated to combine the advantages of both, addressing the relationships of hundreds of species with thousands of genes? Here, we show that this is possible, using data from frogs. We generated a phylogenomic data set for 138 ingroup species and 3,784 nuclear markers (ultraconserved elements [UCEs]), including new UCE data from 70 species. We also assembled a supermatrix data set, including data from 97% of frog genera (441 total), with 1–307 genes per taxon. We then produced a combined phylogenomic–supermatrix data set (a “gigamatrix”) containing 441 ingroup taxa and 4,091 markers but with 86% missing data overall. Likelihood analysis of the gigamatrix yielded a generally well-supported tree among families, largely consistent with trees from the phylogenomic data alone. All terminal taxa were placed in the expected families, even though 42.5% of these taxa each had >99.5% missing data and 70.2% had >90% missing data. Our results show that missing data need not be an impediment to successfully combining very large phylogenomic and supermatrix data sets, and they open the door to new studies that simultaneously maximize sampling of genes and taxa. 
    more » « less