

Title: Exon probe sets and bioinformatics pipelines for all levels of fish phylogenomics
Abstract

Exon markers have a long history of use in phylogenetics of ray‐finned fishes, the most diverse clade of vertebrates with more than 35,000 species. As the number of published genomes increases, it has become easier to test exons and other genetic markers for signals of ancient duplication events and filter out paralogues that can mislead phylogenetic analysis. We present seven new probe sets for current target‐capture phylogenomic protocols that capture 1,104 exons explicitly filtered for paralogues using gene trees. These seven probe sets span the diversity of teleost fishes, including four sets that target five hyperdiverse percomorph clades which together comprise ca. 17,000 species (Carangaria, Ovalentaria, Eupercaria, and Syngnatharia + Pelagiaria combined). We additionally included probes to capture legacy nuclear exons and mitochondrial markers that have been commonly used in fish phylogenetics (despite some exons being flagged for paralogues) to facilitate integration of old and new molecular phylogenetic matrices. We tested these probes experimentally for 56 fish species (eight species per probe set) and merged new exon‐capture sequence data into an existing data matrix of 1,104 exons and 300 ray‐finned fish species. We provide an optimized bioinformatics pipeline to assemble exon capture data from raw reads to alignments for downstream analysis. We show that legacy loci with known paralogues are at risk of assembling duplicated sequences with target‐capture, but we also assembled many useful orthologous sequences that can be integrated with many PCR‐generated matrices. These probe sets are a valuable resource for advancing fish phylogenomics because targeted exons can easily be extracted from increasingly available whole genome and transcriptome data sets, and also may be integrated with existing PCR‐based exon and mitochondrial data.

 
NSF-PAR ID:
10454431
Publisher / Repository:
Wiley-Blackwell
Journal Name:
Molecular Ecology Resources
Volume:
21
Issue:
3
ISSN:
1755-098X
Page Range / eLocation ID:
p. 816-833
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Custom sequence capture experiments are becoming an efficient approach for gathering large sets of orthologous markers in nonmodel organisms. Transcriptome‐based exon capture utilizes transcript sequences to design capture probes, typically using a reference genome to identify intron–exon boundaries to exclude shorter exons (<200 bp). Here, we test directly using transcript sequences for probe design, which are often composed of multiple exons of varying lengths. Using 1260 orthologous transcripts, we conducted sequence captures across multiple phylogenetic scales for frogs, including outgroups ~100 Myr divergent from the ingroup. We recovered a large phylogenomic data set consisting of sequence alignments for 1047 of the 1260 transcriptome‐based loci (~561 000 bp) and a large quantity of highly variable regions flanking the exons in transcripts (~70 000 bp), the latter improving substantially by only including ingroup species (~797 000 bp). We recovered both shorter (<100 bp) and longer exons (>200 bp), with no major reduction in coverage towards the ends of exons. We observed significant differences in the performance of blocking oligos for target enrichment and nontarget depletion during captures, and differences in PCR duplication rates resulting from the number of individuals pooled for capture reactions. We explicitly tested the effects of phylogenetic distance on capture sensitivity, specificity, and missing data, and provide a baseline estimate of expectations for these metrics based on a priori knowledge of nuclear pairwise differences among samples. We provide recommendations for transcriptome‐based exon capture design based on our results, cost estimates and offer multiple pipelines for data assembly and analysis.

     
  2. Abstract

    Sequence capture studies result in rich data sets comprising hundreds to thousands of targeted genomic regions that are superseding Sanger‐based data sets comprised of a few well‐known loci with historical uses in phylogenetics (‘legacy loci’). However, integrating sequence capture and Sanger‐based data sets is of interest as legacy loci can include different types of loci (e.g. mitochondrial and nuclear) across a potentially larger sample of species from past studies. Sequence capture data sets include nontargeted sequences, and there has been recent interest in extracting legacy loci from invertebrate data sets. Here, we use published legacy data from leaf‐footed bugs (Hemiptera: Coreoidea) to recover 15 mitochondrial and seven nuclear legacy loci from off‐target sequences in a sequence capture data set, explore approaches to improve legacy locus recovery, and combine these loci with sequence capture data for phylogenetic analysis. Two nuclear loci were determined to already be targeted by sequence capture baits. Most of the remaining loci were successfully recovered from off‐target sequences, but this recovery varied greatly. Additionally, complementing complete mitogenomes with additional reference mitochondrial sequences from a genetic depository did not offer improvement for most of our taxa; however, supplementing these reference sequences with extracted legacy loci offered ≥6% improvement across taxa for a given mitochondrial locus (negligible improvement for nuclear loci). Phylogenetic analysis of legacy and sequence capture data produced a topology generally congruent with recent studies, but support was lower. Thus, future studies may employ the approaches used in this study to integrate legacy data with newly generated sequence capture data sets without added expenses.

     
  3. Abstract

    Phylogenomic analysis of large genome-wide sequence data sets can resolve phylogenetic tree topologies for large species groups, help test the accuracy of and improve resolution for earlier multi-locus studies and reveal the level of agreement or concordance within partitions of the genome for various tree topologies. Here we used a target-capture approach to sequence 1088 single-copy exons for more than 200 labrid fishes together with more than 100 outgroup taxa to generate a new data-rich phylogeny for the family Labridae. Our time-calibrated phylogenetic analysis of exon-capture data pushes the root node age of the family Labridae back into the Cretaceous to about 79 Ma. The monotypic Centrogenys vaigiensis and the order Uranoscopiformes (stargazers) are identified as the sister lineages of Labridae. The phylogenetic relationships among major labrid subfamilies and within these clades were largely congruent with prior analyses of select mitochondrial and nuclear datasets. However, the position of the tribe Cirrhilabrini (fairy and flame wrasses) showed discordance, resolving either as the sister to a crown julidine clade or alternatively sister to a group formed by the labrines, cheilines and scarines. Exploration of this pattern using multiple approaches leads to slightly higher support for this latter hypothesis, highlighting the importance of genome-level data sets for resolving short internodes at key phylogenetic positions in a large, economically important group of coral reef fishes. More broadly, we demonstrate how accounting for sources of biological variability from incomplete lineage sorting and exploring systematic error at conflicting nodes can aid in evaluating alternative phylogenetic hypotheses. [coral reefs; divergence time estimation; exon-capture; fossil calibration; incomplete lineage sorting.]

     
  4. Wiegmann, Brian (Ed.)
    Abstract Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.] 
  5. Abstract Marker selection has emerged as an important component of phylogenomic study design due to rising concerns of the effects of gene tree estimation error, model misspecification, and data-type differences. Researchers must balance various trade-offs associated with locus length and evolutionary rate among other factors. The most commonly used reduced representation data sets for phylogenomics are ultraconserved elements (UCEs) and Anchored Hybrid Enrichment (AHE). Here, we introduce Rapidly Evolving Long Exon Capture (RELEC), a new set of loci that targets single exons that are both rapidly evolving (evolutionary rate faster than RAG1) and relatively long in length (>1,500 bp), while at the same time avoiding paralogy issues across amniotes. We compare the RELEC data set to UCEs and AHE in squamate reptiles by aligning and analyzing orthologous sequences from 17 squamate genomes, composed of 10 snakes and 7 lizards. The RELEC data set (179 loci) outperforms AHE and UCEs by maximizing per-locus genetic variation while maintaining presence and orthology across a range of evolutionary scales. RELEC markers show higher phylogenetic informativeness than UCE and AHE loci, and RELEC gene trees show greater similarity to the species tree than AHE or UCE gene trees. Furthermore, with fewer loci, RELEC remains computationally tractable for full Bayesian coalescent species tree analyses. We contrast RELEC to and discuss important aspects of comparable methods, and demonstrate how RELEC may be the most effective set of loci for resolving difficult nodes and rapid radiations. We provide several resources for capturing or extracting RELEC loci from other amniote groups. 