Exon probe sets and bioinformatics pipelines for all levels of fish phylogenomics

Hughes, Lily C. (ORCID:0000000340064036); Ortí, Guillermo; Saad, Hadeel; Li, Chenhong (ORCID:0000000330751756); White, William T.; Baldwin, Carole C.; Crandall, Keith A. (ORCID:0000000208363389); Arcila, Dahiana; Betancur‐R, Ricardo

doi:10.1111/1755-0998.13287

Abstract

Exon markers have a long history of use in phylogenetics of ray‐finned fishes, the most diverse clade of vertebrates with more than 35,000 species. As the number of published genomes increases, it has become easier to test exons and other genetic markers for signals of ancient duplication events and to filter out paralogues that can mislead phylogenetic analysis. We present seven new probe sets for current target‐capture phylogenomic protocols that capture 1,104 exons explicitly filtered for paralogues using gene trees. These seven probe sets span the diversity of teleost fishes, including four sets that target five hyperdiverse percomorph clades which together comprise ca. 17,000 species (Carangaria, Ovalentaria, Eupercaria, and Syngnatharia + Pelagiaria combined). We additionally included probes to capture legacy nuclear exons and mitochondrial markers that have been commonly used in fish phylogenetics (despite some exons being flagged for paralogues) to facilitate integration of old and new molecular phylogenetic matrices. We tested these probes experimentally for 56 fish species (eight species per probe set) and merged the new exon‐capture sequence data into an existing data matrix of 1,104 exons and 300 ray‐finned fish species. We provide an optimized bioinformatics pipeline to assemble exon‐capture data from raw reads to alignments for downstream analysis. We show that legacy loci with known paralogues are at risk of yielding duplicated sequences when assembled from target‐capture data, but we also assembled many useful orthologous sequences that can be integrated with existing PCR‐generated matrices. These probe sets are a valuable resource for advancing fish phylogenomics because the targeted exons can easily be extracted from increasingly available whole‐genome and transcriptome data sets, and may also be integrated with existing PCR‐based exon and mitochondrial data.
