Abstract Premise: Target sequence capture (Hyb‐Seq) is a cost‐effective sequencing strategy that employs RNA probes to enrich for specific genomic sequences. By targeting conserved low‐copy orthologs, Hyb‐Seq enables efficient phylogenomic investigations. Here, we present Asparagaceae1726, a Hyb‐Seq probe set targeting 1726 low‐copy nuclear genes for phylogenomics in the angiosperm family Asparagaceae, which will aid the often‐challenging delineation and resolution of evolutionary relationships within Asparagaceae. Methods: Here we describe and validate the Asparagaceae1726 probe set (https://github.com/bentzpc/Asparagaceae1726) in six of the seven subfamilies of Asparagaceae. We perform phylogenomic analyses with these 1726 loci and evaluate how inclusion of paralogs and bycatch plastome sequences can enhance phylogenomic inference with target‐enriched data sets. Results: We recovered at least 82% of target orthologs from all sampled taxa, and phylogenomic analyses resulted in strong support for all subfamilial relationships. Additionally, topology and branch support were congruent between analyses with and without inclusion of target paralogs, suggesting that paralogs had limited effect on phylogenomic inference. Discussion: Asparagaceae1726 is effective across the family and enables the generation of robust data sets for phylogenomics of any Asparagaceae taxon. Asparagaceae1726 establishes a standardized set of loci for phylogenomic analysis in Asparagaceae, which we hope will be widely used for extensible and reproducible investigations of diversification in the family.
A New Pipeline for Removing Paralogs in Target Enrichment Data
Abstract Target enrichment (such as Hyb-Seq) is a well-established high-throughput sequencing method that has been increasingly used for phylogenomic studies. Unfortunately, the pipelines currently in wide use for analyzing target enrichment data lack a rigorous procedure for removing paralogs. In this study, we develop a pipeline we call Putative Paralogs Detection (PPD) to better address putative paralogs in enrichment data. The new pipeline is an add-on to the existing HybPiper pipeline, and the combined pipeline identifies paralogs using criteria based on both sequence similarity and heterozygous sites at each locus. Users may adjust the thresholds of sequence identity and heterozygous sites to identify and remove paralogs according to the level of phylogenetic divergence of their group of interest. The new pipeline also removes highly polymorphic sites attributed to errors in sequence assembly, as well as gappy regions in the alignment. We demonstrated the value of the new pipeline using empirical data generated from Hyb-Seq and the Angiosperm 353 kit for two woody genera, Castanea (Fagaceae, Fagales) and Hamamelis (Hamamelidaceae, Saxifragales). Comparisons of datasets showed that PPD identified many more putative paralogs than the popular method HybPiper. Comparisons of tree topologies and divergence times showed evident differences between data from HybPiper and data from our new PPD pipeline. We further evaluated the accuracy and error rates of PPD by BLAST mapping of putative paralogous and orthologous sequences to a reference genome sequence of Castanea mollissima. Compared to HybPiper alone, PPD identified substantially more paralogous gene sequences that mapped to multiple regions of the reference genome (31 genes for PPD compared with 4 genes for HybPiper alone). When PPD is used in conjunction with HybPiper, paralogous genes identified by both pipelines can be removed, resulting in more robust orthologous gene datasets for phylogenomic and divergence time analyses. Our study demonstrates the value of Hyb-Seq with data derived from the Angiosperm 353 probe set for elucidating species relationships within a genus, and argues for the importance of additional steps to filter paralogous genes and poorly aligned regions (e.g., as occur through assembly errors), such as our new PPD pipeline described in this study.
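The two user-adjustable criteria mentioned in the abstract (sequence identity and heterozygous sites per locus) can be illustrated with a short Python sketch. This is not the PPD code: the function names, the default thresholds (`min_identity=0.90`, `max_het=0.05`), the use of IUPAC ambiguity codes as a proxy for heterozygous sites, and the direction of each comparison are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the PPD implementation.
# Assumptions: sequences are already aligned, heterozygous sites are
# encoded as IUPAC ambiguity codes, and the thresholds are hypothetical.

IUPAC_HET = set("RYSWKMBDHV")  # ambiguity codes (N is treated as missing)
MISSING = set("-N?")           # gaps and uncalled bases


def heterozygous_fraction(seq: str) -> float:
    """Fraction of called sites that carry an ambiguity code."""
    called = [c for c in seq.upper() if c not in MISSING]
    if not called:
        return 0.0
    return sum(c in IUPAC_HET for c in called) / len(called)


def pairwise_identity(a: str, b: str) -> float:
    """Identity over aligned columns where both sequences have a called base."""
    pairs = [
        (x, y)
        for x, y in zip(a.upper(), b.upper())
        if x not in MISSING and y not in MISSING
    ]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def flag_putative_paralog(
    sample_seq: str,
    reference_seq: str,
    min_identity: float = 0.90,  # hypothetical threshold
    max_het: float = 0.05,       # hypothetical threshold
) -> bool:
    """Flag a locus when identity is too low or heterozygosity is too high."""
    low_identity = pairwise_identity(sample_seq, reference_seq) < min_identity
    high_het = heterozygous_fraction(sample_seq) > max_het
    return low_identity or high_het
```

In a sketch like this, a study of closely related species would presumably raise `min_identity` and lower `max_het`, while a deeper divergence would call for looser values; the appropriate settings for real data should be taken from the PPD documentation rather than from these placeholders.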
- Award ID(s): 1754376
- PAR ID: 10274850
- Date Published:
- Journal Name: Systematic Biology
- ISSN: 1063-5157
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Premise: A family‐specific probe set for sunflowers, Compositae‐1061, enables family‐wide phylogenomic studies and investigations at lower taxonomic levels, but may lack resolution at genus to species levels, especially in groups complicated by polyploidy and hybridization. Methods: We developed a Hyb‐Seq probe set, Compositae‐ParaLoss‐1272, that targets orthologous loci in Asteraceae. We tested its efficiency across the family by simulating target enrichment sequencing in silico. Additionally, we tested its effectiveness at lower taxonomic levels in the historically complex genus Packera. We performed Hyb‐Seq with Compositae‐ParaLoss‐1272 for 19 Packera taxa that were previously studied using Compositae‐1061. The resulting sequences from each probe set, plus a combination of both, were used to generate phylogenies, compare topologies, and assess node support. Results: We report that Compositae‐ParaLoss‐1272 captured loci across all tested Asteraceae members, had less gene tree discordance, and retained longer loci than Compositae‐1061. Most notably, Compositae‐ParaLoss‐1272 recovered substantially fewer paralogous sequences than Compositae‐1061, with only ~5% of the recovered loci reporting as paralogous, compared to ~59% with Compositae‐1061. Discussion: Given the complexity of plant evolutionary histories, assigning orthology for phylogenomic analyses will continue to be challenging. However, we anticipate Compositae‐ParaLoss‐1272 will provide improved resolution and utility for studies of complex groups and lower taxonomic levels in the sunflower family.
-
Abstract Gene duplication is a source of evolutionary novelty. DNA methylation may play a role in the evolution of duplicate genes (paralogs) through its association with gene expression. While this relationship has been examined to varying extents in a few individual species, the generalizability of these results at either a broad phylogenetic scale with species of differing duplication histories or across a population remains unknown. We applied a comparative epigenomic approach to 43 angiosperm species across the phylogeny and a population of 928 Arabidopsis (Arabidopsis thaliana) accessions, examining the association of DNA methylation with paralog evolution. Genic DNA methylation was differentially associated with duplication type, the age of duplication, sequence evolution, and gene expression. Whole-genome duplicates were typically enriched for CG-only gene body methylated or unmethylated genes, while single-gene duplications were typically enriched for non-CG methylated or unmethylated genes. Non-CG methylation, in particular, was a characteristic of more recent single-gene duplicates. Core angiosperm gene families were differentiated into those which preferentially retain paralogs and “duplication-resistant” families, which convergently reverted to singletons following duplication. Duplication-resistant families that still have paralogous copies were, uncharacteristically for core angiosperm genes, enriched for non-CG methylation. Non-CG methylated paralogs had higher rates of sequence evolution, higher frequency of presence–absence variation, and more limited expression. This suggests that silencing by non-CG methylation may be important to maintaining dosage following duplication and be a precursor to fractionation. Our results indicate that genic methylation marks differing evolutionary trajectories and fates between paralogous genes and has a role in maintaining dosage following duplication.
-
Premise: Hybrid capture with high‐throughput sequencing (Hyb‐Seq) is a powerful tool for evolutionary studies. The applicability of an Asteraceae family‐specific Hyb‐Seq probe set and the outcomes of different phylogenetic analyses are investigated here. Methods: Hyb‐Seq data from 112 Asteraceae samples were organized into groups at different taxonomic levels (tribe, genus, and species). For each group, data sets of non‐paralogous loci were built and proportions of parsimony informative characters estimated. The impacts of analyzing alternative data sets, removing long branches, and type of analysis on tree resolution and inferred topologies were investigated in tribe Cichorieae. Results: Alignments of the Asteraceae family‐wide Hyb‐Seq locus set were parsimony informative at all taxonomic levels. Levels of resolution and topologies inferred at shallower nodes differed depending on the locus data set and the type of analysis, and were affected by the presence of long branches. Discussion: The approach used to build a Hyb‐Seq locus data set influenced resolution and topologies inferred in phylogenetic analyses. Removal of long branches improved the reliability of topological inferences in maximum likelihood analyses. The Asteraceae Hyb‐Seq probe set is applicable at multiple taxonomic depths, which demonstrates that probe sets do not necessarily need to be lineage‐specific.
-
Premise: Phylogenetic relationships within major angiosperm clades are increasingly well resolved, but largely informed by plastid data. Areas of poor resolution persist within the Dipsacales, including placement of Heptacodium and Zabelia, and relationships within the Caprifolieae and Linnaeeae, hindering our interpretation of morphological evolution. Here, we sampled a significant number of nuclear loci using a Hyb‐Seq approach and used these data to infer the Dipsacales phylogeny and estimate divergence times. Methods: Sampling all major clades within the Dipsacales, we applied the Angiosperms353 probe set to 96 species. Data were filtered based on locus completeness and taxon recovery per locus, and trees were inferred using RAxML and ASTRAL. Plastid loci were assembled from off‐target reads, and 10 fossils were used to calibrate dated trees. Results: Varying numbers of targeted loci and off‐target plastomes were recovered from most taxa. Nuclear and plastid data confidently place Heptacodium with Caprifolieae, implying homoplasy in calyx morphology, ovary development, and fruit type. Placement of Zabelia, and relationships within the Caprifolieae and Linnaeeae, remain uncertain. Dipsacales diversification began earlier than suggested by previous angiosperm‐wide dating analyses, but many major splitting events date to the Eocene. Conclusions: The Angiosperms353 probe set facilitated the assembly of a large, single‐copy nuclear dataset for the Dipsacales. Nevertheless, many relationships remain unresolved, and resolution was poor for woody clades with low rates of molecular evolution. We favor expanding the Angiosperms353 probe set to include more variable loci and loci of special interest, such as developmental genes, within particular clades.

