Title: PoolParty2: An integrated pipeline for analysing pooled or indexed low‐coverage whole‐genome sequencing data to discover the genetic basis of diversity
Abstract Whole‐genome sequencing data allow survey of variation from across the genome, reducing the constraint of balancing genome sub‐sampling with estimating recombination rates and linkage between sampled markers and target loci. As sequencing costs decrease, low‐coverage whole‐genome sequencing of pooled or indexed‐individual samples is commonly utilized to identify loci associated with phenotypes or environmental axes in non‐model organisms. There are, however, relatively few publicly available bioinformatic pipelines designed explicitly to analyse these types of data, and fewer still that process the raw sequencing data, provide useful metrics of quality control and then execute analyses. Here, we present an updated version of a bioinformatics pipeline called PoolParty2 that can effectively handle either pooled or indexed DNA samples and includes new features to improve computational efficiency. Using simulated data, we demonstrate the ability of our pipeline to recover segregating variants, estimate their allele frequencies accurately, and identify genomic regions harbouring loci under selection. Based on the simulated data set, we benchmark the efficacy of our pipeline with another bioinformatic suite, angsd, and illustrate the compatibility and complementarity of these suites using angsd to generate genotype likelihoods as input for identifying linkage outlier regions using alignment files and variants provided by PoolParty2. Finally, we apply our updated pipeline to an empirical dataset of low‐coverage whole genomic data from population samples of Columbia River steelhead trout (Oncorhynchus mykiss), results from which demonstrate the genomic impacts of decades of artificial selection in a prominent hatchery stock. Thus, we not only demonstrate the utility of PoolParty2 for genomic studies that combine sequencing data from multiple individuals, but also illustrate how it complements other bioinformatics resources such as angsd.
Award ID(s):
1757324
PAR ID:
10495578
Publisher / Repository:
Wiley
Journal Name:
Molecular Ecology Resources
ISSN:
1755-098X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Nielsen, Rasmus (Ed.)
    Abstract Drosophila melanogaster is a leading model in population genetics and genomics, and a growing number of whole-genome data sets from natural populations of this species have been published over the last years. A major challenge is the integration of disparate data sets, often generated using different sequencing technologies and bioinformatic pipelines, which hampers our ability to address questions about the evolution of this species. Here we address these issues by developing a bioinformatics pipeline that maps pooled sequencing (Pool-Seq) reads from D. melanogaster to a hologenome consisting of fly and symbiont genomes and estimates allele frequencies using either a heuristic (PoolSNP) or a probabilistic variant caller (SNAPE-pooled). We use this pipeline to generate the largest data repository of genomic data available for D. melanogaster to date, encompassing 271 previously published and unpublished population samples from over 100 locations in >20 countries on four continents. Several of these locations have been sampled at different seasons across multiple years. This data set, which we call Drosophila Evolution over Space and Time (DEST), is coupled with sampling and environmental metadata. A web-based genome browser and web portal provide easy access to the SNP data set. We further provide guidelines on how to use Pool-Seq data for model-based demographic inference. Our aim is to provide this scalable platform as a community resource which can be easily extended via future efforts for an even more extensive cosmopolitan data set. Our resource will enable population geneticists to analyze spatiotemporal genetic patterns and evolutionary dynamics of D. melanogaster populations in unprecedented detail. 
  2. Abstract The capability to generate densely sampled single nucleotide polymorphism (SNP) data is essential in diverse subdisciplines of biology, including crop breeding, pathology, forensics, forestry, ecology, evolution and conservation. However, the wet‐laboratory expertise and bioinformatics training required to conduct genome‐scale variant discovery remain limiting factors for investigators with limited resources. Here we present ISSRseq, a PCR‐based method for reduced representation of genomic variation using simple sequence repeats as priming sites to sequence inter simple sequence repeat (ISSR) regions. Briefly, ISSR regions are amplified with single primers, pooled, used to construct sequencing libraries with a commercially available kit, and sequenced on the Illumina platform. We also present a flexible bioinformatic pipeline that assembles ISSR loci, calls and hard filters variants, outputs data matrices in common formats, and conducts population analyses using R. Using three angiosperm species as case studies, we demonstrate that ISSRseq is highly repeatable, necessitates only simple wet‐laboratory skills and commonplace instrumentation, is flexible in terms of the number of single primers used, and can generate genomic‐scale variant discovery on par with existing RRS methods which require more complex wet‐laboratory procedures. ISSRseq represents a straightforward approach to SNP genotyping in any organism, and we predict that this method will be particularly useful for those studying population genomics and phylogeography of non‐model organisms. Furthermore, the ease of ISSRseq relative to other RRS methods should prove useful to those lacking advanced expertise in wet‐laboratory methods or bioinformatics.
  3. Abstract Multi‐locus sequence data are widely used in fungal systematic and taxonomic studies to delimit species and infer evolutionary relationships. We developed and assessed the efficacy of a multi‐locus pooled sequencing method using PacBio long‐read high‐throughput sequencing. Samples included fresh and dried voucher specimens, cultures and archival DNA extracts of Agaricomycetes with an emphasis on the order Cantharellales. Of the 283 specimens sequenced, 93.6% successfully amplified at one or more loci with a mean of 3.3 loci amplified. Our method recovered multiple sequence variants representing alleles of rDNA loci and single copy protein‐coding genes rpb1, rpb2 and tef1. Within‐sample genetic variation differed by locus and taxonomic group, with the greatest genetic divergence observed among sequence variants of rpb2 and tef1 from corticioid Cantharellales. Our method is a cost‐effective approach for generating accurate multi‐locus sequence data coupled with recovery of alleles from polymorphic samples and multi‐organism specimens. These results have important implications for understanding intra‐individual genomic variation among genetic loci commonly used in species delimitation of fungi.
  4. Abstract The development of next-generation sequencing (NGS) enabled a shift from array-based genotyping to directly sequencing genomic libraries for high-throughput genotyping. Even though whole-genome sequencing was initially too costly for routine analysis in large populations such as breeding or genetic studies, continued advancements in genome sequencing and bioinformatics have provided the opportunity to capitalize on whole-genome information. As new sequencing platforms can routinely provide high-quality sequencing data for sufficient genome coverage to genotype various breeding populations, a limitation comes in the time and cost of library construction when multiplexing a large number of samples. Here we describe a high-throughput whole-genome skim-sequencing (skim-seq) approach that can be utilized for a broad range of genotyping and genomic characterization. Using optimized low-volume Illumina Nextera chemistry, we developed a skim-seq method and combined up to 960 samples in one multiplex library using dual index barcoding. With the dual-index barcoding, the number of samples for multiplexing can be adjusted depending on the amount of data required, and could be extended to 3,072 samples or more. Panels of doubled haploid wheat lines (Triticum aestivum, CDC Stanley x CDC Landmark), wheat-barley (T. aestivum x Hordeum vulgare) and wheat-wheatgrass (Triticum durum x Thinopyrum intermedium) introgression lines as well as known monosomic wheat stocks were genotyped using the skim-seq approach. Bioinformatics pipelines were developed for various applications where sequencing coverage ranged from 1× down to 0.01× per sample. Using reference genomes, we detected chromosome dosage, identified aneuploidy, and karyotyped introgression lines from the skim-seq data. Leveraging the recent advancements in genome sequencing, skim-seq provides an effective and low-cost tool for routine genotyping and genetic analysis, which can track and identify introgressions and genomic regions of interest in genetics research and applied breeding programs.
  5. Abstract Over the past few decades, there has been an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining data sets to achieve unprecedented sample sizes, spatial coverage or temporal replication in population genomic studies. However, a common concern is that nonbiological differences between data sets may generate patterns of variation in the data that can confound real biological patterns, a problem known as batch effects. In this paper, we compare two batches of low‐coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch‐effect‐naive” bioinformatic pipeline, batch effects systematically biased our genetic diversity estimates, population structure inference and selection scans. We then demonstrate that these batch effects resulted from multiple technical differences between our data sets, including the sequencing chemistry (four‐channel vs. two‐channel), sequencing run, read type (single‐end vs. paired‐end), read length (125 vs. 150 bp), DNA degradation level (degraded vs. well preserved) and sequencing depth (0.8× vs. 0.3× on average). Lastly, we illustrate that a set of simple bioinformatic strategies (such as different read trimming and single nucleotide polymorphism filtering) can be used to detect batch effects in our data and substantially mitigate their impact. We conclude that combining data sets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies. 