Title: A beginner's guide to low‐coverage whole genome sequencing for population genomics
Abstract

Low‐coverage whole genome sequencing (lcWGS) has emerged as a powerful and cost‐effective approach for population genomic studies in both model and nonmodel species. However, with read depths too low to confidently call individual genotypes, lcWGS requires specialized analysis tools that explicitly account for genotype uncertainty. A growing number of such tools have become available, but it can be difficult to get an overview of what types of analyses can be performed reliably with lcWGS data, and how the distribution of sequencing effort between the number of samples analysed and per‐sample sequencing depths affects inference accuracy. In this introductory guide to lcWGS, we first illustrate how the per‐sample cost for lcWGS is now comparable to RAD‐seq and Pool‐seq in many systems. We then provide an overview of software packages that explicitly account for genotype uncertainty in different types of population genomic inference. Next, we use both simulated and empirical data to assess the accuracy of allele frequency, genetic diversity, and linkage disequilibrium estimation, detection of population structure, and selection scans under different sequencing strategies. Our results show that spreading a given amount of sequencing effort across more samples with lower depth per sample consistently improves the accuracy of most types of inference, with a few notable exceptions. Finally, we assess the potential for using imputation to bolster inference from lcWGS data in nonmodel species, and discuss current limitations and future perspectives for lcWGS‐based population genomics research. With this overview, we hope to make lcWGS more approachable and stimulate its broader adoption.
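To illustrate why low read depths call for methods that account for genotype uncertainty, the following is a minimal sketch of genotype likelihoods at a single biallelic site. It is not taken from the paper or from any particular software package: the function names, the flat per-base error rate of 0.01, and the simple symmetric error model are all assumptions made for illustration.

```python
def genotype_likelihoods(bases, ref, alt, error=0.01):
    """Return P(observed bases | genotype) for the three diploid genotypes.

    Assumed model (illustrative only): each read samples one of the two
    alleles with probability 0.5, and a base is misread as any one of the
    three other bases with probability error / 3.
    """
    def p_base_given_allele(base, allele):
        return 1.0 - error if base == allele else error / 3.0

    genotypes = {
        "ref/ref": (ref, ref),
        "ref/alt": (ref, alt),
        "alt/alt": (alt, alt),
    }
    likelihoods = {}
    for name, (a1, a2) in genotypes.items():
        lik = 1.0
        for base in bases:
            # Average over which of the two alleles the read came from.
            lik *= 0.5 * p_base_given_allele(base, a1) \
                 + 0.5 * p_base_given_allele(base, a2)
        likelihoods[name] = lik
    return likelihoods


def normalize(likelihoods):
    """Rescale likelihoods to sum to 1 (equivalent to a flat genotype prior)."""
    total = sum(likelihoods.values())
    return {g: lik / total for g, lik in likelihoods.items()}


# One read supporting the reference allele versus ten such reads.
single_read = normalize(genotype_likelihoods("A", ref="A", alt="G"))
ten_reads = normalize(genotype_likelihoods("A" * 10, ref="A", alt="G"))
```

Under these assumptions, a single read leaves the heterozygote with about one third of the normalized likelihood, so a hard genotype call would discard real uncertainty; with ten concordant reads the heterozygote becomes very unlikely. Carrying the full likelihoods forward, rather than a called genotype, is the general idea behind the tools the abstract refers to.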

 
Award ID(s):
1756316
NSF-PAR ID:
10370252
Publisher / Repository:
Wiley-Blackwell
Journal Name:
Molecular Ecology
Volume:
30
Issue:
23
ISSN:
0962-1083
Page Range / eLocation ID:
p. 5966-5993
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The development of next-generation sequencing (NGS) enabled a shift from array-based genotyping to directly sequencing genomic libraries for high-throughput genotyping. Even though whole-genome sequencing was initially too costly for routine analysis in large populations such as breeding or genetic studies, continued advancements in genome sequencing and bioinformatics have provided the opportunity to capitalize on whole-genome information. As new sequencing platforms can routinely provide high-quality sequencing data for sufficient genome coverage to genotype various breeding populations, a limitation comes in the time and cost of library construction when multiplexing a large number of samples. Here we describe a high-throughput whole-genome skim-sequencing (skim-seq) approach that can be utilized for a broad range of genotyping and genomic characterization. Using optimized low-volume Illumina Nextera chemistry, we developed a skim-seq method and combined up to 960 samples in one multiplex library using dual-index barcoding. With the dual-index barcoding, the number of samples for multiplexing can be adjusted depending on the amount of data required, and could be extended to 3,072 samples or more. Panels of doubled haploid wheat lines (Triticum aestivum, CDC Stanley × CDC Landmark), wheat-barley (T. aestivum × Hordeum vulgare) and wheat-wheatgrass (Triticum durum × Thinopyrum intermedium) introgression lines as well as known monosomic wheat stocks were genotyped using the skim-seq approach. Bioinformatics pipelines were developed for various applications where sequencing coverage ranged from 1× down to 0.01× per sample. Using reference genomes, we detected chromosome dosage, identified aneuploidy, and karyotyped introgression lines from the skim-seq data. Leveraging the recent advancements in genome sequencing, skim-seq provides an effective and low-cost tool for routine genotyping and genetic analysis, which can track and identify introgressions and genomic regions of interest in genetics research and applied breeding programs.

     
  2. Abstract

    Monitoring genetic diversity in wild populations is a central goal of ecological and evolutionary genetics and is critical for conservation biology. However, genetic studies of nonmodel organisms generally lack access to species‐specific genotyping methods (e.g. array‐based genotyping) and must instead use sequencing‐based approaches. Although costs are decreasing, high‐coverage whole‐genome sequencing (WGS), which produces the highest confidence genotypes, remains expensive. More economical reduced representation sequencing approaches fail to capture much of the genome, which can hinder downstream inference. Low‐coverage WGS combined with imputation using a high‐confidence reference panel is a cost‐effective alternative, but the accuracy of genotyping using low‐coverage WGS and imputation in nonmodel populations is still largely uncharacterized. Here, we empirically tested the accuracy of low‐coverage sequencing (0.1–10×) and imputation in two natural populations, one with a large (n = 741) reference panel, rhesus macaques (Macaca mulatta), and one with a smaller (n = 68) reference panel, gelada monkeys (Theropithecus gelada). Using samples sequenced to coverage as low as 0.5×, we could impute genotypes at >95% of the sites in the reference panel with high accuracy (median r² ≥ 0.92). We show that low‐coverage imputed genotypes can be used to reliably estimate genetic relatedness and population structure. Based on these data, we also provide best practices and recommendations for researchers who wish to deploy this approach in other populations, with all code available on GitHub (https://github.com/mwatowich/LoCSI-for-non-model-species). Our results support accurate and effective genotype imputation from low‐coverage sequencing, enabling the cost‐effective generation of population‐scale genetic datasets necessary for tackling many pressing challenges of wildlife conservation.

     
  3. Abstract

    Minimally invasive samples are often the best option for collecting genetic material from species of conservation concern, but they perform poorly in many genomic sequencing methods due to their tendency to yield low DNA quality and quantity. Genotyping‐in‐thousands by sequencing (GT‐seq) is a powerful amplicon sequencing method that can genotype large numbers of variable‐quality samples at a standardized set of single nucleotide polymorphism (SNP) loci. Here, we develop, optimize, and validate a GT‐seq panel for the federally threatened northern Idaho ground squirrel (Urocitellus brunneus) to provide a standardized approach for future genetic monitoring and assessment of recovery goals using minimally invasive samples. The optimized panel consists of 224 neutral and 81 putatively adaptive SNPs. DNA collected from buccal swabs from 2016 to 2020 had 73% genotyping success, while samples collected from hair from 2002 to 2006 had little to no DNA remaining and did not genotype successfully. We evaluated our GT‐seq panel by measuring genotype discordance rates compared to RADseq and whole‐genome sequencing. GT‐seq and other sequencing methods had similar population diversity and FST estimates, but GT‐seq consistently called more heterozygotes than expected, resulting in negative FIS values at the population level. Genetic ancestry assignment was consistent when estimated with different sequencing methods and numbers of loci. Our GT‐seq panel is an effective and efficient genotyping tool that will aid in the monitoring and recovery of this threatened species, and our results provide insights for applying GT‐seq for minimally invasive DNA sampling techniques in other rare animals.

     
  4. Abstract

    Genomic methods are becoming increasingly valuable and established in ecological research, particularly in nonmodel species. Supporting their progress and adoption requires investment in resources that promote (i) reproducibility of genomic analyses, (ii) accessibility of learning tools and (iii) keeping pace with rapidly developing methods and principles.

    We introduce marineomics.io, an open‐source, living document to disseminate tutorials, reproducibility tools and best principles for ecological genomic research in marine and nonmodel systems.

    The website's existing content spans population and functional genomics, including current recommendations for whole‐genome sequencing, RAD‐seq, Pool‐seq and RNA‐seq. With the goal to facilitate the development of new, similar resources, we describe our process for aggregating and synthesizing methodological principles from the ecological genomics community to inform website content. We also detail steps for authorship and submission of new website content, as well as protocols for providing feedback and topic requests from the community.

    These web resources were constructed with guidance for doing rigorous, reproducible science. Collaboration and contributions to the website are encouraged from scientists of all skill sets and levels of expertise.

     
  5. Abstract

    Low‐coverage whole‐genome sequencing (WGS) is increasingly used for the study of evolution and ecology in both model and non‐model organisms; however, effective application of low‐coverage WGS data requires the implementation of probabilistic frameworks to account for the uncertainties in genotype likelihoods.

    Here, we present a probabilistic framework for using genotype likelihoods for standard population assignment applications. Additionally, we derive the Fisher information for allele frequency from genotype likelihoods and use that to describe a novel metric, the effective sample size, which figures heavily in assignment accuracy. We make these developments available for application through WGSassign, an open‐source software package that is computationally efficient for working with whole‐genome data.

    Using simulated and empirical data sets, we demonstrate the behaviour of our assignment method across a range of population structures, sample sizes and read depths. Through these results, we show that WGSassign can provide highly accurate assignment, even for samples with low average read depths (<0.01X) and among weakly differentiated populations.

    Our simulation results highlight the importance of equalizing the effective sample sizes among source populations in order to achieve accurate population assignment with low‐coverage WGS data. We further provide study design recommendations for population assignment studies and discuss the broad utility of effective sample size for studies using low‐coverage WGS data.

     