Title: Population assignment from genotype likelihoods for low‐coverage whole‐genome sequencing data
Abstract Low‐coverage whole‐genome sequencing (WGS) is increasingly used for the study of evolution and ecology in both model and non‐model organisms; however, effective application of low‐coverage WGS data requires the implementation of probabilistic frameworks to account for the uncertainties in genotype likelihoods. Here, we present a probabilistic framework for using genotype likelihoods for standard population assignment applications. Additionally, we derive the Fisher information for allele frequency from genotype likelihoods and use that to describe a novel metric, the effective sample size, which figures heavily in assignment accuracy. We make these developments available for application through WGSassign, an open‐source software package that is computationally efficient for working with whole‐genome data. Using simulated and empirical data sets, we demonstrate the behaviour of our assignment method across a range of population structures, sample sizes and read depths. Through these results, we show that WGSassign can provide highly accurate assignment, even for samples with low average read depths (<0.01X) and among weakly differentiated populations. Our simulation results highlight the importance of equalizing the effective sample sizes among source populations in order to achieve accurate population assignment with low‐coverage WGS data. We further provide study design recommendations for population assignment studies and discuss the broad utility of effective sample size for studies using low‐coverage WGS data.
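The abstract describes assignment from genotype likelihoods without giving formulas, so the following is only a minimal sketch of the general idea, not WGSassign's code or API. It assumes biallelic loci, Hardy‐Weinberg genotype proportions within each source population, independent loci, and a uniform prior over populations; the function names `assignment_loglik` and `posterior` and the clipping constant `eps` are hypothetical. The Fisher information and effective sample size calculations are not reproduced here because the abstract does not state them.

```python
import numpy as np


def assignment_loglik(gl, freqs, eps=1e-6):
    """Log-likelihood of one individual under each source population.

    gl:    (n_loci, 3) genotype likelihoods for 0, 1, 2 copies of the
           alternate allele (assumed layout, not a WGSassign format).
    freqs: (n_pops, n_loci) alternate-allele frequencies per population.
    """
    # Keep frequencies away from 0 and 1 so the log stays finite.
    f = np.clip(freqs, eps, 1.0 - eps)
    # Hardy-Weinberg genotype probabilities, shape (n_pops, n_loci, 3).
    geno_prob = np.stack([(1 - f) ** 2, 2 * f * (1 - f), f ** 2], axis=-1)
    # Marginalise over the unknown genotype at every locus.
    per_locus = (geno_prob * gl[None, :, :]).sum(axis=-1)
    # Loci are treated as independent, so log-likelihoods add up.
    return np.log(per_locus).sum(axis=1)


def posterior(loglik):
    """Posterior assignment probabilities under a uniform prior."""
    w = np.exp(loglik - loglik.max())  # subtract the max for stability
    return w / w.sum()
```

With uninformative genotype likelihoods (equal weight on all three genotypes at every locus), every population receives the same likelihood, which illustrates why very low read depth has to be offset by many loci.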
Award ID(s):
1942313
PAR ID:
10488542
Publisher / Repository:
Wiley-Blackwell
Journal Name:
Methods in Ecology and Evolution
Volume:
15
Issue:
3
ISSN:
2041-210X
Format(s):
Medium: X Size: p. 493-510
Size(s):
p. 493-510
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Low‐coverage whole genome sequencing (lcWGS) has emerged as a powerful and cost‐effective approach for population genomic studies in both model and nonmodel species. However, with read depths too low to confidently call individual genotypes, lcWGS requires specialized analysis tools that explicitly account for genotype uncertainty. A growing number of such tools have become available, but it can be difficult to get an overview of what types of analyses can be performed reliably with lcWGS data, and how the distribution of sequencing effort between the number of samples analysed and per‐sample sequencing depths affects inference accuracy. In this introductory guide to lcWGS, we first illustrate how the per‐sample cost for lcWGS is now comparable to RAD‐seq and Pool‐seq in many systems. We then provide an overview of software packages that explicitly account for genotype uncertainty in different types of population genomic inference. Next, we use both simulated and empirical data to assess the accuracy of allele frequency, genetic diversity, and linkage disequilibrium estimation, detection of population structure, and selection scans under different sequencing strategies. Our results show that spreading a given amount of sequencing effort across more samples with lower depth per sample consistently improves the accuracy of most types of inference, with a few notable exceptions. Finally, we assess the potential for using imputation to bolster inference from lcWGS data in nonmodel species, and discuss current limitations and future perspectives for lcWGS‐based population genomics research. With this overview, we hope to make lcWGS more approachable and stimulate its broader adoption. 
  2. Stamatakis, Alexandros (Ed.)
    Abstract Motivation: Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results: We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome (‘sample-specific’ strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach will give us the ability to perform comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data). Availability and implementation: Data, code and instructions for reproducing the results presented in this manuscript are publicly available at https://github.com/Parsoa/PingPong. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
  3. Abstract A prominent challenge for managing migratory species is the development of conservation plans that accommodate spatiotemporally varying distributions throughout the year. Migratory networks are spatially‐explicit models that incorporate migratory assignment and seasonal abundance data to define patterns of connectivity between stages of the annual cycle. These models are particularly useful for widespread application because different types of migratory data can be used to quantify individual and population‐level movement across the annual cycle of migratory species. While there are clear benefits of combining migratory assignment and abundance data for the development of conservation strategies, there is a concurrent need for corresponding user‐friendly software to facilitate the integration of these data for conservation. Here, we present mignette (migratory network tools ensemble), an R package for developing migratory network models to estimate network connectivity among migratory populations. We demonstrate the functionality of mignette with three empirical examples that highlight the use of different types of tracking data for migratory assignment. mignette facilitates the modelling of migratory networks by providing R functions to: (1) define breeding and nonbreeding nodes, (2) assemble abundance and assignment data and (3) model the migratory network. Additionally, mignette provides R functions to visualize modelled migratory networks. With increasing availability of migratory assignment and abundance data, mignette represents a valuable tool for developing effective conservation strategies for migratory species.
  4. Abstract Objective: Whole genome sequencing (WGS) can help identify transmission of pathogens causing healthcare-associated infections (HAIs). However, the current gold standard of short-read, Illumina-based WGS is labor and time intensive. Given recent improvements in long-read Oxford Nanopore Technologies (ONT) sequencing, we sought to establish a low resource approach providing accurate WGS-pathogen comparison within a time frame allowing for infection prevention and control (IPC) interventions. Methods: WGS was prospectively performed on pathogens at increased risk of potential healthcare transmission using the ONT MinION sequencer with R10.4.1 flow cells and Dorado basecaller. Potential transmission was assessed via Ridom SeqSphere+ for core genome multilocus sequence typing and MINTyper for reference-based core genome single nucleotide polymorphisms using previously published cutoff values. The accuracy of our ONT pipeline was determined relative to Illumina. Results: Over a six-month period, 242 bacterial isolates from 216 patients were sequenced by a single operator. Compared to the Illumina gold standard, our ONT pipeline achieved a mean identity score of Q60 for assembled genomes, even with a coverage rate as low as 40×. The mean time from initiating DNA extraction to complete analysis was 2 days (IQR 2–3.25 days). We identified five potential transmission clusters comprising 21 isolates (8.7% of sequenced strains). Integrating ONT with epidemiological data, >70% (15/21) of putative transmission cluster isolates originated from patients with potential healthcare transmission links. Conclusions: Via a stand-alone ONT pipeline, we detected potentially transmitted HAI pathogens rapidly and accurately, aligning closely with epidemiological data. Our low-resource method has the potential to assist in IPC efforts.
  5. Abstract Genetic diversity plays a key role in maintaining population viability by preventing inbreeding depression and providing the building blocks for adaptation. Understanding how genetic diversity varies across space is, therefore, of key interest in conservation and population genetics. Here, we introduce wingen, an R package for calculating continuous maps of genetic diversity, including nucleotide diversity, allelic richness, and heterozygosity, from standard genotypic and spatial data using a spatial moving window approach. We provide functions to account for variation in sample size across space using rarefaction, to create kriging‐interpolated maps of genetic diversity, and to mask any areas that are outside the area of interest. Tests with simulated and empirical datasets demonstrate that wingen can successfully capture variation in genetic diversity across landscapes from both reduced‐representation and whole genome sequencing datasets. For reduced‐representation datasets, wingen's functions can be run easily on a standard laptop computer, and we provide options for parallelization to increase the efficiency of running larger whole genome datasets. wingen provides novel and computationally tractable tools for creating informative maps of genetic diversity with applications for conservation prioritization as well as population and landscape genetic analyses.