Title: Genotype-Frequency Estimation from High-Throughput Sequencing Data
Abstract: Rapidly improving high-throughput sequencing technologies provide unprecedented opportunities for carrying out population-genomic studies with various organisms. To take full advantage of these methods, it is essential to correctly estimate allele and genotype frequencies, and here we present a maximum-likelihood method that accomplishes these tasks. The proposed method fully accounts for uncertainties resulting from sequencing errors and biparental chromosome sampling and yields essentially unbiased estimates with minimal sampling variances with moderately high depths of coverage regardless of a mating system and structure of the population. Moreover, we have developed statistical tests for examining the significance of polymorphisms and their genotypic deviations from Hardy–Weinberg equilibrium. We examine the performance of the proposed method by computer simulations and apply it to low-coverage human data generated by high-throughput sequencing. The results show that the proposed method improves our ability to carry out population-genomic analyses in important ways. The software package of the proposed method is freely available from https://github.com/Takahiro-Maruki/Package-GFE.
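
The general idea in the abstract, estimating genotype frequencies by maximum likelihood while allowing for sequencing errors and random sampling of the two parental chromosomes, can be illustrated with a small sketch. This is not the Package-GFE implementation. It assumes a simplified biallelic site (alleles M and m), a known symmetric error rate `e`, and an expectation-maximization update; the function names and the `(k, n)` input format (k reads showing M out of n reads for one individual) are hypothetical.

```python
from math import comb


def genotype_likelihoods(k, n, e):
    """Likelihood of seeing k M-reads out of n under genotypes MM, Mm, mm.

    Assumption: each read from an M chromosome is misread as m with
    probability e (and vice versa); a heterozygote yields M or m reads
    with probability 1/2 each, since the error terms cancel.
    """
    def binom(p):
        return comb(n, k) * p**k * (1 - p) ** (n - k)

    return (binom(1 - e), binom(0.5), binom(e))


def estimate_genotype_freqs(reads, e=0.01, iters=200):
    """EM estimate of (P_MM, P_Mm, P_mm) from per-individual read counts."""
    liks = [genotype_likelihoods(k, n, e) for k, n in reads]
    freqs = [1 / 3, 1 / 3, 1 / 3]  # uninformative starting point
    for _ in range(iters):
        totals = [0.0, 0.0, 0.0]
        for lik in liks:
            # E-step: posterior probability of each genotype for this individual
            weighted = [f * l for f, l in zip(freqs, lik)]
            s = sum(weighted)
            for g in range(3):
                totals[g] += weighted[g] / s
        # M-step: new frequencies are the average posterior probabilities
        freqs = [t / len(liks) for t in totals]
    return freqs


if __name__ == "__main__":
    # Ten hypothetical individuals at 20x coverage.
    reads = [(20, 20)] * 5 + [(10, 20)] * 3 + [(0, 20)] * 2
    p_MM, p_Mm, p_mm = estimate_genotype_freqs(reads)
    allele_freq_M = p_MM + p_Mm / 2
    print(p_MM, p_Mm, p_mm, allele_freq_M)
```

At low coverage the per-individual posteriors become less decisive (a heterozygote can easily yield reads from only one chromosome), which is why the frequencies are estimated jointly from all individuals rather than by calling each genotype first.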
Award ID(s):
1759906
PAR ID:
10290731
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Genetics
Volume:
201
Issue:
2
ISSN:
1943-2631
Page Range / eLocation ID:
473 to 486
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Whole‐genome sequencing data allow survey of variation from across the genome, reducing the constraint of balancing genome sub‐sampling with estimating recombination rates and linkage between sampled markers and target loci. As sequencing costs decrease, low‐coverage whole‐genome sequencing of pooled or indexed‐individual samples is commonly utilized to identify loci associated with phenotypes or environmental axes in non‐model organisms. There are, however, relatively few publicly available bioinformatic pipelines designed explicitly to analyse these types of data, and fewer still that process the raw sequencing data, provide useful metrics of quality control and then execute analyses. Here, we present an updated version of a bioinformatics pipeline called PoolParty2 that can effectively handle either pooled or indexed DNA samples and includes new features to improve computational efficiency. Using simulated data, we demonstrate the ability of our pipeline to recover segregating variants, estimate their allele frequencies accurately, and identify genomic regions harbouring loci under selection. Based on the simulated data set, we benchmark the efficacy of our pipeline with another bioinformatic suite, angsd, and illustrate the compatibility and complementarity of these suites using angsd to generate genotype likelihoods as input for identifying linkage outlier regions using alignment files and variants provided by PoolParty2. Finally, we apply our updated pipeline to an empirical dataset of low‐coverage whole genomic data from population samples of Columbia River steelhead trout (Oncorhynchus mykiss), results from which demonstrate the genomic impacts of decades of artificial selection in a prominent hatchery stock. Thus, we not only demonstrate the utility of PoolParty2 for genomic studies that combine sequencing data from multiple individuals, but also illustrate how it complements other bioinformatics resources such as angsd.
  2. DNA sequencing plays an important role in the bioinformatics research community. DNA sequencing is important to all organisms, especially to humans, and from multiple perspectives. These include understanding the correlation of specific mutations that play a significant role in increasing or decreasing the risks of developing a disease or condition, or finding the implications and connections between the genotype and the phenotype. Advancements in high-throughput sequencing techniques, tools, and equipment have helped to generate big genomic datasets due to the tremendous decrease in DNA sequencing costs. However, these advancements have posed great challenges to genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Data size plays an important role in addressing these challenges. Accordingly, data minimization techniques have recently attracted much interest in the bioinformatics research community. Therefore, it is critical to develop new ways to minimize the data size. This paper presents a new real-time data minimization mechanism for big genomic datasets to shorten the transfer time in a more secure manner, despite the potential occurrence of a data breach. Our method involves the application of the random sampling of Fourier transform theory to the real-time generated big genomic datasets of both formats, FASTA and FASTQ, and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm reduces FASTA dataset size by up to 79% while being 98-fold faster and more secure than the standard data-encoding method. The results also show up to a 45% size reduction for FASTQ datasets, with 57-fold faster operation than the standard data-encoding approach.
Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets.
  3. Abstract Background: Modern plant breeding strategies rely on the intensive use of advanced genomic tools to expedite the development of improved crop varieties. Genomic DNA extraction from crop seeds eliminates the need to grow plants in contrast to fresh leaf tissue; however, it can still be a bottleneck due to the presence of stored compounds and the complexity of the matrix. The interaction of environmentally benign choline-based ionic liquids (ILs) with DNA offers an innovative approach to enhance the quality of extracted DNA from seeds. While prior IL-based plant DNA extraction workflows have primarily supported polymerase chain reaction (PCR) and quantitative PCR-based applications, their suitability for high-throughput sequencing (HTS) remained largely unexplored. This study explores the efficacy of an IL-assisted method for genomic DNA extraction from soybean (Glycine max) seeds, addressing the limited application of ILs in HTS. Results: The optimized DNA extraction method, utilizing 25% (w/v) choline formate, enabled the recovery of high-purity DNA with abundant fragment sizes > 20 kb, suitable for downstream applications including PCR, whole genome amplification (WGA), simple sequence repeat (SSR) amplification, and high-throughput Illumina sequencing. The IL-method was benchmarked against a silica-binding method using cetyltrimethylammonium bromide (CTAB) and sodium dodecyl sulfate (SDS) as lysis agents using a commercial plant DNA extraction kit in terms of DNA yield, purity, abundant DNA fragment size distribution, and integrity. In addition, DNA isolated from this method demonstrated successful PCR amplification of markers from both the nuclear and plastid genomes and yielded > 99% whole genome coverage with Illumina (PE150) sequencing reads. Conclusions: This is the first known instance of a whole genome sequence generated from DNA extracted with ILs.
These findings mark a significant milestone in establishing ILs as promising alternatives to conventional methods for seed DNA extraction, with potential utility in third generation (long-read) sequencing experiments. 
  4. Abstract The development of next-generation sequencing (NGS) enabled a shift from array-based genotyping to directly sequencing genomic libraries for high-throughput genotyping. Even though whole-genome sequencing was initially too costly for routine analysis in large populations such as breeding or genetic studies, continued advancements in genome sequencing and bioinformatics have provided the opportunity to capitalize on whole-genome information. As new sequencing platforms can routinely provide high-quality sequencing data for sufficient genome coverage to genotype various breeding populations, a limitation comes in the time and cost of library construction when multiplexing a large number of samples. Here we describe a high-throughput whole-genome skim-sequencing (skim-seq) approach that can be utilized for a broad range of genotyping and genomic characterization. Using optimized low-volume Illumina Nextera chemistry, we developed a skim-seq method and combined up to 960 samples in one multiplex library using dual index barcoding. With the dual-index barcoding, the number of samples for multiplexing can be adjusted depending on the amount of data required, and could be extended to 3,072 samples or more. Panels of doubled haploid wheat lines (Triticum aestivum, CDC Stanley x CDC Landmark), wheat-barley (T. aestivum x Hordeum vulgare) and wheat-wheatgrass (Triticum durum x Thinopyrum intermedium) introgression lines as well as known monosomic wheat stocks were genotyped using the skim-seq approach. Bioinformatics pipelines were developed for various applications where sequencing coverage ranged from 1× down to 0.01× per sample. Using reference genomes, we detected chromosome dosage, identified aneuploidy, and karyotyped introgression lines from the skim-seq data.
Leveraging the recent advancements in genome sequencing, skim-seq provides an effective and low-cost tool for routine genotyping and genetic analysis, which can track and identify introgressions and genomic regions of interest in genetics research and applied breeding programs. 
  5. Elkins, Christopher A (Ed.)
    ABSTRACT Municipal wastewater harbors diverse RNA viruses, which are responsible for many emerging and reemerging diseases in humans, animals, and plants. Although genomic sequencing can be a high-throughput approach for profiling the RNA virome in wastewater, wastewater processing methods often influence sequencing outcomes. Here, we systematically evaluated two wastewater processing methods, tangential-flow ultrafiltration (TFF) and Nanotrap Microbiome A Particles, for detecting the target RNA virus severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) via amplicon sequencing and characterizing the RNA virome using whole-transcriptome shotgun sequencing. Our results from paired comparison tests showed that the TFF and Nanotrap methods recovered similar SARS-CoV-2 variants at the lineage level (analysis of similarity [ANOSIM] R = −0.012, P = 0.874). Optimizing automated procedures for the Nanotrap method and concentration factors for the TFF method was critical for achieving high-depth and high-breadth coverage of the target virus genome. Notably, the two methods enriched distinct RNA viromes from the same wastewater samples (ANOSIM R = 0.260, P = 0.002), with TFF samples showing 22-fold and 7-fold higher relative abundances of Reoviridae and Coronaviridae, respectively. These differences are likely due to the distinct virus concentration mechanisms employed by each method, which are influenced by liquid-solid partitioning of virus particles and interactions of viral surface proteins with ligands. Our findings underscore the importance of optimizing wastewater processing methods for genomic monitoring and have implications for broader environmental applications. IMPORTANCE Wastewater genomic sequencing is an emerging technology for tracking viral infections within communities. However, different methods for concentrating viruses and extracting nucleic acids can influence the recoveries of RNA virome from wastewater.
An in-depth understanding of virus concentration mechanisms and their impact on sequencing data quality and bioinformatic output would be critical to guide method selection and optimization. Specifically, this study systematically evaluated tangential-flow ultrafiltration and Nanotrap microbiome particles for their application to sequence SARS-CoV-2 and whole RNA virome from wastewater. Both methods yielded high-quality sequencing data for amplicon sequencing of SARS-CoV-2, but their outcomes diverged in the recovered RNA virome. We identified RNA viruses that are preferentially recovered by each of these two methods and proposed considerations of method selection for future studies of wastewater RNA virome. 