Title: A Fast, Reproducible, High-throughput Variant Calling Workflow for Population Genomics
Abstract The increasing availability of genomic resequencing data sets and high-quality reference genomes across the tree of life presents exciting opportunities for comparative population genomic studies. However, substantial challenges prevent the simple reuse of data across different studies and species, arising from variability in variant calling pipelines, data quality, and the need for computationally intensive reanalysis. Here, we present snpArcher, a flexible and highly efficient workflow designed for the analysis of genomic resequencing data in nonmodel organisms. snpArcher provides a standardized variant calling pipeline and includes modules for variant quality control, data visualization, variant filtering, and other downstream analyses. Implemented in Snakemake, snpArcher is user-friendly, reproducible, and designed to be compatible with high-performance computing clusters and cloud environments. To demonstrate the flexibility of this pipeline, we applied snpArcher to 26 public resequencing data sets from nonmammalian vertebrates. These variant data sets are hosted publicly to enable future comparative population genomic analyses. With its extensibility and the availability of public data sets, snpArcher will contribute to a broader understanding of genetic variation across species by facilitating the rapid use and reuse of large genomic data sets.
Award ID(s):
1754397
PAR ID:
10503731
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Editor(s):
Rzhetsky, Andrey
Publisher / Repository:
Molecular Biology and Evolution
Date Published:
Journal Name:
Molecular Biology and Evolution
Volume:
41
Issue:
1
ISSN:
0737-4038
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract The burgeoning field of genomics as applied to personalized medicine, epidemiology, conservation, agriculture, forensics, drug development, and other fields comes with large computational and bioinformatics costs, which are often inaccessible to student trainees in classroom settings at universities. However, with increased availability of resources such as NSF XSEDE, Google Cloud, Amazon AWS, and other high-performance computing (HPC) clouds and clusters for educational purposes, a growing community of academicians are working on teaching the utility of HPC resources in genomics and big data analyses. Here, I describe the successful implementation of a semester-long (16 week) upper division undergraduate/graduate level course in Computational Genomics and Bioinformatics taught at San Diego State University in Spring 2022. Students were trained in the theory, algorithms and hands-on applications of genomic data quality control, assembly, annotation, multiple sequence alignment, variant calling, phylogenomic analyses, population genomics, genome-wide association studies, and differential gene expression analyses using RNAseq data on their own dedicated 6-CPU NSF XSEDE Jetstream virtual machines. All lesson plans, activities, examinations, tutorials, code, lectures, and notes are publicly available at https://github.com/arunsethuraman/biomi609spring2022. 
  2. Abstract Background: High-quality genomic resources facilitate investigations into behavioral ecology, morphological and physiological adaptations, and the evolution of genomic architecture. Lizards in the genus Sceloporus have a long history as important ecological, evolutionary, and physiological models, making them a valuable target for the development of genomic resources. Findings: We present a high-quality chromosome-level reference genome assembly, SceUnd1.0 (using 10X Genomics Chromium, HiC, and Pacific Biosciences data), and tissue/developmental stage transcriptomes for the eastern fence lizard, Sceloporus undulatus. We performed synteny analysis with other snake and lizard assemblies to identify broad patterns of chromosome evolution including the fusion of micro- and macrochromosomes. We also used this new assembly to provide improved reference-based genome assemblies for 34 additional Sceloporus species. Finally, we used RNAseq and whole-genome resequencing data to compare 3 assemblies, each representing an increased level of cost and effort: Supernova Assembly with data from 10X Genomics Chromium, HiRise Assembly that added data from HiC, and PBJelly Assembly that added data from Pacific Biosciences sequencing. We found that the Supernova Assembly contained the full genome and was a suitable reference for RNAseq and single-nucleotide polymorphism calling, but the chromosome-level scaffolds provided by the addition of HiC data allowed synteny and whole-genome association mapping analyses. The subsequent addition of PacBio data doubled the contig N50 but provided negligible gains in scaffold length. Conclusions: These new genomic resources provide valuable tools for advanced molecular analysis of an organism that has become a model in physiology and evolutionary ecology.
  3. SUMMARY Maize (Zea mays ssp. mays) populations exhibit vast ranges of genetic and phenotypic diversity. As sequencing costs have declined, an increasing number of projects have sought to measure genetic differences between and within maize populations using whole‐genome resequencing strategies, identifying millions of segregating single‐nucleotide polymorphisms (SNPs) and insertions/deletions (InDels). Unlike older genotyping strategies like microarrays and genotyping by sequencing, resequencing should, in principle, frequently identify and score common genetic variants. However, in practice, different projects frequently employ different analytical pipelines, often employ different reference genome assemblies and consistently filter for minor allele frequency within the study population. This constrains the potential to reuse and remix data on genetic diversity generated from different projects to address new biological questions in new ways. Here, we employ resequencing data from 1276 previously published maize samples and 239 newly resequenced maize samples to generate a single unified marker set of approximately 366 million segregating variants and approximately 46 million high‐confidence variants scored across crop wild relatives, landraces as well as tropical and temperate lines from different breeding eras. We demonstrate that the new variant set provides increased power to identify known causal flowering‐time genes using previously published trait data sets, as well as the potential to track changes in the frequency of functionally distinct alleles across the global distribution of modern maize.
  4. Abstract High‐throughput sequencing has changed many aspects of population genetics, molecular ecology and related fields, affecting both experimental design and data analysis. The software package angsd allows users to perform a number of population genetic analyses on high‐throughput sequencing data. angsd uses probabilistic approaches which can directly make use of genotype likelihoods; thus, SNP calling is not required for comparative analyses. This takes advantage of all the sequencing data and produces more accurate results for samples with low sequencing depth. Here, we present angsd‐wrapper, a set of wrapper scripts that provides a user‐friendly interface for running angsd and visualizing results. angsd‐wrapper supports multiple types of analyses including estimates of nucleotide sequence diversity, neutrality tests, principal component analysis, estimation of admixture proportions for individual samples and calculation of statistics that quantify recent introgression. angsd‐wrapper also provides interactive graphing of angsd results to enhance data exploration. We demonstrate the usefulness of angsd‐wrapper by analysing resequencing data from populations of wild and domesticated Zea. angsd‐wrapper is freely available from https://github.com/mojaveazure/angsd-wrapper.
  5. Abstract Genome sequence contamination has a variety of causes and can originate from within or between species. Previous research focused primarily on cross-species contamination or on prokaryotes. This paper visualizes B-allele frequency to test for intra-species contamination, and measures its effects on phylogenetic and admixture analysis in two fungal species. Using a standard base calling pipeline, we found that contaminated genomes superficially appeared to produce good quality genome data. Yet as little as 5-10% genome contamination was enough to change phylogenetic tree topologies and make contaminated strains appear as hybrids between lineages (genetically admixed). We recommend the use of B-allele frequency plots to screen genome resequencing data for intra-species contamination. 