skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Variant calling and quality control of large-scale human genome sequencing data
Abstract Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.  more » « less
Award ID(s):
1705197
PAR ID:
10155514
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Emerging Topics in Life Sciences
Volume:
3
Issue:
4
ISSN:
2397-8554
Page Range / eLocation ID:
399 to 409
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Abstract Background Modern Next Generation- and Third Generation- Sequencing methods such as Illumina and PacBio Circular Consensus Sequencing platforms provide accurate sequencing data. Parallel developments in Deep Learning have enabled the application of Deep Neural Networks to variant calling, surpassing the accuracy of classical approaches in many settings. DeepVariant, arguably the most popular among such methods, transforms the problem of variant calling into one of image recognition where a Deep Neural Network analyzes sequencing data that is formatted as images, achieving high accuracy. In this paper, we explore an alternative approach to designing Deep Neural Networks for variant calling, where we use meticulously designed Deep Neural Network architectures and customized variant inference functions that account for the underlying nature of sequencing data instead of converting the problem to one of image recognition. Results Results from 27 whole-genome variant calling experiments spanning Illumina, PacBio and hybrid Illumina-PacBio settings suggest that our method allows vastly smaller Deep Neural Networks to outperform the Inception-v3 architecture used in DeepVariant for indel and substitution-type variant calls. For example, our method reduces the number of indel call errors by up to 18%, 55% and 65% for Illumina, PacBio and hybrid Illumina-PacBio variant calling respectively, compared to a similarly trained DeepVariant pipeline. In these cases, our models are between 7 and 14 times smaller. Conclusions We believe that the improved accuracy and problem-specific customization of our models will enable more accurate pipelines and further method development in the field. HELLO is available at https://github.com/anands-repo/hello 
    more » « less
  2. Abstract MotivationSingle-nucleotide variants (SNVs) are the most common variations in the human genome. Recently developed methods for SNV detection from single-cell DNA sequencing data, such as SCIΦ and scVILP, leverage the evolutionary history of the cells to overcome the technical errors associated with single-cell sequencing protocols. Despite being accurate, these methods are not scalable to the extensive genomic breadth of single-cell whole-genome (scWGS) and whole-exome sequencing (scWES) data. ResultsHere, we report on a new scalable method, Phylovar, which extends the phylogeny-guided variant calling approach to sequencing datasets containing millions of loci. Through benchmarking on simulated datasets under different settings, we show that, Phylovar outperforms SCIΦ in terms of running time while being more accurate than Monovar (which is not phylogeny-aware) in terms of SNV detection. Furthermore, we applied Phylovar to two real biological datasets: an scWES triple-negative breast cancer data consisting of 32 cells and 3375 loci as well as an scWGS data of neuron cells from a normal human brain containing 16 cells and approximately 2.5 million loci. For the cancer data, Phylovar detected somatic SNVs with high or moderate functional impact that were also supported by bulk sequencing dataset and for the neuron dataset, Phylovar identified 5745 SNVs with non-synonymous effects some of which were associated with neurodegenerative diseases. Availability and implementationPhylovar is implemented in Python and is publicly available at https://github.com/NakhlehLab/Phylovar. 
    more » « less
  3. Abstract Background Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome next-generation sequencing data. SV genotypers that assume that the aligned sequencing data uniformly reflect the underlying SV or use existing SV call sets as training data can only partially account for variant and sample-specific biases. Results We introduce NPSV, a machine learning–based approach for genotyping previously discovered SVs that uses next-generation sequencing simulation to model the combined effects of the genomic region, sequencer, and alignment pipeline on the observed SV evidence. We evaluate NPSV alongside existing SV genotypers on multiple benchmark call sets. We show that NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. Conclusions Growing SV databases and the increasing availability of SV calls from long-read sequencing make stand-alone genotyping of previously identified SVs an increasingly important component of genome analyses. By treating potential biases as a “black box” that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications. 
    more » « less
  4. Pettigrew, Melinda M. (Ed.)
    ABSTRACT Viral genome sequencing has guided our understanding of the spread and extent of genetic diversity of SARS-CoV-2 during the COVID-19 pandemic. SARS-CoV-2 viral genomes are usually sequenced from nasopharyngeal swabs of individual patients to track viral spread. Recently, RT-qPCR of municipal wastewater has been used to quantify the abundance of SARS-CoV-2 in several regions globally. However, metatranscriptomic sequencing of wastewater can be used to profile the viral genetic diversity across infected communities. Here, we sequenced RNA directly from sewage collected by municipal utility districts in the San Francisco Bay Area to generate complete and nearly complete SARS-CoV-2 genomes. The major consensus SARS-CoV-2 genotypes detected in the sewage were identical to clinical genomes from the region. Using a pipeline for single nucleotide variant calling in a metagenomic context, we characterized minor SARS-CoV-2 alleles in the wastewater and detected viral genotypes which were also found within clinical genomes throughout California. Observed wastewater variants were more similar to local California patient-derived genotypes than they were to those from other regions within the United States or globally. Additional variants detected in wastewater have only been identified in genomes from patients sampled outside California, indicating that wastewater sequencing can provide evidence for recent introductions of viral lineages before they are detected by local clinical sequencing. These results demonstrate that epidemiological surveillance through wastewater sequencing can aid in tracking exact viral strains in an epidemic context. 
    more » « less
  5. INTRODUCTION Genome-wide association studies (GWASs) have identified thousands of human genetic variants associated with diverse diseases and traits, and most of these variants map to noncoding loci with unknown target genes and function. Current approaches to understand which GWAS loci harbor causal variants and to map these noncoding regulators to target genes suffer from low throughput. With newer multiancestry GWASs from individuals of diverse ancestries, there is a pressing and growing need to scale experimental assays to connect GWAS variants with molecular mechanisms. Here, we combined biobank-scale GWASs, massively parallel CRISPR screens, and single-cell sequencing to discover target genes of noncoding variants for blood trait loci with systematic targeting and inhibition of noncoding GWAS loci with single-cell sequencing (STING-seq). RATIONALE Blood traits are highly polygenic, and GWASs have identified thousands of noncoding loci that map to candidate cis -regulatory elements (CREs). By combining CRE-silencing CRISPR perturbations and single-cell readouts, we targeted hundreds of GWAS loci in a single assay, revealing target genes in cis and in trans . For select CREs that regulate target genes, we performed direct variant insertion. Although silencing the CRE can identify the target gene, direct variant insertion can identify magnitude and direction of effect on gene expression for the GWAS variant. In select cases in which the target gene was a transcription factor or microRNA, we also investigated the gene-regulatory networks altered upon CRE perturbation and how these networks differ across blood cell types. RESULTS We inhibited candidate CREs from fine-mapped blood trait GWAS variants (from ~750,000 individual of diverse ancestries) in human erythroid progenitors. In total, we targeted 543 variants (254 loci) mapping to candidate CREs, generating multimodal single-cell data including transcriptome, direct CRISPR gRNA capture, and cell surface proteins. We identified target genes in cis (within 500 kb) for 134 CREs. In most cases, we found that the target gene was the closest gene and that specific enhancer-associated biochemical hallmarks (H3K27ac and accessible chromatin) are essential for CRE function. Using multiple perturbations at the same locus, we were able to distinguished between causal variants from noncausal variants in linkage disequilibrium. For a subset of validated CREs, we also inserted specific GWAS variants using base-editing STING-seq (beeSTING-seq) and quantified the effect size and direction of GWAS variants on gene expression. Given our transcriptome-wide data, we examined dosage effects in cis and trans in cases in which the cis target is a transcription factor or microRNA. We found that trans target genes are also enriched for GWAS loci, and identified gene clusters within trans gene networks with distinct biological functions and expression patterns in primary human blood cells. CONCLUSION In this work, we investigated noncoding GWAS variants at scale, identifying target genes in single cells. These methods can help to address the variant-to-function challenges that are a barrier for translation of GWAS findings (e.g., drug targets for diseases with a genetic basis) and greatly expand our ability to understand mechanisms underlying GWAS loci. Identifying causal variants and their target genes with STING-seq. Uncovering causal variants and their target genes or function are a major challenge for GWASs. STING-seq combines perturbation of noncoding loci with multimodal single-cell sequencing to profile hundreds of GWAS loci in parallel. This approach can identify target genes in cis and trans , measure dosage effects, and decipher gene-regulatory networks. 
    more » « less