

Title: lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data
Abstract

Motivation

Recent advances in genomics and precision medicine have been made possible through the application of high-throughput sequencing (HTS) to large collections of human genomes. Although HTS technologies have proven their utility in cataloging human genome variation, computational analysis of the data they generate is still far from perfect. The main limitation of Illumina and other popular sequencing technologies is their short read length relative to the lengths of (common) genomic repeats. Newer single-molecule sequencing (SMS) technologies such as Pacific Biosciences and Oxford Nanopore produce longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. Unfortunately, because of their high sequencing error rate, reads generated by these technologies are very difficult to work with and cannot be used in many standard downstream analysis pipelines. Note that it is difficult not only to find the correct mapping locations of such reads in a reference genome but also to establish their correct alignment so as to differentiate sequencing errors from real genomic variants. Furthermore, especially since newer SMS instruments provide higher throughput, mapping and alignment need to be performed much faster than before while maintaining high sensitivity.

Results

We introduce lordFAST, a novel long-read mapper specifically designed to align reads generated by PacBio and potentially other SMS technologies to a reference genome. lordFAST not only has higher sensitivity than the available alternatives but is also among the fastest and has a very low memory footprint.

Availability and implementation

lordFAST is implemented in C++ and supports multi-threading. The source code of lordFAST is available at https://github.com/vpc-ccg/lordfast.

Supplementary information

Supplementary data are available at Bioinformatics online.

 
NSF-PAR ID: 10393437
Publisher / Repository: Oxford University Press
Journal Name: Bioinformatics
Volume: 35
Issue: 1
ISSN: 1367-4803
Page Range / eLocation ID: p. 20-27
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Motivation

    As RNA viruses mutate and adapt to environmental changes, often developing resistance to antiviral vaccines and drugs, they form an ensemble of viral strains known as a viral quasispecies. While high-throughput sequencing (HTS) has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small.

    Results

    This paper presents TenSQR, an algorithm that utilizes a tensor factorization framework to analyze HTS data and reconstruct viral quasispecies characterized by highly uneven frequencies of their components. Fundamentally, TenSQR performs clustering with successive data removal to infer the strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, the sequencing reads generated from that strain are removed from the dataset. This successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities of 1–10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains.
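    The successive reconstruct-and-remove loop described above can be illustrated with a toy sketch. A per-position majority consensus and a Hamming-distance threshold stand in for TenSQR's actual tensor factorization and clustering; all names, the read format (equal-length strings), and the threshold are illustrative assumptions, not TenSQR's implementation.

```python
from collections import Counter

def consensus(reads):
    """Per-position majority vote over equal-length reads."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*reads))

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def successive_reconstruct(reads, max_dist=1):
    """Infer strains from most to least abundant: take the consensus of the
    remaining reads, remove the reads it explains, and repeat."""
    strains = []
    remaining = list(reads)
    while remaining:
        strain = consensus(remaining)
        explained = [r for r in remaining if hamming(r, strain) <= max_dist]
        if not explained:  # consensus explains nothing; stop
            break
        strains.append((strain, len(explained)))
        remaining = [r for r in remaining if hamming(r, strain) > max_dist]
    return strains

# Two strains with uneven abundances (7 vs 3 reads), one sequencing error each
reads = ["ACGTACGT"] * 6 + ["ACGTACGA"] + ["TTGTACCT"] * 2 + ["TTGTACCA"]
print(successive_reconstruct(reads))
```

    Because the dominant strain's reads outvote the rare strain's at every position, the first consensus recovers the abundant strain; removing its reads then lets the rare strain's consensus emerge.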

    Availability and implementation

    TenSQR is available at https://github.com/SoYeonA/TenSQR.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Alkan, Can (Ed.)
    Abstract

    Motivation

    Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads, play an important role in identifying novel SVs. Long-read sequencers, such as nanopore sequencers, can address this problem by providing very long reads, albeit with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing are biased because of the physics of the sequencing process, and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection. In this article, we design and evaluate HQAlign, an aligner for SV detection using nanopore-sequenced reads. The key ideas of HQAlign include (i) using base-called nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into the existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignment.

    Results

    We show that HQAlign captures about 4%–6% complementary SVs across different datasets that are missed by minimap2 alignments, while its standalone performance is on par with minimap2 on real nanopore read data. For the SV calls common to HQAlign and minimap2, HQAlign improves the start and end breakpoint accuracy by about 10%–50% across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2's 85.64% for nanopore read alignment to the recent telomere-to-telomere CHM13 assembly, and to 86.65% from 83.48% for alignment to the GRCh37 human genome.
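    As a toy illustration of how "common" and "complementary" SV calls between two callsets can be counted, the sketch below matches calls by breakpoint proximity. This is not HQAlign's actual comparison code; the call representation (chromosome, start, end) and the 50 bp matching tolerance are assumptions made for the example.

```python
def match_svs(calls_a, calls_b, tol=50):
    """Pair SV calls from two callsets when both breakpoints agree within
    `tol` bp on the same chromosome; returns (common, only_a, only_b)."""
    common, only_a = [], []
    unmatched_b = list(calls_b)
    for chrom, start, end in calls_a:
        hit = next((c for c in unmatched_b
                    if c[0] == chrom
                    and abs(c[1] - start) <= tol
                    and abs(c[2] - end) <= tol), None)
        if hit is not None:
            common.append(((chrom, start, end), hit))
            unmatched_b.remove(hit)  # each call matches at most once
        else:
            only_a.append((chrom, start, end))
    return common, only_a, unmatched_b

# hypothetical callsets from two aligners
set_a = [("chr1", 1000, 1500), ("chr2", 5000, 5600), ("chr3", 100, 900)]
set_b = [("chr1", 1020, 1480), ("chr2", 9000, 9600)]
common, a_only, b_only = match_svs(set_a, set_b)
```

    Calls found only in one set are that set's complementary calls; breakpoint accuracy for the common pairs could then be assessed by the offsets within each matched pair.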

    Availability and implementation

    https://github.com/joshidhaivat/HQAlign.git.

     
  3. Background

    Metagenomics has transformed our understanding of microbial diversity across ecosystems, with recent advances enabling de novo assembly of genomes from metagenomes. These metagenome-assembled genomes are critical to provide ecological, evolutionary, and metabolic context for all the microbes and viruses yet to be cultivated. Metagenomes can now be generated from nanogram to subnanogram amounts of DNA. However, these libraries require several rounds of PCR amplification before sequencing, and recent data suggest these typically yield smaller and more fragmented assemblies than regular metagenomes.

    Methods

    Here we evaluated de novo assembly methods for 169 PCR-amplified metagenomes, including 25 for which an unamplified counterpart is available, to optimize specific assembly approaches for PCR-amplified libraries. We first evaluated coverage bias by mapping reads from PCR-amplified metagenomes onto reference contigs obtained from unamplified metagenomes of the same samples. Then, we compared different assembly pipelines in terms of assembly size (number of bp in contigs ≥ 10 kb) and error rates to evaluate which are best suited for PCR-amplified metagenomes.
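    The assembly-size metric used here (total bp in contigs ≥ 10 kb) is simple to compute once contigs are in hand; a minimal sketch, assuming the assembly is held as a name-to-sequence mapping (names and threshold are illustrative):

```python
def assembly_size(contigs, min_len=10_000):
    """Assembly size as defined above: total bp in contigs >= min_len."""
    return sum(len(seq) for seq in contigs.values() if len(seq) >= min_len)

contigs = {
    "contig_1": "A" * 25_000,  # counted
    "contig_2": "C" * 10_000,  # exactly at the threshold, counted
    "contig_3": "G" * 4_000,   # too short, ignored
}
print(assembly_size(contigs))  # 35000
```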

    Results

    Read mapping analyses revealed that the depth of coverage within individual genomes is significantly more uneven in PCR-amplified datasets than in unamplified metagenomes, with regions of high depth of coverage enriched in short inserts. This enrichment scales with the number of PCR cycles performed and is presumably due to preferential amplification of short inserts. Standard assembly pipelines are confounded by this type of coverage unevenness, so we evaluated other assembly options to mitigate these issues. We found that a pipeline combining read deduplication and an assembly algorithm originally designed to recover genomes from libraries generated after whole-genome amplification (single-cell SPAdes) frequently improved the assembly of contigs ≥ 10 kb by 10- to 100-fold for low-input metagenomes.
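    The read-deduplication step in such a pipeline is normally done with dedicated tools on real FASTQ data; as a conceptual sketch only, exact-sequence deduplication of PCR duplicates (keeping the first occurrence of each sequence) can be written as:

```python
def deduplicate(reads):
    """Drop exact duplicate read sequences, keeping the first occurrence
    (a crude stand-in for PCR-duplicate removal before assembly)."""
    seen = set()
    unique = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique

reads = ["ACGT", "ACGT", "TTAA", "ACGT", "GGCC"]
print(deduplicate(reads))  # ['ACGT', 'TTAA', 'GGCC']
```

    Real duplicate-marking also considers alignment positions and sequencing errors; exact matching is the simplest possible variant and is shown here only to make the pipeline step concrete.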

    Conclusions

    PCR-amplified metagenomes have enabled scientists to explore communities traditionally challenging to describe, including some with extremely low biomass or from which DNA is particularly difficult to extract. Here we show that a modified assembly pipeline can lead to an improved de novo genome assembly from PCR-amplified datasets and enables better genome recovery from low-input metagenomes.

     
  4. INTRODUCTION

    One of the central applications of the human reference genome has been to serve as a baseline for comparison in nearly all human genomic studies. Unfortunately, many difficult regions of the reference genome have remained unresolved for decades and are affected by collapsed duplications, missing sequences, and other issues. Relative to the current human reference genome, GRCh38, the Telomere-to-Telomere CHM13 (T2T-CHM13) genome closes all remaining gaps, adds nearly 200 million base pairs (Mbp) of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for scientific inquiry.

    RATIONALE

    We demonstrate how the T2T-CHM13 reference genome universally improves read mapping and variant identification in a globally diverse cohort. This cohort includes all 3202 samples from the expanded 1000 Genomes Project (1KGP), sequenced with short reads, as well as 17 globally diverse samples sequenced with long reads. By applying state-of-the-art methods for calling single-nucleotide variants (SNVs) and structural variants (SVs), we document the strengths and limitations of T2T-CHM13 relative to its predecessors and highlight its promise for revealing new biological insights within technically challenging regions of the genome.

    RESULTS

    Across the 1KGP samples, we found more than 1 million additional high-quality variants genome-wide using T2T-CHM13 than with GRCh38. Within previously unresolved regions of the genome, we identified hundreds of thousands of variants per sample, a promising opportunity for evolutionary and biomedical discovery. T2T-CHM13 improves the Mendelian concordance rate among trios and eliminates tens of thousands of spurious SNVs per sample, including a reduction of false positives in 269 challenging, medically relevant genes by up to a factor of 12. These corrections are in large part due to improvements to 70 protein-coding genes in >9 Mbp of inaccurate sequence caused by falsely collapsed or duplicated regions in GRCh38. Using the T2T-CHM13 genome also yields a more comprehensive view of SVs genome-wide, with a greatly improved balance of insertions and deletions. Finally, by providing numerous resources for T2T-CHM13 (including 1KGP genotypes, accessibility masks, and prominent annotation databases), our work will facilitate the transition to T2T-CHM13 from the current reference genome.

    CONCLUSION

    The vast improvements in variant discovery across samples of diverse ancestries position T2T-CHM13 to succeed as the next prevailing reference for human genetics. T2T-CHM13 thus offers a model for the construction and study of high-quality reference genomes from globally diverse individuals, such as is now being pursued through collaboration with the Human Pangenome Reference Consortium. As a foundation, our work underscores the benefits of an accurate and complete reference genome for revealing diversity across human populations.

    Figure caption: Genomic features and resources available for T2T-CHM13. Comparisons to GRCh38 reveal broad improvements in SNVs, indels, and SVs discovered across diverse human populations by means of short-read (1KGP) and long-read sequencing (LRS). These improvements are due to resolution of complex genomic loci (nonsyntenic and previously unresolved), duplication errors, and discordant haplotypes, including those in medically relevant genes.
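    The Mendelian concordance check mentioned in the results can be made concrete for a single biallelic site: a child's diploid genotype is consistent if one allele can be inherited from each parent. The sketch below is illustrative only (it ignores phasing, multi-allelic sites, and de novo mutations; all names are hypothetical).

```python
def mendelian_consistent(child, mother, father):
    """True if one child allele can come from the mother and the other
    from the father (single biallelic site, unphased genotypes)."""
    a, b = child
    return ((a in mother and b in father) or
            (b in mother and a in father))

def concordance_rate(trios):
    """Fraction of (child, mother, father) genotype triples that are
    Mendelian-consistent."""
    ok = sum(mendelian_consistent(c, m, f) for c, m, f in trios)
    return ok / len(trios)

trios = [
    (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    (("T", "T"), ("A", "G"), ("A", "C")),  # inconsistent: no parent has T
]
print(concordance_rate(trios))  # 0.5
```

    A reference with fewer collapsed or duplicated regions produces fewer spurious genotype calls, which is why concordance over many such sites improves with T2T-CHM13.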
  5. Abstract

    Motivation

    High-throughput mRNA sequencing (RNA-Seq) is a powerful tool for quantifying gene expression. Identification of transcript isoforms that are differentially expressed in different conditions, such as in patients and healthy subjects, can provide insights into the molecular basis of diseases. Current transcript quantification approaches, however, do not take advantage of the shared information in the biological replicates, potentially decreasing sensitivity and accuracy.

    Results

    We present a novel hierarchical Bayesian model called Differentially Expressed Isoform detection from Multiple biological replicates (DEIsoM) for identifying differentially expressed (DE) isoforms from multiple biological replicates representing two conditions, e.g. multiple samples from healthy and diseased subjects. DEIsoM first estimates isoform expression within each condition by (1) capturing common patterns from sample replicates while allowing individual differences, and (2) modeling the uncertainty introduced by ambiguous read mapping in each replicate. Specifically, we introduce a Dirichlet prior distribution to capture the common expression pattern of replicates from the same condition, and treat the isoform expression of individual replicates as samples from this distribution. Ambiguous read mapping is modeled as a multinomial distribution, and ambiguous reads are assigned to the most probable isoform in each replicate. Additionally, DEIsoM couples efficient variational inference with a post-analysis method to improve the accuracy and speed of identifying DE isoforms over alternative methods. Application of DEIsoM to a hepatocellular carcinoma (HCC) dataset identifies biologically relevant DE isoforms. The relevance of these genes/isoforms to HCC is supported by principal component analysis (PCA), read coverage visualization, and the biological literature.
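    The generative view described above (a shared Dirichlet prior over isoform proportions, with ambiguous reads assigned to the most probable compatible isoform) can be sketched as a toy simulation. This is purely illustrative and omits DEIsoM's variational inference entirely; the prior values, function names, and read representation are assumptions for the example.

```python
import random

def dirichlet_sample(alpha, rng):
    """Draw isoform proportions from Dirichlet(alpha) via normalized
    gamma draws (a standard sampling construction)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def assign_read(compatible, proportions):
    """Assign an ambiguous read to its most probable compatible isoform."""
    return max(compatible, key=lambda i: proportions[i])

rng = random.Random(42)
alpha = [8.0, 1.0, 1.0]  # assumed shared prior: one dominant isoform
props = dirichlet_sample(alpha, rng)

# a read that maps ambiguously to isoforms 0 and 2 gets the likelier one
print(assign_read([0, 2], props))
```

    In the full model, each replicate's proportions are a separate draw from the shared prior, which is how common condition-level patterns and individual replicate differences coexist.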

    Availability and implementation

    The software is available at https://github.com/hao-peng/DEIsoM

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     