skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Seeding with minimized subsequence
Abstract MotivationModern methods for computation-intensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors. ResultsWe propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall length-k strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substring-based seeding methods in producing high-quality seed-matches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for long-reads analysis. Availability and implementationSubseqHash is freely available at https://github.com/Shao-Group/subseqhash.  more » « less
Award ID(s):
2145171 2019797
PAR ID:
10427209
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
39
Issue:
Supplement_1
ISSN:
1367-4803
Page Range / eLocation ID:
p. i232-i241
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The k-deck problem is concerned with finding the smallest positive integer S(k) such that there exist at least two strings of length S(k) that share the same k-deck, i.e., the multiset of subsequences of length k. We introduce the new problem of gapped k-deck reconstruction: For a given gap parameter s, we seek the smallest positive integer Gs(k) such that there exist at least two distinct strings of length Gs(k) that cannot be distinguished based on a "gapped" set of k-subsequences. The gap constraint requires the elements in the subsequences to be at least s positions apart within the original string. Our results are as follows. First, we show how to construct sequences sharing the same 2-gapped k-deck using a nontrivial modification of the recursive Morse-Thue string construction procedure. This establishes the first known constructive upper bound on G2(k). Second, we further improve this bound using the approach by Dudik and Schulman. 
    more » « less
  2. The k-deck problem is concerned with finding the smallest positive integer S(k) such that there exist at least two strings of length S(k) that share the same k-deck, i.e., the multiset of subsequences of length k. We introduce the new problem of gapped k-deck reconstruction: For a given gap parameter s, we seek the smallest positive integer G s (k) such that there exist at least two distinct strings of length G s (k) that cannot be distinguished based on a "gapped" set of k-subsequences. The gap constraint requires the elements in the subsequences to be at least s positions apart within the original string. Our results are as follows. First, we show how to construct sequences sharing the same 2-gapped k-deck using a nontrivial modification of the recursive Morse-Thue string construction procedure. This establishes the first known constructive upper bound on G 2 (k). Second, we further improve this bound using the approach by Dudik and Schulman [6]. 
    more » « less
  3. Abstract MotivationAccurate identification of genotypes is an essential part of the analysis of genomic data, including in identification of sequence polymorphisms, linking mutations with disease and determining mutation rates. Biological and technical processes that adversely affect genotyping include copy-number-variation, paralogous sequences, library preparation, sequencing error and reference-mapping biases, among others. ResultsWe modeled the read depth for all data as a mixture of Dirichlet-multinomial distributions, resulting in significant improvements over previously used models. In most cases the best model was comprised of two distributions. The major-component distribution is similar to a binomial distribution with low error and low reference bias. The minor-component distribution is overdispersed with higher error and reference bias. We also found that sites fitting the minor component are enriched for copy number variants and low complexity regions, which can produce erroneous genotype calls. By removing sites that do not fit the major component, we can improve the accuracy of genotype calls. Availability and ImplementationMethods and data files are available at https://github.com/CartwrightLab/WuEtAl2017/ (doi:10.5281/zenodo.256858). Supplementary informationSupplementary data is available at Bioinformatics online. 
    more » « less
  4. Abstract BackgroundGiven a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a “similar sequence”. Traditionally, “similar sequence” was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. ResultsIn this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in$$\mathcal {O} (|t| + |p| + \ell ^2)$$ O ( | t | + | p | + 2 ) time and$$\mathcal {O} (\ell \log \ell )$$ O ( log ) space, where |t| is the number of$$k$$ k -mers inside the sketch of the reference, |p| is the number of$$k$$ k -mers inside the read’s sketch and$$\ell$$ is the number of times that$$k$$ k -mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm’s performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2. 
    more » « less
  5. Alkan, Can (Ed.)
    Abstract MotivationDetection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads play an important role in identifying novel SVs. Long-read sequencers, such as nanopore sequencing, can address this problem by providing very long reads but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this article, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using base-called nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments. ResultsWe show that HQAlign captures about 4%–6% complementary SVs across different datasets, which are missed by minimap2 alignments while having a standalone performance at par with minimap2 for real nanopore reads data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy by about 10%–50% for SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2 85.64% for nanopore reads alignment to recent telomere-to-telomere CHM13 assembly, and it improves to 86.65% from 83.48% for nanopore reads alignment to GRCh37 human genome. Availability and implementationhttps://github.com/joshidhaivat/HQAlign.git. 
    more » « less