 Award ID(s):
 2008602
 NSFPAR ID:
 10275784
 Date Published:
 Journal Name:
 Proceedings of the VLDB Endowment
 Volume:
 14
 Issue:
 4
 ISSN:
 21508097
 Page Range / eLocation ID:
 471 to 484
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Mikolaj Bojanczyk ; Emanuela Merelli ; David P. Woodruff (Ed.)Two equal length strings are a parameterized match (pmatch) iff there exists a onetoone function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC' 93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(nlog n) bits of space. It can report any entry in (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i,i+P1] and P are a pmatch. The PST can count the number of occurrences of P in T in time O(Plog σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is, can we obtain a compressed version of PST that takes space close to the text’s size of nlogσ bits and still support all three functionalities mentioned earlier? In SODA' 17, Ganguly et al. answered this question partially by presenting an O(nlogσ) bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(nlog n) bits (when logσ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant.  Space O(n logσ) bits and query time O(log_σ^ε n);  Space O(n logσlog log_σ n) bits and query time O(log log_σ n); and  Space O(n logσ log^ε_σ n) bits and query time O(1). The first tradeoff is an improvement over Ganguly et al.’s result, whereas our third tradeoff matches the optimal time performance of Baker’s PST while squeezing the space by a factor roughly log_σ n. We highlight that our tradeoffs match the spaceandtime bounds of the bestknown compressed text indexes for exact pattern matching and further improvement is highly unlikely.more » « less

null (Ed.)Given an input string, the BurrowsWheeler Transform (BWT) can be seen as a reversible permutation of it that allows efficient compression and fast substring queries. Due to these properties, it has been widely applied in the analysis of genomic sequence data, enabling important tasks such as read alignment. Mantaci et al. [TCS2007] extended the notion of the BWT to a collection of strings by defining the extended BurrowsWheeler Transform (eBWT). This definition requires no modification of the input collection, and has the property that the output is independent of the order of the strings in the collection. However, over the years, the term eBWT has been used more generally to describe any BWT of a collection of strings. The fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple lineartime algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first lineartime algorithm for computing the BWT of a single string that uses neither an endofstring symbol nor Lyndon rotations. We also combine our new eBWT construction with a variation of prefixfree parsing (PFP) [WABI 2019] to allow for construction of the eBWT on large collections of genomic sequences. We implement this combined algorithm (pfpebwt) and evaluate it on a collection of human chromosomes 19 from the 1,000 Genomes Project, on a collection of Salmonella genomes from GenomeTrakr, and on a collection of SARSCoV2 genomes from EBI’s COVID19 data portal. We demonstrate that pfpebwt is the fastest method for all collections, with a maximum speedup of 7.6x on the second best method. The peak memory is at most 2x larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1x improvement in peak.more » « less

Abstract Motivation Modern methods for computationintensive tasks in sequence analysis (e.g. read mapping, sequence alignment, genome assembly, etc.) often first transform each sequence into a list of short, regularlength seeds so that compact data structures and efficient algorithms can be employed to handle the evergrowing largescale data. Seeding methods using kmers (substrings of length k) have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates as kmers cannot tolerate errors.
Results We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order overall lengthk strings. Finding the smallest subsequence of a string by enumeration is impractical as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and the probability of hash collision using the ABC order is close to the Jaccard index. We then show that SubseqHash overwhelmingly outperforms the substringbased seeding methods in producing highquality seedmatches for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling the high error rates and we expect it to be widely adapted for longreads analysis.
Availability and implementation SubseqHash is freely available at https://github.com/ShaoGroup/subseqhash.

Belazzougui, Djamal ; Ouangraoua, Aïda (Ed.)String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly nontrivial to implement and parallelize. In this paper we present CaPSSA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort. Due to its design, CaPSSA has excellent memorylocality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPSSA outperforms existing stateoftheart parallel SA and LCParray construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a boundedcontext SA and show that CaPSSA can easily be extended to exploit this structure to obtain further speedups.more » « less

Realtime decision making in IoT applications relies upon spaceefficient evaluation of queries over streaming data. To model the uncertainty in the classification of data being processed, we consider the model of probabilistic strings  sequences of discrete probability distributions over a finite set of events, and initiate the study of space complexity of streaming computation for different classes of queries over such probabilistic strings. We first consider the problem of computing the probability that a word, sampled from the distribution defined by the probabilistic string read so far, is accepted by a given deterministic finite automaton. We show that this regular pattern matching problem can be solved using space that is only polylogarithmic in the string length (and polynomial in the size of the DFA) if we are allowed a multiplicative approximation error. Then we show how to generalize this result to quantitative queries specified by additive cost register automata  these are automata that map strings to numerical values using finite control and registers that get updated using linear transformations. Finally, we consider the case when updates in such an automaton involve tests, and in particular, when there is a counter variable that can be either incremented or decremented but decrements only apply when the counter value is nonzero. In this case, the desired answer depends on the probability distribution over the set of possible counter values that can range from 0 to n for a string of length n. Under a mild assumption, namely probabilities of the individual events are bounded away from 0 and 1, we show that there is an algorithm that can compute all n entries of this probability distribution vector to within additive 1/poly(n) error using space that is only Õ(n). In establishing these results, we introduce several new technical ideas that may prove useful for designing spaceefficient algorithms for other query models over probabilistic strings.more » « less