We revisit the following version of the Gapped String Indexing problem, where the goal is to preprocess a text T[1..n] to enable efficient reporting of all occ occurrences of a gapped pattern P=P1[α..β]P2 in T. An occurrence of P in T is defined as a pair (i,j) where substrings T[i..i+|P1|) and T[j..j+|P2|) match P1 and P2, respectively, with a gap j−(i+|P1|) lying within the interval [α..β]. This problem has significant applications in computational biology and text mining. A hardness result on this problem suggests that any index with polylogarithmic query time must occupy near quadratic space. In a recent study [STACS 2024], Bille et al. presented a sub-quadratic space index using space O˜(n2−δ/3), where 0≤δ≤1 is a parameter fixed at the time of index construction. Its query time is O˜(|P1|+|P2|+nδ·(1+occ)), which is sub-linear per occurrence when δ<1. We show how to achieve a gap-sensitive query time of O˜(|P1|+|P2|+nδ·(1+occ1−δ)+∑g∈[α..β]occg·gδ) using the same space, where occg denotes the number of occurrences with gap g. This is faster when there are many occurrences with small gaps.
more »
« less
Repetition Aware Text Indexing for Matching Patterns with Wildcards
We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that matches P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (clog n)^k/k!)) and query time O(m+2^h log log n+occ), where c > 1 is a constant, and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows: O(δ log (n/δ)⋅ c^k (1+(log^k (δ log n))/k!)) and O((m+2^h +occ)log n)) The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log (n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n+δ log (n/δ) ⋅ (clogδ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).
more »
« less
- Award ID(s):
- 2315822
- PAR ID:
- 10614547
- Editor(s):
- Censor-Hillel, Keren; Grandoni, Fabrizio; Ouaknine, Joel; Puppis, Gabriele
- Publisher / Repository:
- Schloss Dagstuhl – Leibniz-Zentrum für Informatik
- Date Published:
- Volume:
- 334
- ISSN:
- 1868-8969
- ISBN:
- 978-3-95977-372-0
- Page Range / eLocation ID:
- 88:1-88:20
- Subject(s) / Keyword(s):
- Pattern Matching Text Indexing Wildcard Matching Theory of computation → Pattern matching
- Format(s):
- Medium: X Size: 20 pages; 1142401 bytes Other: application/pdf
- Size(s):
- 20 pages 1142401 bytes
- Right(s):
- Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Bringmann, Karl; Grohe, Martin; Puppis, Gabriele; Svensson, Ola (Ed.)We revisit the noisy binary search model of [Karp and Kleinberg, 2007], in which we have n coins with unknown probabilities p_i that we can flip. The coins are sorted by increasing p_i, and we would like to find where the probability crosses (to within ε) of a target value τ. This generalized the fixed-noise model of [Burnashev and Zigangirov, 1974], in which p_i = 1/2 ± ε, to a setting where coins near the target may be indistinguishable from it. It was shown in [Karp and Kleinberg, 2007] that Θ(1/ε² log n) samples are necessary and sufficient for this task. We produce a practical algorithm by solving two theoretical challenges: high-probability behavior and sharp constants. We give an algorithm that succeeds with probability 1-δ from 1/C_{τ, ε} ⋅ (log₂ n + O(log^{2/3} n log^{1/3} 1/(δ) + log 1/(δ))) samples, where C_{τ, ε} is the optimal such constant achievable. For δ > n^{-o(1)} this is within 1 + o(1) of optimal, and for δ ≪ 1 it is the first bound within constant factors of optimal.more » « less
-
We consider the problem of space-efficiently estimating the number of simplices in a hypergraph stream. This is the most natural hypergraph generalization of the highly-studied problem of estimating the number of triangles in a graph stream. Our input is a k-uniform hypergraph H with n vertices and m hyperedges, each hyperedge being a k-sized subset of vertices. A k-simplex in H is a subhypergraph on k+1 vertices X such that all k+1 possible hyperedges among X exist in H. The goal is to process the hyperedges of H, which arrive in an arbitrary order as a data stream, and compute a good estimate of T_k(H), the number of k-simplices in H. We design a suite of algorithms for this problem. As with triangle-counting in graphs (which is the special case k = 2), sublinear space is achievable but only under a promise of the form T_k(H) ≥ T. Under such a promise, our algorithms use at most four passes and together imply a space bound of O(ε^{-2} log δ^{-1} polylog n ⋅ min{(m^{1+1/k})/T, m/(T^{2/(k+1)})}) for each fixed k ≥ 3, in order to guarantee an estimate within (1±ε)T_k(H) with probability ≥ 1-δ. We also give a simpler 1-pass algorithm that achieves O(ε^{-2} log δ^{-1} log n⋅ (m/T) (Δ_E + Δ_V^{1-1/k})) space, where Δ_E (respectively, Δ_V) denotes the maximum number of k-simplices that share a hyperedge (respectively, a vertex), which generalizes a previous result for the k = 2 case. We complement these algorithmic results with space lower bounds of the form Ω(ε^{-2}), Ω(m^{1+1/k}/T), Ω(m/T^{1-1/k}) and Ω(mΔ_V^{1/k}/T) for multi-pass algorithms and Ω(mΔ_E/T) for 1-pass algorithms, which show that some of the dependencies on parameters in our upper bounds are nearly tight. Our techniques extend and generalize several different ideas previously developed for triangle counting in graphs, using appropriate innovations to handle the more complicated combinatorics of hypergraphs.more » « less
-
Kumar, Amit; Ron-Zewi, Noga (Ed.)We study the Matrix Multiplication Verification Problem (MMV) where the goal is, given three n × n matrices A, B, and C as input, to decide whether AB = C. A classic randomized algorithm by Freivalds (MFCS, 1979) solves MMV in Õ(n²) time, and a longstanding challenge is to (partially) derandomize it while still running in faster than matrix multiplication time (i.e., in o(n^ω) time). To that end, we give two algorithms for MMV in the case where AB - C is sparse. Specifically, when AB - C has at most O(n^δ) non-zero entries for a constant 0 ≤ δ < 2, we give (1) a deterministic O(n^(ω-ε))-time algorithm for constant ε = ε(δ) > 0, and (2) a randomized Õ(n²)-time algorithm using δ/2 ⋅ log₂ n + O(1) random bits. The former algorithm is faster than the deterministic algorithm of Künnemann (ESA, 2018) when δ ≥ 1.056, and the latter algorithm uses fewer random bits than the algorithm of Kimbrel and Sinha (IPL, 1993), which runs in the same time and uses log₂ n + O(1) random bits (in turn fewer than Freivalds’s algorithm). Our algorithms are simple and use techniques from coding theory. Let H be a parity-check matrix of a Maximum Distance Separable (MDS) code, and let G = (I | G') be a generator matrix of a (possibly different) MDS code in systematic form. Our deterministic algorithm uses fast rectangular matrix multiplication to check whether HAB = HC and H(AB)^T = H(C^T), and our randomized algorithm samples a uniformly random row g' from G' and checks whether g'AB = g'C and g'(AB)^T = g'C^T. We additionally study the complexity of MMV. We first show that all algorithms in a natural class of deterministic linear algebraic algorithms for MMV (including ours) require Ω(n^ω) time. We also show a barrier to proving a super-quadratic running time lower bound for matrix multiplication (and hence MMV) under the Strong Exponential Time Hypothesis (SETH). Finally, we study relationships between natural variants and special cases of MMV (with respect to deterministic Õ(n²)-time reductions).more » « less
-
Mikolaj Bojanczyk; Emanuela Merelli; David P. Woodruff (Ed.)Two equal length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC' 93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(nlog n) bits of space. It can report any entry in (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i,i+|P|-1] and P are a p-match. The PST can count the number of occurrences of P in T in time O(|P|log σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is, can we obtain a compressed version of PST that takes space close to the text’s size of nlogσ bits and still support all three functionalities mentioned earlier? In SODA' 17, Ganguly et al. answered this question partially by presenting an O(nlogσ) bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(nlog n) bits (when logσ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant. - Space O(n logσ) bits and query time O(log_σ^ε n); - Space O(n logσlog log_σ n) bits and query time O(log log_σ n); and - Space O(n logσ log^ε_σ n) bits and query time O(1). The first trade-off is an improvement over Ganguly et al.’s result, whereas our third trade-off matches the optimal time performance of Baker’s PST while squeezing the space by a factor roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching and further improvement is highly unlikely.more » « less
An official website of the United States government

