skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Constructing the CDAWG CFG using LCP-Intervals
It is known that a context-free grammar (CFG) that produces a single string can be derived from the compact directed acyclic word graph (CDAWG) for the same string. In this work, we show that the CFG derived from a CDAWG is deeply connected to the maximal repeat content of the string it produces and thus has O(m) rules, where m is the number of maximal repeats in the string. We then provide a generic algorithm based on this insight for constructing the CFG from the LCP-intervals of a string in O(n) time, where n is the length of the string. This includes a novel data-structure to support stabbing queries on LCPintervals in O(1+k) time after O(n) preprocessing time, where k is the number of intervals stabbed. These results connect the CFG to properties of the string it produces and relates it to other string data-structures, allowing it to be studied independently of the CDAWG and providing opportunity for innovation of grammar-based compression algorithms.  more » « less
Award ID(s):
2105391
PAR ID:
10429842
Author(s) / Creator(s):
;
Date Published:
Journal Name:
2023 Data Compression Conference
Page Range / eLocation ID:
178 to 187
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Gørtz, Inge Li; Farach-Colton, Martin; Puglisi, Simon J; Herman, Grzegorz (Ed.)
    We give the first almost-linear time algorithm for computing the maximal k-edge-connected subgraphs of an undirected unweighted graph for any constant k. More specifically, given an n-vertex m-edge graph G = (V,E) and a number k = log^o(1) n, we can deterministically compute in O(m+n^{1+o(1)}) time the unique vertex partition {V_1,… ,V_z} such that, for every i, V_i induces a k-edge-connected subgraph while every superset V'_i ⊃ V_{i} does not. Previous algorithms with linear time work only when k ≤ 2 [Tarjan SICOMP'72], otherwise they all require Ω(m+n√n) time even when k = 3 [Chechik et al. SODA'17; Forster et al. SODA'20]. Our algorithm also extends to the decremental graph setting; we can deterministically maintain the maximal k-edge-connected subgraphs of a graph undergoing edge deletions in m^{1+o(1)} total update time. Our key idea is a reduction to the dynamic algorithm supporting pairwise k-edge-connectivity queries [Jin and Sun FOCS'20]. 
    more » « less
  2. Nissim, K.; Waters, B. (Ed.)
    Recent new constructions of rate-1 OT [Döttling, Garg, Ishai, Malavolta, Mour, and Ostrovsky, CRYPTO 2019] have brought this primitive under the spotlight and the techniques have led to new feasibility results for private-information retrieval, and homomorphic encryption for branching programs. The receiver communication of this construction consists of a quadratic (in the sender's input size) number of group elements for a single instance of rate-1 OT. Recently [Garg, Hajiabadi, Ostrovsky, TCC 2020] improved the receiver communication to a linear number of group elements for a single string-OT. However, most applications of rate-1 OT require executing it multiple times, resulting in large communication costs for the receiver. In this work, we introduce a new technique for amortizing the cost of multiple rate-1 OTs. Specifically, based on standard pairing assumptions, we obtain a two-message rate-1 OT protocol for which the amortized cost per string-OT is asymptotically reduced to only four group elements. Our results lead to significant communication improvements in PSI and PIR, special cases of SFE for branching programs. - PIR: We obtain a rate-1 PIR scheme with client communication cost of $$O(\lambda\cdot\log N)$$ group elements for security parameter $$\lambda$$ and database size $$N$$. Notably, after a one-time setup (or one PIR instance), any following PIR instance only requires communication cost $$O(\log N)$$ number of group elements. - PSI with unbalanced inputs: We apply our techniques to private set intersection with unbalanced set sizes (where the receiver has a smaller set) and achieve receiver communication of $$O((m+\lambda) \log N)$$ group elements where $m, N$ are the sizes of the receiver and sender sets, respectively. Similarly, after a one-time setup (or one PSI instance), any following PSI instance only requires communication cost $$O(m \cdot \log N)$$ number of group elements. All previous sublinear-communication non-FHE based PSI protocols for the above unbalanced setting were also based on rate-1 OT, but incurred at least $$O(\lambda^2 m \log N)$$ group elements. 
    more » « less
  3. Censor-Hillel, Keren; Grandoni, Fabrizio; Ouaknine, Joel; Puppis, Gabriele (Ed.)
    We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that matches P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (clog n)^k/k!)) and query time O(m+2^h log log n+occ), where c > 1 is a constant, and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows: O(δ log (n/δ)⋅ c^k (1+(log^k (δ log n))/k!)) and O((m+2^h +occ)log n)) The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log (n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n+δ log (n/δ) ⋅ (clogδ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters). 
    more » « less
  4. Mutzel, Petra and (Ed.)
    Grammar compression is, next to Lempel-Ziv (LZ77) and run-length Burrows-Wheeler transform (RLBWT), one of the most flexible approaches to representing and processing highly compressible strings. The main idea is to represent a text as a context-free grammar whose language is precisely the input string. This is called a straight-line grammar (SLG). An AVL grammar, proposed by Rytter [Theor. Comput. Sci., 2003] is a type of SLG that additionally satisfies the AVL property: the heights of parse trees for children of every nonterminal differ by at most one. In contrast to other SLG constructions, AVL grammars can be constructed from the LZ77 parsing in compressed time: 𝒪(z log n) where z is the size of the LZ77 parsing and n is the length of the input text. Despite these advantages, AVL grammars are thought to be too large to be practical. We present a new technique for rapidly constructing a small AVL grammar from an LZ77 or LZ77-like parse. Our algorithm produces grammars that are always at least five times smaller than those produced by the original algorithm, and usually not more than double the size of grammars produced by the practical Re-Pair compressor [Larsson and Moffat, Proc. IEEE, 2000]. Our algorithm also achieves low peak RAM usage. By combining this algorithm with recent advances in approximating the LZ77 parsing, we show that our method has the potential to construct a run-length BWT in about one third of the time and peak RAM required by other approaches. Overall, we show that AVL grammars are surprisingly practical, opening the door to much faster construction of key compressed data structures 
    more » « less
  5. A Simple Temporal Network with Uncertainty (STNU) is a data structure for reasoning about time constraints on actions that may have uncertain durations. An STNU is dispatchable if it can be executed in real-time with minimal computation 1) satisfying all constraints no matter how the uncertain durations play out and 2) retaining maximum flexibility. The fastest known algorithm for converting STNUs into dispatchable form runs in O(n3) time, where n is the number of timepoints. This paper presents a faster algorithm that runs in O(mn + kn2 + n2 logn) time, where m is the number of edges and k is the number of uncertain durations. This performance is particularly meaningful in fields like Business Process Management, where sparse STNUs can represent temporal processes or plans. For sparse STNUs, our algorithm generates dispatchable forms in time O(n2 logn), a significant improvement over the O(n3)-time previous fastest algorithm. 
    more » « less