Title: Fast, Parallel, and Cache-Friendly Suffix Array Construction
String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these indexes are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes, inspired by samplesort. Due to its design, CaPS-SA has excellent memory locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups.
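Purely to fix ideas, the following is a minimal Python sketch of a samplesort-style suffix-sorting pipeline of the kind the abstract describes: sample a few suffixes, pick splitters, partition all suffixes into buckets, and sort the buckets independently. The optional `context` parameter imitates the bounded-context idea by comparing suffixes on at most a bounded number of characters. The function names and parameters (`suffix_array`, `context`, `n_buckets`) are illustrative assumptions; this is not the CaPS-SA implementation.

```python
from concurrent.futures import ThreadPoolExecutor
import bisect


def suffix_array(text, context=None, n_buckets=4):
    """Samplesort-style suffix sorting sketch: pick splitters from a sample of
    suffixes, partition all suffixes into buckets, sort buckets independently,
    and concatenate. With `context` set, suffixes are compared on at most that
    many characters (a bounded-context order; ties are left in index order)."""
    n = len(text)

    def key(i):
        return text[i:] if context is None else text[i:i + context]

    # 1. Sample some suffixes and choose bucket splitters from the sample.
    stride = max(1, n // (4 * n_buckets))
    sample = sorted(key(i) for i in range(0, n, stride))
    step = max(1, len(sample) // n_buckets)
    splitters = sample[step::step][:n_buckets - 1]

    # 2. Assign every suffix to the bucket determined by the splitters.
    buckets = [[] for _ in range(len(splitters) + 1)]
    for i in range(n):
        buckets[bisect.bisect_right(splitters, key(i))].append(i)

    # 3. Sort each bucket independently; in a real multicore build each bucket
    #    would be sorted on its own core (a thread pool stands in for that).
    with ThreadPoolExecutor() as pool:
        sorted_buckets = list(pool.map(lambda b: sorted(b, key=key), buckets))

    # 4. Buckets are already in global order, so concatenation gives the SA.
    return [i for bucket in sorted_buckets for i in bucket]
```

For example, suffix_array("banana") returns [5, 3, 1, 0, 4, 2], the suffix array of "banana".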
Award ID(s):
1763680
PAR ID:
10485536
Author(s) / Creator(s):
; ; ; ;
Editor(s):
Belazzougui, Djamal; Ouangraoua, Aïda
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Journal Name:
23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)
Subject(s) / Keyword(s):
Suffix Array; Longest Common Prefix; Data Structures; Indexing; Parallel Algorithms; Theory of computation → Sorting and searching
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The suffix array is a crucial data structure for efficient string analysis. Over the course of twenty-six years, sequential suffix array construction algorithms have achieved O(n) time complexity and in-place sorting. In this paper, we present the Tunnel algorithm, the first large-scale parallel suffix array construction algorithm with a time complexity of O(n/p) based on the parallel random access machine (PRAM) model. The Tunnel algorithm is built on three key ideas: dividing the problem of size O(n) into p sub-problems of reduced size O(n/p) by replacing long suffixes with shorter prefixes of size at most a constant D; introducing a Tunnel mechanism to efficiently induce the order of a set of suffixes with long common prefixes; and developing a strategy to transform a partially ordered suffix set into a total order by iteratively applying the Tunnel inducing method. We provide a detailed description of the algorithm, along with a thorough analysis of its time and space complexity, to demonstrate its correctness and efficiency. The proposed Tunnel algorithm exhibits scalable performance, making it suitable for large-scale string analytics on large-scale parallel systems.
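The Tunnel mechanism itself is beyond the scope of a short sketch, but the general theme of refining a partial order on suffixes into a total order by repeated rounds can be illustrated with classical prefix doubling (Manber-Myers style), shown below. Each round sorts suffixes by a pair of ranks covering twice as many characters as the previous round. This is a standard textbook technique, not the Tunnel algorithm.

```python
def suffix_array_doubling(text):
    """Prefix-doubling sketch: each round orders suffixes by their first 2k
    characters, encoded as a pair of ranks from the previous round, until all
    ranks are distinct and the order is total."""
    n = len(text)
    if n == 0:
        return []
    sa = list(range(n))
    rank = [ord(c) for c in text]          # round 0: order by first character
    k = 1
    while True:
        def key(i):
            # First k characters via rank[i], next k via rank[i + k].
            return (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:          # all ranks distinct: total order reached
            break
        k *= 2
    return sa
```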
  2. It is known that a context-free grammar (CFG) that produces a single string can be derived from the compact directed acyclic word graph (CDAWG) for the same string. In this work, we show that the CFG derived from a CDAWG is deeply connected to the maximal repeat content of the string it produces and thus has O(m) rules, where m is the number of maximal repeats in the string. We then provide a generic algorithm based on this insight for constructing the CFG from the LCP-intervals of a string in O(n) time, where n is the length of the string. This includes a novel data structure to support stabbing queries on LCP-intervals in O(1+k) time after O(n) preprocessing time, where k is the number of intervals stabbed. These results connect the CFG to properties of the string it produces and relate it to other string data structures, allowing it to be studied independently of the CDAWG and providing opportunities for innovation in grammar-based compression algorithms.
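As a point of reference for the LCP-interval machinery mentioned above, here is a small sketch of the standard bottom-up stack traversal that enumerates all LCP-intervals of an LCP array in linear time. It only illustrates what an LCP-interval is; the paper's O(1+k) stabbing-query structure is not reproduced here.

```python
def lcp_intervals(lcp):
    """Enumerate LCP-intervals as (lcp_value, left, right) triples, where
    lcp[i] is the longest common prefix length of the suffixes at suffix
    array positions i-1 and i (lcp[0] is 0). Bottom-up stack traversal,
    O(n) overall."""
    n = len(lcp)
    if n == 0:
        return []
    intervals = []
    stack = [(0, 0)]                       # (lcp value, left boundary)
    for i in list(range(1, n)) + [n]:      # the final pass flushes the stack
        cur = lcp[i] if i < n else 0
        lb = i - 1
        while cur < stack[-1][0]:
            val, left = stack.pop()
            intervals.append((val, left, i - 1))
            lb = left
        if cur > stack[-1][0]:
            stack.append((cur, lb))
    intervals.append((0, 0, n - 1))        # the root interval
    return intervals
```

For the string banana (SA = [5, 3, 1, 0, 4, 2], LCP = [0, 1, 3, 0, 0, 2]), this yields the intervals (3, 1, 2), (1, 0, 2), (2, 4, 5), and (0, 0, 5).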
  3. Mikolaj Bojanczyk; Emanuela Merelli; David P. Woodruff (Ed.)
    Two equal-length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC '93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1,n] over an alphabet Σ of size σ takes O(n log n) bits of space. It can report any entry in the (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i,i+|P|-1] and P are a p-match. The PST can count the number of occurrences of P in T in O(|P| log σ) time and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is: can we obtain a compressed version of the PST that takes space close to the text's size of n log σ bits and still supports all three functionalities mentioned earlier? In SODA '17, Ganguly et al. answered this question partially by presenting an O(n log σ)-bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(n log n) bits (when log σ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant.
    - Space O(n log σ) bits and query time O(log^ε_σ n);
    - Space O(n log σ log log_σ n) bits and query time O(log log_σ n); and
    - Space O(n log σ log^ε_σ n) bits and query time O(1).
    The first trade-off is an improvement over Ganguly et al.'s result, whereas our third trade-off matches the optimal time performance of Baker's PST while squeezing the space by a factor of roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching, and further improvement is highly unlikely.
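The renaming-invariance at the heart of p-matching is usually captured by Baker's "previous occurrence" encoding: replace every symbol by the distance to its previous occurrence, with 0 marking a first occurrence. Two equal-length strings are a p-match exactly when their encodings coincide, and parameterized suffix structures are built over such encodings. A minimal sketch, treating every symbol as a parameter symbol as in the definition above:

```python
def prev_encoding(s):
    """Distance-to-previous-occurrence encoding; 0 marks a first occurrence.
    The encoding is invariant under one-to-one renaming of the symbols."""
    last = {}
    enc = []
    for i, c in enumerate(s):
        enc.append(i - last[c] if c in last else 0)
        last[c] = i
    return enc


def p_match(s, t):
    """Equal-length strings p-match iff their prev-encodings are identical."""
    return len(s) == len(t) and prev_encoding(s) == prev_encoding(t)
```

For instance, p_match("xyxz", "abac") is True (rename x to a, y to b, z to c), while p_match("xyxz", "abab") is False.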
  4. Large-scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented as strings, and suffix arrays are highly efficient data structures enabling string analysis. However, our personal devices and the corresponding exploratory data analysis (EDA) tools cannot handle big data sets beyond the local memory. Arkouda is a framework under early development that brings together the productivity of Python on the user side with the high performance of Chapel on the server side. In this paper, an efficient suffix array data structure design and integration method is given first. A method for integrating a library of suffix array algorithms, rather than a single algorithm, is presented to enable runtime performance optimization in Arkouda, since different suffix array algorithms may have very different practical performance on strings from various applications. A parallel suffix array construction algorithm framework is given to further exploit hierarchical parallelism across multiple locales in Chapel. A corresponding benchmark is developed to evaluate the feasibility of the provided suffix array integration method and to measure end-to-end performance. Experimental results show that the proposed solution provides data scientists with an easy and efficient way to build suffix arrays with high performance in Python. All our code is open source and available on GitHub (https://github.com/Bader-Research/arkouda/tree/string-suffix-array-functionality).
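The "library of algorithms" idea above can be pictured as a small registry that maps algorithm names to construction routines and lets the caller, or a selection heuristic, choose one at runtime. The sketch below is purely illustrative; the names, the decorator, and the selection rule are hypothetical and are not Arkouda's actual interface.

```python
from typing import Callable, Dict, List, Optional

# Registry of available suffix array construction routines (illustrative only).
SA_LIBRARY: Dict[str, Callable[[str], List[int]]] = {}


def register(name: str):
    """Register a construction routine under a name for runtime selection."""
    def wrap(fn: Callable[[str], List[int]]) -> Callable[[str], List[int]]:
        SA_LIBRARY[name] = fn
        return fn
    return wrap


@register("comparison_sort")
def sa_comparison_sort(text: str) -> List[int]:
    # Simple baseline: sort suffix start positions by the suffixes themselves.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_suffix_array(text: str, algorithm: Optional[str] = None) -> List[int]:
    # Use an explicitly requested algorithm, else a default; a real system
    # would choose based on profiling or on properties of the input string.
    return SA_LIBRARY[algorithm or "comparison_sort"](text)
```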
  5. Blockchain and distributed ledger technologies (DLT) are emerging decentralized infrastructures touted by researchers as a way to improve existing systems that have been limited by centralized governance and proprietary control. These technologies have shown continued success in sustaining the operational models of modern cryptocurrencies and decentralized finance (DeFi) applications. Their success has incentivized growing discussion of their potential applications and adoption in other sectors, such as healthcare, which has a high demand for data liquidity and interoperability. Despite the increasing research effort in adopting blockchain and DLT in healthcare through conceptual designs and prototypes, a major research gap exists in the literature: there is a lack of design recommendations that discuss concrete architectural styles and domain-specific considerations necessary for implementing health data exchange systems based on these technologies. This paper aims to address this gap by introducing a collection of design patterns for constructing blockchain- and DLT-based healthcare systems that support secure and scalable data sharing. Our approach adapts traditional software patterns and proposes novel patterns that take into account both the technical requirements specific to healthcare systems and the implications of these requirements for naive blockchain-based solutions.