In the unit-cost comparison model, a black box takes an input two items and outputs the result of the comparison. Problems like sorting and searching have been studied in this model, and it has been general- ized to include the concept of priced information, where different pairs of items (say database records) have different comparison costs. These comparison costs can be arbitrary (in which case no algorithm can be close to optimal (Charikar et al. STOC 2000)), structured (for exam- ple, the comparison cost may depend on the length of the databases (Gupta et al. FOCS 2001)), or stochastic (Angelov et al. LATIN 2008). Motivated by the database setting where the cost depends on the sizes of the items, we consider the problems of sorting and batched predecessor where two non-uniform sets of items A and B are given as input. (1) In the RAM setting, we consider the scenario where both sets have n keys each. The cost to compare two items in A is a, to compare an item of A to an item of B is b, and to compare two items in B is c. We give upper and lower bounds for the case a ≤ b ≤ c, the case that serves as a warmup for the generalization to the external-memory model. Notice that the case b = 1,a = c = ∞ is the famous “nuts and bolts” problem. ) In the Disk-Access Model (DAM), where transferring elements between disk and internal memory is the main bottleneck, we con- sider the scenario where elements in B are larger than elements in A. The larger items take more I/Os to be brought into memory, consume more space in internal memory, and are required in their entirety for comparisons. A key observation is that the complexity of sorting depends heavily on the interleaving of the small and large items in the final sorted order. If all large elements come after all small elements in the final sorted order, sorting each type separately and concatenating is optimal. However, if the set of predecessors of B in A has size k ≪ n, one must solve an associated batched predecessor problem in order to achieve optimality. We first give output-sensitive lower and upper bounds on the batched predecessor problem, and use these to derive bounds on the complexity of sorting in the two models. Our bounds are tight in most cases, and require novel generalizations of the classical lower bound techniques in external memory to accommodate the non-uniformity of keys. 
                        more » 
                        « less   
                    
                            
                            Solving the Minimal Positional Substring Cover Problem in Sublinear Space
                        
                    
    
            Within the field of haplotype analysis, the Positional Burrows-Wheeler Transform (PBWT) stands out as a key innovation, addressing numerous challenges in genomics. For example, Sanaullah et al. introduced a PBWT-based method that addresses the haplotype threading problem, which involves representing a query haplotype through a minimal set of substrings. To solve this problem using the PBWT data structure, they formulate the Minimal Positional Substring Cover (MPSC) problem, and then, subsequently present a solution for it. Additionally, they present and solve several variants of this problem: k-MPSC, leftmost MPSC, rightmost MPSC, and length-maximal MPSC. Yet, a full PBWT is required for each of their solutions, which yields a significant memory usage requirement. Here, we take advantage of the latest results on run-length encoding the PBWT, to solve the MPSC in a sublinear amount of space. Our methods involve demonstrating that k-Set Maximal Exact Matches (k-SMEMs) can be computed in a sublinear amount of space via efficient computation of k-Matching Statistics (k-MS). This leads to a solution that requires sublinear space for, not only the MPSC problem, but for all its variations proposed by Sanaullah et al. Most importantly, we present experimental results on haplotype panels from the 1000 Genomes Project data that show the utility of these theoretical results. We conclusively demonstrate that our approach markedly decreases the memory required to solve the MPSC problem, achieving a reduction of at least two orders of magnitude compared to the method proposed by Sanaullah et al. This efficiency allows us to solve the problem on large versions of the problem, where other methods are unable to scale to. In summary, the creation of {μ}-PBWT paves the way for new possibilities in conducting in-depth genetic research and analysis on a large scale. All source code is publicly available at https://github.com/dlcgold/muPBWT/tree/k-smem. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 2013998
- PAR ID:
- 10549032
- Editor(s):
- Inenaga, Shunsuke; Puglisi, Simon J
- Publisher / Repository:
- Schloss Dagstuhl – Leibniz-Zentrum für Informatik
- Date Published:
- Volume:
- 296
- ISSN:
- 1868-8969
- ISBN:
- 978-3-95977-326-3
- Page Range / eLocation ID:
- 296-296
- Subject(s) / Keyword(s):
- Positional Burrows-Wheeler Transform r-index minimal position substring cover set-maximal exact matches Theory of computation → Data structures design and analysis
- Format(s):
- Medium: X Size: 16 pages; 874410 bytes Other: application/pdf
- Size(s):
- 16 pages 874410 bytes
- Right(s):
- Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Censor-Hillel, Keren; Grandoni, Fabrizio; Ouaknine, Joel; Puppis, Gabriele (Ed.)We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that matches P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (clog n)^k/k!)) and query time O(m+2^h log log n+occ), where c > 1 is a constant, and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are as follows: O(δ log (n/δ)⋅ c^k (1+(log^k (δ log n))/k!)) and O((m+2^h +occ)log n)) The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log (n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n+δ log (n/δ) ⋅ (clogδ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).more » « less
- 
            We present new results on a number of fundamental problems about dynamic geometric data structures: 1) We describe the first fully dynamic data structures with sublinear amortized update time for maintaining (i) the number of vertices or the volume of the convex hull of a 3D point set, (ii) the largest empty circle for a 2D point set, (iii) the Hausdorff distance between two 2D point sets, (iv) the discrete 1-center of a 2D point set, (v) the number of maximal (i.e., skyline) points in a 3D point set. The update times are near n^{11/12} for (i) and (ii), n^{7/8} for (iii) and (iv), and n^{2/3} for (v). Previously, sublinear bounds were known only for restricted "semi-online" settings [Chan, SODA 2002]. 2) We slightly improve previous fully dynamic data structures for answering extreme point queries for the convex hull of a 3D point set and nearest neighbor search for a 2D point set. The query time is O(log^2n), and the amortized update time is O(log^4n) instead of O(log^5n) [Chan, SODA 2006; Kaplan et al., SODA 2017]. 3) We also improve previous fully dynamic data structures for maintaining the bichromatic closest pair between two 2D point sets and the diameter of a 2D point set. The amortized update time is O(log^4n) instead of O(log^7n) [Eppstein 1995; Chan, SODA 2006; Kaplan et al., SODA 2017].more » « less
- 
            A conditional sampling oracle for a probability distribution D returns samples from the conditional distribution of D restricted to a specified subset of the domain. A recent line of work (Chakraborty et al. 2013 and Cannone et al. 2014) has shown that having access to such a conditional sampling oracle requires only polylogarithmic or even constant number of samples to solve distribution testing problems like identity and uniformity. This significantly improves over the standard sampling model where polynomially many samples are necessary. Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points. Finally, we comment on the applicability of the model and compare with existing ones like streaming, parallel and distributed computational models.more » « less
- 
            Belkin, Mikhail; Kpotufe, Samor (Ed.)We present an $$e^{O(p)} (\log \ell) / (\log \log \ell)$$-approximation algorithm for socially fair clustering with the $$\ell_p$$-objective. In this problem, we are given a set of points in a metric space. Each point belongs to one (or several) of $$\ell$$ groups. The goal is to find a $$k$$-medians, $$k$$-means, or, more generally, $$\ell_p$$-clustering that is simultaneously good for all of the groups. More precisely, we need to find a set of $$k$$ centers $$C$$ so as to minimize the maximum over all groups $$j$$ of $$\sum_{u \text{ in group } j} d(u, C)^p$$. The socially fair clustering problem was independently proposed by Abbasi, Bhaskara, and Venkatasubramanian (2021) and Ghadiri, Samadi, and Vempala (2021). Our algorithm improves and generalizes their $$O(\ell)$$-approximation algorithms for the problem. The natural LP relaxation for the problem has an integrality gap of $$\Omega(\ell)$$. In order to obtain our result, we introduce a strengthened LP relaxation and show that it has an integrality gap of $$\Theta((\log \ell) / (\log \log \ell))$$ for a fixed p. Additionally, we present a bicriteria approximation algorithm, which generalizes the bicriteria approximation of Abbasi et al. (2021).more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    