Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different tradeoffs have been proposed.
This article introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to
Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high-probability guarantees, can be applied to any hash table that makes use of the front-yard/back-yard paradigm for hash-table design.
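The front-yard/back-yard paradigm referenced above can be sketched in a few lines. This is a minimal illustrative toy, assuming nothing about Iceberg hashing's actual layout: most keys live in small fixed-capacity front-yard buckets, and the rare overflows spill into a back-yard structure. All parameter names and values here are invented for illustration.

```python
class FrontBackHashTable:
    """Toy sketch of the front-yard/back-yard paradigm: most keys land in
    small fixed-capacity front-yard buckets; rare overflows go to a
    back-yard overflow store. Not Iceberg hashing's actual design."""

    def __init__(self, num_buckets=64, bucket_capacity=8):
        self.num_buckets = num_buckets
        self.bucket_capacity = bucket_capacity
        self.front = [[] for _ in range(num_buckets)]  # front yard: flat buckets
        self.back = {}                                 # back yard: overflow store

    def _bucket(self, key):
        return hash(key) % self.num_buckets

    def insert(self, key, value):
        bucket = self.front[self._bucket(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: update in place
                bucket[i] = (key, value)
                return
        if len(bucket) < self.bucket_capacity:
            bucket.append((key, value))    # common case: front yard has room
        else:
            self.back[key] = value         # rare case: bucket full, spill over

    def query(self, key):
        for k, v in self.front[self._bucket(key)]:
            if k == key:
                return v
        return self.back.get(key)          # fall back to the back yard

    def delete(self, key):
        bucket = self.front[self._bucket(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                return True
        return self.back.pop(key, None) is not None
```

The point of the split is that the front yard can be a flat, cache-friendly array because its buckets never grow, while the back yard absorbs the tail of the bucket-load distribution and therefore stays small with high probability.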
NSF-PAR ID: 10485534
Publisher / Repository: ACM
Date Published:
Journal Name: Journal of the ACM
Volume: 70
Issue: 6
ISSN: 0004-5411
Page Range / eLocation ID: 1 to 51
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this

For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable tradeoff curve between time and space. State-of-the-art hash tables offer the following guarantee: if keys/values are Θ(log n) bits each, then it is possible to achieve constant-time insertions/deletions/queries while wasting only O(log log n) bits of space per key when compared to the information-theoretic optimum. Even prior to this bound being achieved, the target of O(log log n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamically-resized filters).

This paper shows that O(log log n) wasted bits per key is not the end of the line for hashing. In fact, for any k ∈ [log* n], it is possible to achieve O(k)-time insertions/deletions, O(1)-time queries, and O(log^(k) n) = O(log log ··· log n) (k iterated logarithms) wasted bits per key, all with high probability in n. This means that, each time we increase insertion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this tradeoff curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constant-time hash tables succinct.

Our results hold not just for fixed-capacity hash tables, but also for hash tables that are dynamically resized (this is a fundamental departure from what is possible for filters), and for hash tables that store very large keys/values, each of which can be up to n^o(1) bits (this breaks with the conventional wisdom that larger keys/values should lead to more wasted bits per key). For very small keys/values, we are able to tighten our bounds to o(1) wasted bits per key, even when k = O(1). Building on this, we obtain a constant-time dynamic filter that uses n log ε^-1 + n log e + o(n) bits of space for a wide choice of
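The iterated-logarithm tradeoff described above is easy to see numerically: each extra application of log shrinks the wasted-bits bound exponentially. A quick sketch (the choice of n = 2^64 and base-2 logs is illustrative):

```python
import math

def iterated_log(n, k):
    """Apply log base 2 exactly k times: log log ... log n (k logs)."""
    x = float(n)
    for _ in range(k):
        x = math.log2(x)
    return x

# For n = 2^64 keys, the O(log^(k) n) wasted-bits target per key shrinks
# from 64 (k=1) to 6 (k=2) to about 2.58 (k=3) bits as k grows.
n = 2 ** 64
for k in range(1, 5):
    print(f"k={k}: log^({k}) n = {iterated_log(n, k):.3f}")
```

This is exactly the sense in which paying an additive constant in insertion/deletion time buys an exponential reduction in wasted space per key.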

Abstract We establish rapid mixing of the random-cluster Glauber dynamics on random Δ-regular graphs for all q ≥ 1 and p < p_u(q, Δ), where the threshold p_u(q, Δ) corresponds to a uniqueness/non-uniqueness phase transition for the random-cluster model on the (infinite) Δ-regular tree. It is expected that this threshold is sharp, and for q > 2 the Glauber dynamics on random Δ-regular graphs undergoes an exponential slowdown at p_u(q, Δ). More precisely, we show that for every q ≥ 1, Δ ≥ 3, and p < p_u(q, Δ), with probability 1 − o(1) over the choice of a random Δ-regular graph on n vertices, the Glauber dynamics for the random-cluster model has mixing time Θ(n log n). As a corollary, we deduce fast mixing of the Swendsen–Wang dynamics for the Potts model on random Δ-regular graphs for every q ≥ 2, in the tree uniqueness region. Our proof relies on a sharp bound on the "shattering time", i.e., the number of steps required to break up any configuration into O(log n)-sized clusters. This is established by analyzing a delicate and novel iterative scheme to simultaneously reveal the underlying random graph with clusters of the Glauber dynamics configuration on it, at a given time.
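The dynamics being analyzed is the standard heat-bath edge update for the random-cluster model: resample a uniformly random edge conditioned on the rest of the configuration, where the acceptance probability depends on whether the edge's endpoints are already connected. A naive sketch of one such step (the brute-force connectivity check is for clarity only; the graph and parameter names are illustrative):

```python
import random

def glauber_step(edges, state, p, q, rng):
    """One heat-bath update of the random-cluster Glauber dynamics:
    pick a uniform edge e; open it with probability p if its endpoints
    are connected in the configuration without e, and with probability
    p / (p + q*(1-p)) otherwise. `state[i]` is True iff edge i is open."""
    e = rng.randrange(len(edges))
    u, v = edges[e]
    # Build adjacency over open edges other than e, then BFS/DFS from u.
    adj = {}
    for i, (a, b) in enumerate(edges):
        if i != e and state[i]:
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    stack, seen = [u], {u}
    while stack:
        x = stack.pop()
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    connected = v in seen
    prob_open = p if connected else p / (p + q * (1 - p))
    state[e] = rng.random() < prob_open
    return state
```

Note that for q = 1 the two cases coincide (both probabilities equal p), recovering independent bond percolation; the mixing-time question is interesting precisely because for q > 1 the update couples edges through connectivity.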
Abstract Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose i-th entry is the number of indices j ≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of k = 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for k = O(1), works in O(n) space and, with high probability, in O(n · min{m^k, log^k n}) time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop O(n^2)-time algorithms to compute all (k, m)-mappability tables for a fixed m and all k ∈ {0, …, m}, or for a fixed k and all m ∈ {k, …, n}. Finally, we show that, for k, m = Θ(log n), the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
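The definition of the (k, m)-mappability table can be captured directly by a brute-force sketch. This quadratic-time version is only the problem statement made executable, not the paper's O(n · min{m^k, log^k n}) algorithm:

```python
def mappability_table(T, k, m):
    """Brute-force O(n^2 * m) computation of the (k, m)-mappability table:
    entry i counts indices j != i whose length-m substring of T matches
    T[i:i+m] with at most k mismatches (Hamming distance <= k)."""
    n = len(T)
    table = []
    for i in range(n - m + 1):
        count = 0
        for j in range(n - m + 1):
            if j == i:
                continue
            mismatches = sum(1 for a, b in zip(T[i:i + m], T[j:j + m]) if a != b)
            if mismatches <= k:
                count += 1
        table.append(count)
    return table
```

A high entry means the corresponding position is hard to map uniquely, which is exactly the quantity of interest in resequencing.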
Abstract Background Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner, which is time-consuming.

Results In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequence using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem: the binary hash code for a protein sequence can be regarded as a number, so in the prescreening stage of DHL-PPI, the problem of comparing a protein sequence against a database with M proteins can be transformed into the much simpler problem of finding a number inside a sorted array of length M. This prescreening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHL-PPI uses the Hamming distance to verify the final PPI relationship.

Conclusions The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples from four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall, and F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduces the time complexity of an all-against-all PPI prediction for a database with M proteins from O(M^2) to O(M log M). With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.
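The prescreening idea above is just binary search over integer hash codes. A minimal sketch, assuming hash codes have already been produced by some encoder (the function names and example codes here are illustrative, not part of DHL-PPI):

```python
import bisect

def prescreen(sorted_codes, query_code):
    """Prescreening stage sketch: treat each protein's binary hash code as
    an integer and locate exact-code matches in a sorted array via binary
    search, replacing an O(M) pairwise scan with an O(log M) lookup.
    Returns the indices of candidate proteins sharing the query's code."""
    lo = bisect.bisect_left(sorted_codes, query_code)
    hi = bisect.bisect_right(sorted_codes, query_code)
    return list(range(lo, hi))

def hamming(a, b):
    """Verification stage: Hamming distance between two equal-length
    binary codes represented as integers (count of differing bits)."""
    return bin(a ^ b).count("1")
```

In the full pipeline, the prescreened candidates would then be confirmed with the Hamming-distance check, so the expensive comparison is only run on a small candidate set.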
Abstract Background Many bioinformatics applications involve bucketing a set of sequences, where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps.

Results In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d1, d2)-sensitive if any two sequences within an edit distance of d1 are mapped into at least one shared bucket, and any two sequences with distance at least d2 are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d1, d2) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds on these two parameters in different settings and show that some of our constructed LSB functions are optimal.

Conclusion These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates, while also providing insights into the hardness of designing ungapped LSH functions.
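The (d1, d2)-sensitivity definition can be verified exhaustively on small alphabets. The sketch below checks the definition by brute force; the deletion-based bucketing used as an example is a classic construction for illustration, not necessarily one of the paper's LSB functions:

```python
from itertools import product

def edit_distance(s, t):
    """Standard Levenshtein distance via a rolling one-row DP."""
    dp = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, b in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # insertion
                                     dp[j - 1] + 1,    # deletion
                                     prev + (a != b))  # substitution/match
    return dp[-1]

def is_sensitive(bucketing, d1, d2, length, alphabet="AC"):
    """Brute-force check of (d1, d2)-sensitivity over all pairs of
    sequences of the given length: pairs at edit distance <= d1 must
    share at least one bucket; pairs at distance >= d2 must share none."""
    seqs = ["".join(p) for p in product(alphabet, repeat=length)]
    for s in seqs:
        for t in seqs:
            d = edit_distance(s, t)
            shared = bucketing(s) & bucketing(t)
            if d <= d1 and not shared:
                return False
            if d >= d2 and shared:
                return False
    return True

def one_deletion(s):
    """Example bucketing: map a sequence to the set of all strings obtained
    by deleting one character (a classic deletion-based sketch)."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}
```

For instance, `one_deletion` is (1, 3)-sensitive for length-3 sequences over a two-letter alphabet (a single substitution leaves a shared deletion, while distance-3 pairs share none), but it is not (2, 3)-sensitive, which the exhaustive checker confirms.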