skip to main content

This content will become publicly available on December 31, 2024

Title: Iceberg Hashing: Optimizing Many Hash-Table Criteria at Once

Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different trade-offs have been proposed.

This article introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to1 - o(1)while being stable, meaning that the position where an element is stored only ever changes when resizes occur. In fact, in the setting where keys are Θ (logn) bits, the space guarantees that Iceberg hashing offers, namely that it uses at most\(\log \binom{|U|}{n} + O(n \log \ \text{log} n)\)bits to storenitems from a universeU, matches a lower bound by Demaine et al. that applies to any stable hash table.

Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high probability guarantees, can be applied to any hash table that makes use of the front-yard/backyard paradigm for hash table design.

more » « less
Award ID(s):
2317838 1938709 2247577 2106827 1763680
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Date Published:
Journal Name:
Journal of the ACM
Page Range / eLocation ID:
1 to 51
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable tradeoff curve between time and space. State-of-the-art hash tables offer the following guarantee: If keys/values are Θ(logn) bits each, then it is possible to achieve constant-time insertions/deletions/queries while wasting only O(loglogn) bits of space per key when compared to the information-theoretic optimum. Even prior to this bound being achieved, the target of O(log log n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamically-resized filters). This paper shows that O(log log n) wasted bits per key is not the end of the line for hashing. In fact, for any k ∈ [log∗ n], it is possible to achieve O(k)-time insertions/deletions, O(1)-time queries, and O(log(k) n) = Ologlog···logn 􏰟 􏰞􏰝 􏰠 k wasted bits per key (all with high probability in n). This means that, each time we increase inser- tion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this tradeoff curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constant-time hash tables succinct. Our results hold not just for fixed-capacity hash tables, but also for hash tables that are dynamically resized (this is a fundamental departure from what is possible for filters); and for hash tables that store very large keys/values, each of which can be up to no(1) bits (this breaks with the conventional wisdom that larger keys/values should lead to more wasted bits per key). For very small keys/values, we are able to tighten our bounds to o(1) wasted bits per key, even when k = O(1). Building on this, we obtain a constant-time dynamic filter that uses n􏰕logε−1􏰖+nloge+o(n) bits of space for a wide choice of 
    more » « less
  2. Abstract

    We establish rapid mixing of the random-cluster Glauber dynamics on random$$\varDelta $$Δ-regular graphs for all$$q\ge 1$$q1and$$pp<pu(q,Δ), where the threshold$$p_u(q,\varDelta )$$pu(q,Δ)corresponds to a uniqueness/non-uniqueness phase transition for the random-cluster model on the (infinite)$$\varDelta $$Δ-regular tree. It is expected that this threshold is sharp, and for$$q>2$$q>2the Glauber dynamics on random$$\varDelta $$Δ-regular graphs undergoes an exponential slowdown at$$p_u(q,\varDelta )$$pu(q,Δ). More precisely, we show that for every$$q\ge 1$$q1,$$\varDelta \ge 3$$Δ3, and$$pp<pu(q,Δ), with probability$$1-o(1)$$1-o(1)over the choice of a random$$\varDelta $$Δ-regular graph onnvertices, the Glauber dynamics for the random-cluster model has$$\varTheta (n \log n)$$Θ(nlogn)mixing time. As a corollary, we deduce fast mixing of the Swendsen–Wang dynamics for the Potts model on random$$\varDelta $$Δ-regular graphs for every$$q\ge 2$$q2, in the tree uniqueness region. Our proof relies on a sharp bound on the “shattering time”, i.e., the number of steps required to break up any configuration into$$O(\log n)$$O(logn)sized clusters. This is established by analyzing a delicate and novel iterative scheme to simultaneously reveal the underlying random graph with clusters of the Glauber dynamics configuration on it, at a given time.

    more » « less
  3. Abstract

    Sequence mappability is an important task in genome resequencing. In the (km)-mappability problem, for a given sequenceTof lengthn, the goal is to compute a table whoseith entry is the number of indices$$j \ne i$$jisuch that the length-msubstrings ofTstarting at positionsiandjhave at mostkmismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of$$k=1$$k=1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for$$k=O(1)$$k=O(1), works in$$O(n)$$O(n)space and, with high probability, in$$O(n \cdot \min \{m^k,\log ^k n\})$$O(n·min{mk,logkn})time. Our algorithm requires a careful adaptation of thek-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop$$O(n^2)$$O(n2)-time algorithms to computeall(km)-mappability tables for a fixedmand all$$k\in \{0,\ldots ,m\}$$k{0,,m}or a fixedkand all$$m\in \{k,\ldots ,n\}$$m{k,,n}. Finally, we show that, for$$k,m = \Theta (\log n)$$k,m=Θ(logn), the (km)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.

    more » « less
  4. Abstract Background

    Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner. This is time consuming.


    In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database withMproteins can be transformed into a much more simpler problem: to find a number inside a sorted array of lengthM. This pre-screening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHL-PPI uses the Hamming distance to verify the final PPI relationship.


    The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall or F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduced the time complexity from$$O(M^2)$$O(M2)to$$O(M\log M)$$O(MlogM)for performing an all-against-all PPI prediction for a database withMproteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.

    more » « less
  5. Abstract Background

    Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existingk-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps.


    In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be$$(d_1, d_2)$$(d1,d2)-sensitive if any two sequences within an edit distance of$$d_1$$d1are mapped into at least one shared bucket, and any two sequences with distance at least$$d_2$$d2are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of$$(d_1,d_2)$$(d1,d2)and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal.


    These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.

    more » « less