We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both singlekey lookups and common range queries: openrange queries, closedrange queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of stateoftheart orderpreserving indexes, while consuming only 10 bits per trie node. The false positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters to reduce I/O by filtering requests before they access ondisk data structures. Our experiments on a 100 GB dataset show that replacing RocksDB's Bloom filters with SuRFs speeds up openseek (without upperbound) and closedseek (with upperbound) queries by up to 1.5× and 5× with a modest cost on the worstcase (allmissing) point query throughput due to slightly higher false positive rate.
more »
« less
Mitigating False Positives in Filters: to Adapt or to Cache?
A filter is adaptive if it achieves a false positive rate of " on each query independently of the answers to previous queries. Many popular filters such as Bloom filters are not adaptive—an adversary could repeat a falsepositive query many times to drive the falsepositive rate to 1. Bender et al. [4] formalized the definition of adaptivity and gave a provably adaptive filter, the broom filter. Mitzenmacher et al. [20] gave a filter that achieves a lower empirical false positive rate by exploiting repetitions.
We prove that an adaptive filter has a lower false positive rate when the adversary is stochastic. Specifically, we analyze the broom filter against queries drawn from a Zipfian distribution. We validate our analysis empirically by showing that the broom filter achieves a low falsepositive rate on both network traces and synthetic datasets, even when compared to a regular filter augmented with a cache for storing frequently queried items.
more »
« less
 NSFPAR ID:
 10298510
 Date Published:
 Journal Name:
 SIAM Symposium on Algorithmic Principles of Computer System
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Bloom Filters are a spaceefficient data structure used for the testing of membership in a set that errs only in the False Positive direction. However, the standard analysis that measures this False Positive rate provides a form of worst case bound that is both overly conservative for the majority of network applications that utilize Bloom Filters, and reduces accuracy by not taking into account the actual state (number of bits set) of the Bloom Filter after each arrival. In this paper, we more accurately characterize the False Positive dynamics of Bloom Filters as they are commonly used in networking applications. In particular, network applications often utilize a Bloom Filter that “recycles”: it repeatedly fills, and upon reaching a certain level of saturation, empties and fills again. In this context, it makes more sense to evaluate performance using the average False Positive rate instead of the worst case bound. We show how to efficiently compute the average False Positive rate of recycling Bloom Filter variants via renewal and Markov models. We apply our models to both the standard Bloom Filter and a “twophase” variant, verify the accuracy of our model with simulations, and find that the previous analysis’ worstcase formulation leads to up to a 30% reduction in the efficiency of Bloom Filter when applied in network applications, while twophase overhead diminishes as the needed False Positive rate is tightened.more » « less

We present Sparse Numerical ArrayBased Range Filters (SNARF), a learned range filter that efficiently supports range queries for numerical data. SNARF creates a model of the data distribution to map the keys into a bit array which is stored in a compressed form. The model along with the compressed bit array which constitutes SNARF are used to answer membership queries. We evaluate SNARF on multiple synthetic and realworld datasets as a standalone filter and by integrating it into RocksDB. For range queries, SNARF provides up to 50x better false positive rate than stateoftheart range filters, such as SuRF and Rosetta, with the same space usage. We also evaluate SNARF in RocksDB as a filter replacement for filtering requests before they access ondisk data structures. For RocksDB, SNARF can improve the execution time of the system up to 10x compared to SuRF and Rosetta for certain readonly workloads.more » « less

We study dynamic algorithms robust to adaptive input generated from sources with bounded capabilities, such as sparsity or limited interaction. For example, we consider robust linear algebraic algorithms when the updates to the input are sparse but given by an adversary with access to a query oracle. We also study robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner, but the number of interactions between the adversary and the algorithm is bounded. We first recall a unified framework of [HKM+20, BKM+22, ACSS23] which is roughly a quadratic improvement over the na ̈ıve implementation, and only incurs a logarithmic overhead in query time. Although the general framework has diverse applications in machine learning and data science, such as adaptive distance estimation, kernel density estimation, linear regression, range queries, and point queries and serves as a preliminary benchmark, we demonstrate even better algorithmic improvements for (1) reducing the preprocessing time for adaptive distance estimation and (2) permitting an unlimited number of adaptive queries for kernel density estimation. Finally, we complement our theoretical results with additional empirical evaluations.more » « less

The problem of testing monotonicity for Boolean functions on the hypergrid, $f:[n]^d \to \{0,1\}$ is a classic topic in property testing. When $n=2$, the domain is the hypercube. For the hypercube case, a breakthrough result of KhotMinzerSafra (FOCS 2015) gave a nonadaptive, onesided tester making $\otilde(\eps^{2}\sqrt{d})$ queries. Up to polylog $d$ and $\eps$ factors, this bound matches the $\widetilde{\Omega}(\sqrt{d})$query nonadaptive lower bound (ChenDeServedioTan (STOC 2015), ChenWaingartenXie (STOC 2017)). For any $n > 2$, the optimal nonadaptive complexity was unknown. A previous result of the authors achieves a $\otilde(d^{5/6})$query upper bound (SODA 2020), quite far from the $\sqrt{d}$ bound for the hypercube. In this paper, we resolve the nonadaptive complexity of monotonicity testing for all constant $n$, up to $\poly(\eps^{1}\log d)$ factors. Specifically, we give a nonadaptive, onesided monotonicity tester making $\otilde(\eps^{2}n\sqrt{d})$ queries. From a technical standpoint, we prove new directed isoperimetric theorems over the hypergrid $[n]^d$. These results generalize the celebrated directed Talagrand inequalities that were only known for the hypercube.more » « less