skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Mitigating False Positives in Filters: to Adapt or to Cache?
A filter is adaptive if it achieves a false positive rate of " on each query independently of the answers to previous queries. Many popular filters such as Bloom filters are not adaptive—an adversary could repeat a false-positive query many times to drive the false-positive rate to 1. Bender et al. [4] formalized the definition of adaptivity and gave a provably adaptive filter, the broom filter. Mitzenmacher et al. [20] gave a filter that achieves a lower empirical false- positive rate by exploiting repetitions. We prove that an adaptive filter has a lower false- positive rate when the adversary is stochastic. Specifically, we analyze the broom filter against queries drawn from a Zipfian distribution. We validate our analysis empirically by showing that the broom filter achieves a low false-positive rate on both network traces and synthetic datasets, even when compared to a regular filter augmented with a cache for storing frequently queried items.  more » « less
Award ID(s):
1938180 2106999 2118620
PAR ID:
10298510
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
SIAM Symposium on Algorithmic Principles of Computer System
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Filters trade off accuracy for space and occasionally return false positive matches with a bounded error. Numerous systems use filters in fast memory to avoid performing expensive I/Os to slow storage. A fundamental limitation in traditional filters is that they do not change their representation upon seeing a false positive match. Therefore, the maximum false positive rate is only guaranteed for a single query, not for an arbitrary set of queries. We can improve the filter's performance on a stream of queries, especially on a skewed distribution, if we can adapt after encountering false positives. Adaptive filters, such as telescoping quotient filters and adaptive cuckoo filters, update their representation upon detecting a false positive to avoid repeating the same error in the future. Adaptive filters require an auxiliary structure, typically much larger than the main filter and often residing on slow storage, to facilitate adaptation. However, existing adaptive filters are not practical and have not been adopted in real-world systems for two main reasons. First, they offer weak adaptivity guarantees, meaning that fixing a new false positive can cause a previously fixed false positive to come back. Secondly, the sub-optimal design of the auxiliary structure results in adaptivity overheads so substantial that they can actually diminish overall system performance compared to a traditional filter. In this paper, we design and implement the \sysname, the first practical adaptive filter with minimal adaptivity overhead and strong adaptivity guarantees, which means that the performance and false-positive guarantees continue to hold even for adversarial workloads. The \sysname is based on the state-of-the-art quotient filter design and preserves all the critical features of the quotient filter such as cache efficiency and mergeability. Furthermore, we employ a new auxiliary structure design which results in considerably low adaptivity overhead and makes the \sysname practical in real systems. We evaluate the \sysname by using it to filter queries to an on-disk B-tree database and find no negative impact on insert or query performance compared to traditional filters. Against adversarial workloads, the \sysname preserves system performance, whereas traditional filters incur 2× slowdown from adversaries representing as low as 1% of the workload. Finally, we show that on skewed query workloads, the \sysname can reduce the false-positive rate 100× using negligible (1/1000th of a bit per item) space overhead. 
    more » « less
  2. We present the Succinct Range Filter (SuRF), a fast and compact data structure for approximate membership tests. Unlike traditional Bloom filters, SuRF supports both single-key lookups and common range queries: open-range queries, closed-range queries, and range counts. SuRF is based on a new data structure called the Fast Succinct Trie (FST) that matches the point and range query performance of state-of-the-art order-preserving indexes, while consuming only 10 bits per trie node. The false positive rates in SuRF for both point and range queries are tunable to satisfy different application needs. We evaluate SuRF in RocksDB as a replacement for its Bloom filters to reduce I/O by filtering requests before they access on-disk data structures. Our experiments on a 100 GB dataset show that replacing RocksDB's Bloom filters with SuRFs speeds up open-seek (without upper-bound) and closed-seek (with upper-bound) queries by up to 1.5× and 5× with a modest cost on the worst-case (all-missing) point query throughput due to slightly higher false positive rate. 
    more » « less
  3. We study local filters for the Lipschitz property of real-valued functions f : V → [0,r], where the Lipschitz property is defined with respect to an arbitrary undirected graph G = (V, E ). We give nearly optimal local Lipschitz filters both with respect to ℓ1-distance and ℓ0-distance. Previous work only considered unbounded- range functions over [n]d. Jha and Raskhodnikova (SICOMP ‘13) gave an algorithm for such functions with lookup complexity exponential in d, which Awasthi et al. (ACM Trans. Comput. Theory) showed was necessary in this setting. We demonstrate that important applications of local Lipschitz filters can be accomplished with filters for functions whose range is bounded in [0,r]. For functions f : [n]d → [0,r], we achieve running time (dr log n )O (log r ) for the ℓ1-respecting filter and dO(r) polylog n for the ℓ0-respecting filter, thus circumventing the lower bound. Our local filters provide a novel Lipschitz extension that can be implemented locally. Furthermore, we show that our algorithms are nearly optimal in terms of the dependence on r for the domain {0,1}d, an important special case of the domain [n]d. In addition, our lower bound resolves an open question of Awasthi et al., removing one of the conditions necessary for their lower bound for general range. We prove our lower bound via a reduction from distribution-free Lipschitz testing and a new technique for proving hardness for adaptive algorithms. Finally, we provide two applications of our local filters to real-valued functions, with no restrictions on the range. In the first application, we use them in conjunction with the Laplace mechanism for differential privacy and noisy binary search to provide mechanisms for privately releasing outputs of black-box functions, even in the presence of malicious clients. In particular, our differentially private mechanism for arbitrary real-valued functions runs in time 2polylog min(r,nd ) and, for honest clients, has accuracy comparable to the Laplace mechanism for Lipschitz functions, up to a factor of O (log min(r,nd )). In the second application, we use our local filters to obtain the first nontrivial tolerant tester for the Lipschitz property. Our tester works for functions of the form f : {0,1}d → ℝ, makes queries, and has tolerance ratio 2.01. Our applications demonstrate that local filters for bounded-range functions can be applied to construct efficient algorithms for arbitrary real-valued functions. 
    more » « less
  4. Bloom Filters are a space-efficient data structure used for the testing of membership in a set that errs only in the False Positive direction. However, the standard analysis that measures this False Positive rate provides a form of worst case bound that is both overly conservative for the majority of network applications that utilize Bloom Filters, and reduces accuracy by not taking into account the actual state (number of bits set) of the Bloom Filter after each arrival. In this paper, we more accurately characterize the False Positive dynamics of Bloom Filters as they are commonly used in networking applications. In particular, network applications often utilize a Bloom Filter that “recycles”: it repeatedly fills, and upon reaching a certain level of saturation, empties and fills again. In this context, it makes more sense to evaluate performance using the average False Positive rate instead of the worst case bound. We show how to efficiently compute the average False Positive rate of recycling Bloom Filter variants via renewal and Markov models. We apply our models to both the standard Bloom Filter and a “two-phase” variant, verify the accuracy of our model with simulations, and find that the previous analysis’ worst-case formulation leads to up to a 30% reduction in the efficiency of Bloom Filter when applied in network applications, while two-phase overhead diminishes as the needed False Positive rate is tightened. 
    more » « less
  5. We present Sparse Numerical Array-Based Range Filters (SNARF), a learned range filter that efficiently supports range queries for numerical data. SNARF creates a model of the data distribution to map the keys into a bit array which is stored in a compressed form. The model along with the compressed bit array which constitutes SNARF are used to answer membership queries. We evaluate SNARF on multiple synthetic and real-world datasets as a stand-alone filter and by integrating it into RocksDB. For range queries, SNARF provides up to 50x better false positive rate than state-of-the-art range filters, such as SuRF and Rosetta, with the same space usage. We also evaluate SNARF in RocksDB as a filter replacement for filtering requests before they access on-disk data structures. For RocksDB, SNARF can improve the execution time of the system up to 10x compared to SuRF and Rosetta for certain read-only workloads. 
    more » « less