skip to main content


Title: Locally Differentially Private Frequency Estimation with Consistency
Local Differential Privacy (LDP) protects user privacy from the data collector. LDP protocols have been increasingly deployed in the industry. A basic building block is frequency oracle (FO) protocols, which estimate frequencies of values. While several FO protocols have been proposed, the design goal does not lead to optimal results for answering many queries. In this paper, we show that adding post-processing steps to FO protocols by exploiting the knowledge that all individual frequencies should be non-negative and they sum up to one can lead to significantly better accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We consider 10 different methods that exploit this knowledge differently. We establish theoretical relationships between some of them and conducted extensive experimental evaluations to understand which methods should be used for different query tasks.  more » « less
Award ID(s):
1640374
NSF-PAR ID:
10194802
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
NDSS'20: Proceedings of the NDSS Symposium
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The notion of Local Differential Privacy (LDP) enables users to respond to sensitive questions while preserving their privacy. The basic LDP frequent oracle (FO) protocol enables an aggregator to estimate the frequency of any value. But when each user has a set of values, one needs an additional padding and sampling step to find the frequent values and estimate their frequencies. In this paper, we formally define such padding and sample based frequency oracles (PSFO). We further identify the privacy amplification property in PSFO. As a result, we propose SVIM, a protocol for finding frequent items in the set-valued LDP setting. Experiments show that under the same privacy guarantee and computational cost, SVIM significantly improves over existing methods. With SVIM to find frequent items, we propose SVSM to effectively find frequent itemsets, which to our knowledge has not been done before in the LDP setting. 
    more » « less
  2. Protocols satisfying Local Differential Privacy (LDP) enable parties to collect aggregate information about a population while protecting each user’s privacy, without relying on a trusted third party. LDP protocols (such as Google’s RAPPOR) have been deployed in real-world scenarios. In these protocols, a user encodes his private information and perturbs the encoded value locally before sending it to an aggregator, who combines values that users contribute to infer statistics about the population. In this paper, we introduce a framework that generalizes several LDP protocols proposed in the literature. Our framework yields a simple and fast aggregation algorithm, whose accuracy can be precisely analyzed. Our in-depth analysis enables us to choose optimal parameters, resulting in two new protocols (i.e., Optimized Unary Encoding and Optimized Local Hashing) that provide better utility than protocols previously proposed. We present precise conditions for when each proposed protocol should be used, and perform experiments that demonstrate the advantage of our proposed protocols. 
    more » « less
  3. Top-k frequent items detection is a fundamental task in data stream mining. Many promising solutions are proposed to improve memory efficiency while still maintaining high accuracy for detecting the Top-k items. Despite the memory efficiency concern, the users could suffer from privacy loss if participating in the task without proper protection, since their contributed local data streams may continually leak sensitive individual information. However, most existing works solely focus on addressing either the memory-efficiency problem or the privacy concerns but seldom jointly, which cannot achieve a satisfactory tradeoff between memory efficiency, privacy protection, and detection accuracy.

    In this paper, we present a novel framework HG-LDP to achieve accurate Top-k item detection at bounded memory expense, while providing rigorous local differential privacy (LDP) protection. Specifically, we identify two key challenges naturally arising in the task, which reveal that directly applying existing LDP techniques will lead to an inferior accuracy-privacy-memory efficiency tradeoff. Therefore, we instantiate three advanced schemes under the framework by designing novel LDP randomization methods, which address the hurdles caused by the large size of the item domain and by the limited space of the memory. We conduct comprehensive experiments on both synthetic and real-world datasets to show that the proposed advanced schemes achieve a superior accuracy-privacy-memory efficiency tradeoff, saving 2300× memory over baseline methods when the item domain size is 41,270. Our code is anonymously open-sourced via the link.

     
    more » « less
  4. Data sets and statistics about groups of individuals are increasingly collected and released, feeding many optimization and learning algorithms. In many cases, the released data contain sensitive information whose privacy is strictly regulated. For example, in the U.S., the census data is regulated under Title 13, which requires that no individual be identified from any data released by the Census Bureau. In Europe, data release is regulated according to the General Data Protection Regulation, which addresses the control and transfer of personal data. Differential privacy has emerged as the de-facto standard to protect data privacy. In a nutshell, differentially private algorithms protect an individual’s data by injecting random noise into the output of a computation that involves such data. While this process ensures privacy, it also impacts the quality of data analysis, and, when private data sets are used as inputs to complex machine learning or optimization tasks, they may produce results that are fundamentally different from those obtained on the original data and even rise unintended bias and fairness concerns. In this talk, I will first focus on the challenge of releasing privacy-preserving data sets for complex data analysis tasks. I will introduce the notion of Constrained-based Differential Privacy (C-DP), which allows casting the data release problem to an optimization problem whose goal is to preserve the salient features of the original data. I will review several applications of C-DP in the context of very large hierarchical census data, data streams, energy systems, and in the design of federated data-sharing protocols. Next, I will discuss how errors induced by differential privacy algorithms may propagate within a decision problem causing biases and fairness issues. This is particularly important as privacy-preserving data is often used for critical decision processes, including the allocation of funds and benefits to states and jurisdictions, which ideally should be fair and unbiased. Finally, I will conclude with a roadmap to future work and some open questions. 
    more » « less
  5. Groundwater discharge though streambeds is often focused toward discrete zones, indicating that preliminary reconnaissance may be useful for capturing the full spectrum of groundwater discharge rates using point-scale quantitative methods. However, many direct-contact reconnaissance techniques can be time-consuming, and remote sensing (e.g., thermal infrared) typically does not penetrate the water column to locate submerged seepages. In this study, we tested whether dozens of groundwater discharge measurements made at “uninformed” (i.e., selected without knowledge on high-resolution temperature variations at the streambed) point locations along a reach would yield significantly different Darcy-based groundwater discharge rates when compared with “informed” measurements, focused at streambed thermal anomalies that were identified a priori using fiber-optic distributed temperature sensing (FO-DTS). A non-parametric U-test showed a significant difference between median discharge rates for uninformed (0.05 m·day−1; n = 30) and informed (0.17 m·day−1; n = 20) measurement locations. Mean values followed a similar pattern (0.12 versus 0.27 m·day−1), and frequency distributions for uninformed and informed measurements were also significantly different based on a Kolmogorov–Smirnov test. Results suggest that even using a quick “snapshot-in-time” field analysis of FO-DTS data can be useful in streambeds with groundwater discharge rates <0.2 m·day−1, a lower threshold than proposed in a previous study. Collectively, study results highlight that FO-DTS is a powerful technique for identifying higher-discharge zones in streambeds, but the pros and cons of informed and uninformed sampling depend in part on groundwater/surface water exchange study goals. For example, studies focused on measuring representative groundwater and solute fluxes may be biased if high-discharge locations are preferentially sampled. However, identification of high-discharge locations may complement more randomized sampling plans and lead to improvements in interpolating streambed fluxes and upscaling point measurements to the stream reach scale. 
    more » « less