skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 10:00 PM ET on Friday, December 8 until 2:00 AM ET on Saturday, December 9 due to maintenance. We apologize for the inconvenience.

Title: MGD: A Utility Metric for Private Data Publication
techniques to protect user data privacy. A common way for utilizing private data under DP is to take an input dataset and synthesize a new dataset that preserves features of the input dataset while satisfying DP. A trade-off always exists between the strength of privacy protection and the utility of the final output: stronger privacy protection requires larger randomness, so the outputs usually have a larger variance and can be far from optimal. In this paper, we summarize our proposed metric for the NIST “A Better Meter Stick for Differential Privacy” competition [26], MarGinal Difference (MGD), for measuring the utility of a synthesized dataset. Our metric is based on earth mover distance. We introduce new features in our metric so that it is not affected by some small random noise that is unavoidable in the DP context but focuses more on the significant difference. We show that our metric can reflect the range query error better compared with other existing metrics. We introduce an efficient computation method based on the min-cost flow to alleviate the high computation cost of the earth mover’s distance.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
NSysS 2021: Proceedings of the 8th International Conference on Networking, Systems and Security
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Differential privacy (DP) is a widely used notion for reasoning about privacy when publishing aggregate data. In this paper, we observe that certain DP mechanisms are amenable to a posteriori privacy analysis that exploits the fact that some outputs leak less information about the input database than others. To exploit this phenomenon, we introduce output differential privacy (ODP) and a new composition experiment, and leverage these new constructs to obtain significant privacy budget savings and improved privacy–utility tradeoffs under composition. All of this comes at no cost in terms of privacy; we do not weaken the privacy guarantee. To demonstrate the applicability of our a posteriori privacy analysis techniques, we analyze two well-known mechanisms: the Sparse Vector Technique and the Propose-Test-Release framework. We then show how our techniques can be used to save privacy budget in more general contexts: when a differentially private iterative mechanism terminates before its maximal number of iterations is reached, and when the output of a DP mechanism provides unsatisfactory utility. Examples of the former include iterative optimization algorithms, whereas examples of the latter include training a machine learning model with a large generalization error. Our techniques can be applied beyond the current paper to refine the analysis of existing DP mechanisms or guide the design of future mechanisms. 
    more » « less
  2. Label differential privacy is a relaxation of differential privacy for machine learning scenarios where the labels are the only sensitive information that needs to be protected in the training data. For example, imagine a survey from a participant in a university class about their vaccination status. Some attributes of the students are publicly available but their vaccination status is sensitive information and must remain private. Now if we want to train a model that predicts whether a student has received vaccination using only their public information, we can use label-DP. Recent works on label-DP use different ways of adding noise to the labels in order to obtain label-DP models. In this work, we present novel techniques for training models with label-DP guarantees by leveraging unsupervised learning and semi-supervised learning, enabling us to inject less noise while obtaining the same privacy, therefore achieving a better utility-privacy trade-off. We first introduce a framework that starts with an unsupervised classifier f0 and dataset D with noisy label set Y , reduces the noise in Y using f0 , and then trains a new model f using the less noisy dataset. Our noise reduction strategy uses the model f0 to remove the noisy labels that are incorrect with high probability. Then we use semi-supervised learning to train a model using the remaining labels. We instantiate this framework with multiple ways of obtaining the noisy labels and also the base classifier. As an alternative way to reduce the noise, we explore the effect of using unsupervised learning: we only add noise to a majority voting step for associating the learned clusters with a cluster label (as opposed to adding noise to individual labels); the reduced sensitivity enables us to add less noise. Our experiments show that these techniques can significantly outperform the prior works on label-DP. 
    more » « less
  3. Differential Privacy (DP) formalizes privacy in mathematical terms and provides a robust concept for privacy protection. DIfferentially Private Data Synthesis (DIPS) techniques produce and release synthetic individual-level data in the DP framework. One key challenge to develop DIPS methods is the preservation of the statistical utility of synthetic data, especially in high-dimensional settings. We propose a new DIPS approach, STatistical Election to Partition Sequentially (STEPS) that partitions data by attributes according to their importance ranks according to either a practical or statistical importance measure. STEPS aims to achieve better original information preservation for the attributes with higher importance ranks and produce thus more useful synthetic data overall. We present an algorithm to implement the STEPS procedure and employ the privacy budget composability to ensure the overall privacy cost is controlled at the pre-specified value. We apply the STEPS procedure to both simulated data and the 2000–2012 Current Population Survey youth voter data. The results suggest STEPS can better preserve the population-level information and the original information for some analyses compared to PrivBayes, a modified Uniform histogram approach, and the flat Laplace sanitizer. 
    more » « less
  4. Federated learning (FL) enables distributed agents to collaboratively learn a centralized model without sharing their raw data with each other. However, data locality does not provide sufficient privacy protection, and it is desirable to facilitate FL with rigorous differential privacy (DP) guarantee. Existing DP mechanisms would introduce random noise with magnitude proportional to the model size, which can be quite large in deep neural networks. In this paper, we propose a new FL framework with sparsification-amplified privacy. Our approach integrates random sparsification with gradient perturbation on each agent to amplify privacy guarantee. Since sparsification would increase the number of communication rounds required to achieve a certain target accuracy, which is unfavorable for DP guarantee, we further introduce acceleration techniques to help reduce the privacy cost. We rigorously analyze the convergence of our approach and utilize Renyi DP to tightly account the end-to-end DP guarantee. Extensive experiments on benchmark datasets validate that our approach outperforms previous differentially-private FL approaches in both privacy guarantee and communication efficiency.

    more » « less
  5. Precup, Doina ; Chandar, Sarath ; Pascanu, Razvan (Ed.)
    In this paper, we show that the process of continually learning new tasks and memorizing previous tasks introduces unknown privacy risks and challenges to bound the privacy loss. Based upon this, we introduce a formal definition of Lifelong DP, in which the participation of any data tuples in the training set of any tasks is protected, under a consistently bounded DP protection, given a growing stream of tasks. A consistently bounded DP means having only one fixed value of the DP privacy budget, regardless of the number of tasks. To preserve Lifelong DP, we propose a scalable and heterogeneous algorithm, called L2DP-ML with a streaming batch training, to efficiently train and continue releasing new versions of an L2M model, given the heterogeneity in terms of data sizes and the training order of tasks, without affecting DP protection of the private training set. An end-to-end theoretical analysis and thorough evaluations show that our mechanism is significantly better than baseline approaches in preserving Lifelong DP. The implementation of L2DP-ML is available at: 
    more » « less