skip to main content


Search for: All records

Award ID contains: 2048091

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. In this article, we present a detailed review of current practices and state-of-the-art methodologies in the field of differential privacy (DP), with a focus of advancing DP’s deployment in real-world applications. Key points and high-level contents of the article were originated from the discussions from “Differential Privacy (DP): Challenges Towards the Next Frontier,” a workshop held in July 2022 with experts from industry, academia, and the public sector seeking answers to broad questions pertaining to privacy and its implications in the design of industry-grade systems.This article aims to provide a reference point for the algorithmic and design decisions within the realm of privacy, highlighting important challenges and potential research directions. Covering a wide spectrum of topics, this article delves into the infrastructure needs for designing private systems, methods for achieving better privacy/utility trade-offs, performing privacy attacks and auditing, as well as communicating privacy with broader audiences and stakeholders. 
    more » « less
    Free, publicly-accessible full text available January 31, 2025
  2. Data valuation, a growing field that aims at quantifying the usefulness of individual data sources for training machine learning (ML) models, faces notable yet often overlooked privacy challenges. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods nowadays. We first emphasize the inherent privacy risks of KNN-Shapley, and demonstrate the significant technical challenges in adapting KNN-Shapley to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly, allowing for straightforward modifications to incorporate DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley. Moreover, even non-private TKNN-Shapley matches KNN-Shapley's performance in discerning data quality. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data. 
    more » « less
    Free, publicly-accessible full text available December 7, 2024
  3. The offline reinforcement learning (RL) problem is often motivated by the need to learn data-driven decision policies in financial, legal and healthcare applications. However, the learned policy could retain sensitive information of individuals in the training data (e.g., treatment and outcome of patients), thus susceptible to various privacy risks. We design offline RL algorithms with differential privacy guarantees which provably prevent such risks. These algorithms also enjoy strong instance-dependent learning bounds under both tabular and linear Markov Decision Process (MDP) settings. Our theory and simulation suggest that the privacy guarantee comes at (almost) no drop in utility comparing to the non-private counterpart for a medium-size dataset. 
    more » « less
    Free, publicly-accessible full text available December 7, 2024
  4. In the arena of privacy-preserving machine learning, differentially private stochastic gradient descent (DP-SGD) has outstripped the objective perturbation mechanism in popularity and interest. Though unrivaled in versatility, DP-SGD requires a non-trivial privacy overhead (for privately tuning the model’s hyperparameters) and a computational complexity which might be extravagant for simple models such as linear and logistic regression. This paper revamps the objective perturbation mechanism with tighter privacy analyses and new computational tools that boost it to perform competitively with DP-SGD on unconstrained convex generalized linear problems. 
    more » « less
    Free, publicly-accessible full text available December 7, 2024
  5. Evans, Robin J. ; Shpitser, Ilya (Ed.)
    Most existing approaches of differentially private (DP) machine learning focus on private training. Despite its many advantages, private training lacks the flexibility in adapting to incremental changes to the training dataset such as deletion requests from exercising GDPR’s right to be forgotten. We revisit a long-forgotten alternative, known as private prediction, and propose a new algorithm named Individual Kernelized Nearest Neighbor (Ind-KNN). Ind-KNN is easily updatable over dataset changes and it allows precise control of the Rényi DP at an individual user level — a user’s privacy loss is measured by the exact amount of her contribution to predictions; and a user is removed if her prescribed privacy budget runs out. Our results show that Ind-KNN consistently improves the accuracy over existing private prediction methods for a wide range of epsilon on four vision and language tasks. We also illustrate several cases under which Ind-KNN is preferable over private training with NoisySGD. 
    more » « less
    Free, publicly-accessible full text available July 31, 2024
  6. Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an ϵ-JDP algorithm with a regret of O˜(sqrt(SAH^2T) +S^2AH^3/ϵ) which matches the information-theoretic lower bound of non-private learning for all choices of ϵ>S^1.5A^0.5H^2/sqrt(T). In the above, S, A denote the number of states and actions, H denotes the planning horizon, and T is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as T→∞. Our techniques -- which could be of independent interest -- include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case. 
    more » « less
  7. The ''Propose-Test-Release'' (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e. those that add less noise when the input dataset is nice. We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g. to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from ''Private Aggregation of Teacher Ensembles (PATE)'' -- privately releasing the entire model with a delicate data-dependent analysis. 
    more » « less
  8. Protecting the privacy of user data is crucial for text generation models, which can leak sensitive information during generation. Differentially private (DP) learning methods provide guarantees against identifying the existence of a training sample from model outputs. PATE is a recent DP learning algorithm that achieves high utility with strong privacy protection on training samples. However, text generation models output tokens sequentially in a large output space; the classic PATE algorithm is not customized for this setting. Furthermore, PATE works well to protect sample-level privacy, but is not designed to protect phrases in samples. In this paper, we propose SeqPATE, an extension of PATE to text generation that protects the privacy of individual training samples and sensitive phrases in training data. To adapt PATE to text generation, we generate pseudo-contexts and reduce the sequence generation problem to a next-word prediction problem. To handle the large output space, we propose a candidate filtering strategy to dynamically reduce the output space, and refine the teacher aggregation of PATE to avoid low agreement due to voting for a large number of candidates. To further reduce privacy losses, we use knowledge distillation to reduce the number of teacher queries. The experiments verify the effectiveness of SeqPATE in protecting both training samples and sensitive phrases. 
    more » « less
  9. Linear sketches have been widely adopted to process fast data streams, and they can be used to accurately answer frequency estimation, approximate top K items, and summarize data distributions. When data are sensitive, it is desirable to provide privacy guarantees for linear sketches to preserve private information while delivering useful results with theoretical bounds. We show that linear sketches can ensure privacy and maintain their unique properties with a small amount of noise added at initialization. From the differentially private linear sketches, we showcase that the state-of-the-art quantile sketch in the turnstile model can also be private and maintain high performance. Experiments further demonstrate that our proposed differentially private sketches are quantitatively and qualitatively similar to noise-free sketches with high utilization on synthetic and real datasets. 
    more » « less
  10. Large language models are shown to memorize privacy information such as social security numbers in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter these privacy data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality. 
    more » « less