Title: SeqPATE: Differentially Private Text Generation via Knowledge Distillation
Protecting the privacy of user data is crucial for text generation models, which can leak sensitive information during generation. Differentially private (DP) learning methods provide guarantees against inferring the existence of a training sample from model outputs. PATE is a recent DP learning algorithm that achieves high utility with strong privacy protection on training samples. However, text generation models output tokens sequentially in a large output space, and the classic PATE algorithm is not customized for this setting. Furthermore, PATE works well for protecting sample-level privacy, but is not designed to protect phrases within samples. In this paper, we propose SeqPATE, an extension of PATE to text generation that protects the privacy of individual training samples and sensitive phrases in training data. To adapt PATE to text generation, we generate pseudo-contexts and reduce the sequence generation problem to a next-word prediction problem. To handle the large output space, we propose a candidate filtering strategy that dynamically reduces the output space, and we refine the teacher aggregation of PATE to avoid the low agreement caused by voting over a large number of candidates. To further reduce privacy losses, we use knowledge distillation to reduce the number of teacher queries. Experiments verify the effectiveness of SeqPATE in protecting both training samples and sensitive phrases.
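As a concrete illustration of the adapted aggregation step, the minimal Python sketch below applies candidate filtering before PATE-style noisy voting over next-token predictions. It is not the authors' implementation: the filtering rule, the constants TOP_K and NOISE_SCALE, and all function names are illustrative assumptions.

```python
# Hypothetical sketch of SeqPATE-style next-token aggregation:
# filter the large vocabulary down to a few candidates, then run
# PATE-style noisy voting among teachers over those candidates only.
import numpy as np

TOP_K = 20          # assumed size of the filtered candidate set
NOISE_SCALE = 1.0   # Laplace noise scale; larger means stronger privacy

def filter_candidates(teacher_probs: np.ndarray, k: int = TOP_K) -> np.ndarray:
    """Dynamically shrink the output space: keep the k tokens with the
    highest mean probability across teachers (one simple filtering choice)."""
    mean_probs = teacher_probs.mean(axis=0)        # shape: (vocab_size,)
    return np.argsort(mean_probs)[-k:]             # candidate token ids

def aggregate_teachers(teacher_probs: np.ndarray,
                       rng: np.random.Generator) -> int:
    """Each teacher votes for its best token among the filtered candidates;
    noisy vote counts yield a differentially private next-token label."""
    candidates = filter_candidates(teacher_probs)
    votes = np.zeros(len(candidates))
    for probs in teacher_probs:                    # one row per teacher
        votes[np.argmax(probs[candidates])] += 1   # vote within candidates
    noisy_votes = votes + rng.laplace(0.0, NOISE_SCALE, size=votes.shape)
    return int(candidates[np.argmax(noisy_votes)]) # winning token id

# Toy usage: 10 teachers over a 5,000-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 5_000))
teacher_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(aggregate_teachers(teacher_probs, rng))
```

Voting over 20 candidates instead of the full vocabulary keeps teacher agreement high, which is the point of the refined aggregation described above.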
Award ID(s):
2048091
PAR ID:
10397827
Author(s) / Creator(s):
Date Published:
Journal Name:
Advances in Neural Information Processing Systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Private Aggregation of Teacher Ensembles (PATE) is a machine learning framework that enables the creation of private models by combining multiple teacher models with a student model. The student learns to predict an output based on the teachers' votes, and the resulting model satisfies differential privacy. PATE has been shown to be effective for creating private models in semi-supervised settings or when protecting data labels is a priority. This paper explores whether the use of PATE can result in unfairness, and demonstrates that it can lead to accuracy disparities among groups of individuals. The paper also analyzes the algorithmic and data properties that contribute to these disproportionate impacts, explains why these aspects affect different groups disproportionately, and offers recommendations for mitigating these effects.
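For reference, here is a minimal sketch of the vanilla PATE label aggregation this abstract builds on, assuming a small label space; the noise scale and names are illustrative.

```python
# Hypothetical sketch of vanilla PATE label aggregation (noisy argmax).
import numpy as np

def pate_label(teacher_preds: np.ndarray, num_classes: int,
               noise_scale: float, rng: np.random.Generator) -> int:
    """Aggregate hard teacher predictions into one noisy-argmax label;
    the Laplace noise on the vote histogram is what gives the student's
    training labels a differential-privacy guarantee."""
    votes = np.bincount(teacher_preds, minlength=num_classes).astype(float)
    votes += rng.laplace(0.0, noise_scale, size=num_classes)
    return int(np.argmax(votes))

# Toy usage: 250 teachers voting over 10 classes.
rng = np.random.default_rng(1)
teacher_preds = rng.integers(0, 10, size=250)
print(pate_label(teacher_preds, num_classes=10, noise_scale=2.0, rng=rng))
```

Intuitively, the disparity risk studied above arises because the noise is most likely to flip labels for inputs on which the teachers already disagree.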
  2. The ''Propose-Test-Release'' (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e., those that add less noise when the input dataset is nice. We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, making it applicable beyond standard noise-adding mechanisms, e.g., to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from ''Private Aggregation of Teacher Ensembles (PATE)'': privately releasing the entire model with a delicate data-dependent analysis.
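For orientation, a minimal sketch of the classic PTR recipe for a scalar query follows; the generalized version above tests data-dependent privacy losses instead of local sensitivity. The distance function and constants are illustrative assumptions.

```python
# Hypothetical sketch of the classic Propose-Test-Release recipe for a
# scalar query f.  `distance_to_unstable` must report how many records
# would have to change before f's local sensitivity exceeds the
# proposed bound (query-specific; supplied by the analyst).
import numpy as np

def ptr_release(data, f, distance_to_unstable, proposed_bound,
                eps, delta, rng):
    """Propose a sensitivity bound, privately test that the dataset is
    far from any dataset violating it, then release or abort."""
    d = distance_to_unstable(data, proposed_bound)
    if d + rng.laplace(0.0, 1.0 / eps) <= np.log(1.0 / delta) / eps:
        return None                                 # abort: test failed
    return f(data) + rng.laplace(0.0, proposed_bound / eps)

# Toy usage with a placeholder distance function (illustration only).
rng = np.random.default_rng(2)
data = rng.normal(size=1_000)
dist = lambda x, b: 50   # pretend 50 records separate us from instability
print(ptr_release(data, np.median, dist, proposed_bound=0.1,
                  eps=1.0, delta=1e-6, rng=rng))
```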
  3. We provide an end-to-end Rényi-DP-based framework for differentially private top-k selection. Unlike previous approaches, which require a data-independent choice of k, we propose to privately release a data-dependent choice of k such that the gap between the k-th and the (k+1)-st "quality" is large. This is achieved by an extension of the Report-Noisy-Max algorithm with more concentrated Gaussian noise. Not only does this eliminate one hyperparameter, the adaptive choice of k also certifies the stability of the top-k indices in the unordered set, so we can release them using a combination of the propose-test-release (PTR) framework and the Distance-to-Stability mechanism. We show that our construction improves the privacy-utility trade-offs compared to previous top-k selection algorithms, both theoretically and empirically. Additionally, we apply our algorithm to "Private Aggregation of Teacher Ensembles (PATE)" in multi-label classification tasks with a large number of labels and show that it leads to significant performance gains.
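The sketch below renders the two ingredients in simplified form: Gaussian-noise top-k selection and a gap-based, data-dependent choice of k. It is a toy version of the idea, not the paper's calibrated mechanism.

```python
# Hypothetical sketch: Gaussian Report-Noisy-Max style top-k selection
# plus a simplified gap-based private choice of k.
import numpy as np

def noisy_top_k(quality: np.ndarray, k: int, sigma: float,
                rng: np.random.Generator) -> np.ndarray:
    """Add Gaussian noise to each quality score and return the indices
    of the k largest noisy scores (as an unordered set)."""
    noisy = quality + rng.normal(0.0, sigma, size=quality.shape)
    return np.argsort(noisy)[-k:]

def choose_k_by_gap(quality: np.ndarray, sigma: float,
                    rng: np.random.Generator, k_max: int = 50) -> int:
    """Privately pick the k where the gap between the k-th and the
    (k+1)-st sorted qualities is largest (a toy version of the idea)."""
    s = np.sort(quality)[::-1][:k_max + 1]
    gaps = s[:-1] - s[1:] + rng.normal(0.0, sigma, size=k_max)
    return int(np.argmax(gaps)) + 1

# Toy usage: pick k adaptively, then release the noisy top-k indices.
rng = np.random.default_rng(3)
quality = rng.exponential(scale=5.0, size=1_000)   # e.g., label counts
k = choose_k_by_gap(quality, sigma=1.0, rng=rng)
print(k, noisy_top_k(quality, k=k, sigma=1.0, rng=rng))
```

A large gap at position k means the identity of the top-k set is stable under small data changes, which is what lets the PTR and Distance-to-Stability machinery release it cheaply.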
  4. Machine learning (ML) models are used for in-situ monitoring in additive manufacturing (AM) for defect detection. However, sensitive information stored in ML models, such as part designs, is at risk of data leakage due to unauthorized access. To address this, differential privacy (DP) introduces noise into ML, outperforming cryptography, which is slow, and data anonymization, which does not guarantee privacy. While DP enhances privacy, it reduces the precision of defect detection. This paper proposes combining DP with Hyperdimensional Computing (HDC), a brain-inspired model that memorizes training-sample information in a large hyperspace, to optimize real-time monitoring in AM while protecting privacy. Adding DP noise to the HDC model protects sensitive information without compromising defect-detection accuracy. Our studies demonstrate the effectiveness of this approach in monitoring anomalies, such as overhangs, using high-speed melt pool data analysis. With a privacy budget set at 1, our model achieved an F-score of 94.30%, surpassing traditional models such as ResNet50, DenseNet201, EfficientNet B2, and AlexNet, whose performance reaches at most 66%. Thus, the intersection of DP and HDC promises accurate defect detection and protection of sensitive information in AM. The proposed method can also be extended to other AM processes, such as fused filament fabrication.
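A minimal sketch of the general idea follows: add Gaussian noise to hyperdimensional class prototypes. The encoder, the clipping step, and the noise calibration are illustrative assumptions, not the paper's pipeline.

```python
# Hypothetical sketch of a DP hyperdimensional classifier: encode
# samples into bipolar hypervectors, sum them per class, then add
# Gaussian noise to the class prototypes before release.
import numpy as np

D = 10_000   # hyperspace dimensionality (a typical HDC choice)

def encode(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Random-projection encoder: map a feature vector into a bipolar
    hypervector (one common HDC encoding)."""
    return np.sign(proj @ x)

def train_dp_hdc(X, y, num_classes, sigma, rng):
    """Sum encoded samples per class, normalizing each sample's
    contribution to unit norm (bounded sensitivity), then add
    Gaussian noise so the released prototypes satisfy DP."""
    proj = rng.normal(size=(D, X.shape[1]))
    protos = np.zeros((num_classes, D))
    for x, label in zip(X, y):
        h = encode(x, proj)
        protos[label] += h / np.linalg.norm(h)   # each sample adds norm 1
    protos += rng.normal(0.0, sigma, size=protos.shape)  # DP noise
    return proj, protos

def predict(x, proj, protos):
    """Classify by similarity to the nearest noisy class prototype."""
    return int(np.argmax(protos @ encode(x, proj)))

# Toy usage on random data with two classes.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 32)); y = rng.integers(0, 2, size=200)
proj, protos = train_dp_hdc(X, y, num_classes=2, sigma=0.5, rng=rng)
print(predict(X[0], proj, protos))
```

Because each prototype aggregates many samples in a very high-dimensional space, a modest amount of noise per coordinate perturbs the decision boundary only slightly, which is consistent with the accuracy the abstract reports.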
  5. Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named AUGPE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that AUGPE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access to LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
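A minimal sketch of one Private Evolution iteration for text follows, assuming placeholder hooks llm_vary (an API call producing text variants) and embed (text to vector); both hooks and all parameters are stand-ins, not a real API.

```python
# Hypothetical sketch of one PE iteration: private samples vote for
# their nearest synthetic candidate via a DP nearest-neighbor
# histogram, and the noisy votes decide which candidates survive.
import numpy as np

def pe_iteration(population, private_emb, llm_vary, embed, sigma, rng):
    """One evolution step over a synthetic text population."""
    candidates = [v for s in population for v in llm_vary(s)]
    cand_emb = np.stack([embed(c) for c in candidates])
    # Nearest-neighbor histogram (sensitivity 1 per private sample),
    # then Gaussian noise for differential privacy.
    dists = np.linalg.norm(private_emb[:, None, :] - cand_emb[None], axis=-1)
    hist = np.bincount(dists.argmin(axis=1),
                       minlength=len(candidates)).astype(float)
    hist += rng.normal(0.0, sigma, size=hist.shape)
    keep = np.argsort(hist)[-len(population):]      # resample survivors
    return [candidates[i] for i in keep]

# Toy hooks for illustration only (a real run would call an LLM API).
rng = np.random.default_rng(5)
embed = lambda t: rng.normal(size=8)                # placeholder embedding
llm_vary = lambda s: [s + " (variant A)", s + " (variant B)"]
pop = ["a seed sentence", "another seed"]
priv = np.stack([embed(None) for _ in range(3)])    # stand-in private data
print(pe_iteration(pop, priv, llm_vary, embed, sigma=1.0, rng=rng))
```

Only the noisy histogram touches the private data, which is why the method needs no model training and works with API-only access to the LLM.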