Applying large language models to sanitize self-disclosure in user-generated content

Alfieri, Costanza; Scoccia, Gian Luca; Ganesh, Surya; Sadeh, Norman

doi:10.1016/j.asoc.2025.113311

This content will become publicly available on June 1, 2026

The rise of e-commerce and social networking platforms has led to an increase in the disclosure of personal health information within user-generated content. This study investigates the application of large language models (LLMs) to detect and sanitize sensitive health data shared by users across platforms such as Amazon, patient.info, and Facebook. We propose a methodology that leverages LLMs to evaluate both the sensitivity of disclosed information and the platform-specific semantics of the content. Through prompt engineering, our method identifies sensitive information and rephrases it to minimize disclosure while preserving content similarity. ChatGPT serves as the LLM in this study due to its versatility. Empirical results suggest that ChatGPT can reliably assign sensitivity scores to user-generated text and generate sanitized versions that effectively preserve the original meaning.
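
The two steps named in the abstract (assign a sensitivity score, then rephrase to reduce disclosure) could be wired together roughly as follows. This is a minimal sketch, not the authors' implementation: the prompt wording, the 0 to 10 scale, the `threshold` parameter, and the function names `score_sensitivity` and `sanitize` are all assumptions made for illustration, and `llm` stands for any caller-supplied function that sends a prompt to a model and returns its text reply.

```python
import re
from typing import Callable

# Hypothetical prompt templates; the prompts used in the paper are not
# given in this record.
SCORE_PROMPT = (
    "Rate how sensitive the personal health information in the following "
    "{platform} post is, from 0 (not sensitive) to 10 (highly sensitive). "
    "Reply with the number only.\n\nPost: {text}"
)
REWRITE_PROMPT = (
    "Rewrite the following {platform} post so that it discloses less "
    "personal health information while keeping its meaning as close as "
    "possible to the original.\n\nPost: {text}"
)


def score_sensitivity(text: str, platform: str, llm: Callable[[str], str]) -> int:
    """Ask the model for a sensitivity score and clamp it to 0..10 (assumed scale)."""
    reply = llm(SCORE_PROMPT.format(platform=platform, text=text))
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no numeric score in model reply: {reply!r}")
    return min(10, max(0, int(match.group())))


def sanitize(text: str, platform: str, llm: Callable[[str], str],
             threshold: int = 5) -> str:
    """Return the text unchanged if it scores below the (assumed) threshold,
    otherwise return the model's rephrased version."""
    if score_sensitivity(text, platform, llm) < threshold:
        return text
    return llm(REWRITE_PROMPT.format(platform=platform, text=text)).strip()
```

Passing the platform name into both prompts is one simple way to reflect the platform-specific semantics the abstract mentions; keeping `llm` as a plain callable means the sketch does not depend on any particular vendor API.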

Award ID(s): 1914486

PAR ID: 10598513

Author(s) / Creator(s): Alfieri, Costanza; Scoccia, Gian Luca; Ganesh, Surya; Sadeh, Norman

Publisher / Repository: Applied Soft Computing, Elsevier

Date Published: 2025-06-01

Journal Name: Applied Soft Computing

ISSN: 1568-4946

Page Range / eLocation ID: 113311

Subject(s) / Keyword(s): LLMs; Sanitizing; Self-disclosure; Sensitivity detection; Privacy; Privacy enhancing technologies; ChatGPT

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Journal Article:
https://doi.org/10.1016/j.asoc.2025.113311
