Title: C3PA: An Open Dataset of Expert-Annotated and Regulation-Aware Privacy Policies to Enable Scalable Regulatory Compliance Audits
The development of tools and techniques to analyze and extract organizations’ data habits from privacy policies is critical for scalable regulatory compliance audits. Unfortunately, these tools are becoming increasingly limited in their ability to identify compliance issues and fixes. After all, most were developed using regulation-agnostic datasets of annotated privacy policies obtained from a time before the introduction of landmark privacy regulations such as EU’s GDPR and California’s CCPA. In this paper, we describe the first open regulation-aware dataset of expert-annotated privacy policies, C3PA (CCPA Privacy Policy Provision Annotations), aimed to address this challenge. C3PA contains over 48K expert-labeled privacy policy text segments associated with responses to CCPA-specific disclosure mandates from 411 unique organizations. We demonstrate that the C3PA dataset is uniquely suited for aiding automated audits of compliance with CCPA-related disclosure mandates.
Award ID(s):
2338377 2335659 1953983
PAR ID:
10612962
Publisher / Repository:
Association for Computational Linguistics
Page Range / eLocation ID:
3710 to 3722
Sponsoring Org:
National Science Foundation
More Like this
  1. Data privacy policy requirements are a quickly evolving part of the data management domain. Healthcare (e.g., HIPAA), financial (e.g., GLBA), and general laws such as GDPR or CCPA impose controls on how personal data should be managed. Relational databases do not offer built-in features for complying with such laws. As a result, many organizations implement ad-hoc solutions or use third-party tools to ensure compliance with privacy policies. However, external compliance frameworks can conflict with the internal activity in a database (e.g., trigger side-effects or aborted transactions). In our prior work, we introduced a framework that integrates data retention and data purging compliance into the database itself, requiring only the support for triggers and encryption, which are already available in any mainstream database engine. In this demonstration paper, we introduce DBCompliant – a tool that demonstrates how our approach can seamlessly integrate comprehensive policy compliance (defined via SQL queries). Although we use PostgreSQL as our back-end, DBCompliant could be adapted to any other relational database. Finally, our approach imposes low (less than 5%) user query overhead.
  2. Regulatory documents are complex and lengthy, making full compliance a challenging task for businesses. Similarly, privacy policies provided by vendors frequently fall short of the necessary legal standards due to insufficient detail. To address these issues, we propose a solution that leverages a Large Language Model (LLM) in combination with Semantic Web technology. This approach aims to clarify regulatory requirements and ensure that organizations’ privacy policies align with the relevant legal frameworks, ultimately simplifying the compliance process, reducing privacy risks, and improving efficiency. In this paper, we introduce a novel tool, the Privacy Policy Compliance Verification Knowledge Graph, referred to as PrivComp-KG. PrivComp-KG is designed to efficiently store and retrieve comprehensive information related to privacy policies, regulatory frameworks, and domain-specific legal knowledge. By utilizing LLM and Retrieval Augmented Generation (RAG), we can accurately identify relevant sections in privacy policies and map them to the corresponding regulatory rules. Our LLM-based retrieval system has demonstrated a high level of accuracy, achieving a correctness score of 0.9, outperforming other models in privacy policy analysis. The extracted information from individual privacy policies is then integrated into the PrivComp-KG. By combining this data with contextual domain knowledge and regulatory rules, PrivComp-KG can be queried to assess each vendor’s compliance with applicable regulations. We demonstrate the practical utility of PrivComp-KG by verifying the compliance of privacy policies across various organizations. This approach not only helps policy writers better understand legal requirements but also enables them to identify gaps in existing policies and update them in response to evolving regulations. 
  3. Privacy policies are often lengthy and complex legal documents, and are difficult for many people to read and comprehend. Recent research efforts have explored automated assistants that process the language in policies and answer people’s privacy questions. This study documents the importance of two different types of reasoning necessary to generate accurate answers to people’s privacy questions. The first is the need to support taxonomic reasoning about related terms commonly found in privacy policies. The second is the need to reason about regulatory disclosure requirements, given the prevalence of silence in privacy policy texts. Specifically, we report on a study involving the collection of 749 sets of expert annotations to answer privacy questions in the context of 210 different policy/question pairs. The study highlights the importance of taxonomic reasoning and of reasoning about regulatory disclosure requirements when it comes to accurately answering everyday privacy questions. Next we explore to what extent current generative AI tools are able to reliably handle this type of reasoning. Our results suggest that in their current form and in the absence of additional help, current models cannot reliably support the type of reasoning about regulatory disclosure requirements necessary to accurately answer privacy questions. We proceed to introduce and evaluate different approaches to improving their performance. Through this work, we aim to provide a richer understanding of the capabilities automated systems need to have to provide accurate answers to everyday privacy questions and, in the process, outline paths for adapting AI models for this purpose. 
  4. The General Data Protection Regulation (GDPR) and other recent privacy laws require organizations to post their privacy policies, and place specific expectations on organizations' privacy practices. Privacy policies take the form of documents written in natural language, and one of the expectations placed upon them is that they remain up to date. To investigate legal compliance with this recency requirement at a large scale, we create a novel pipeline that includes crawling, regex-based extraction, candidate date classification and date object creation to extract updated and effective dates from privacy policies written in English. We then analyze patterns in policy dates using four web crawls and find that only about 40% of privacy policies online contain a date, thereby making it difficult to assess their regulatory compliance. We also find that updates in privacy policies are temporally concentrated around the passage of laws regulating digital privacy (such as the GDPR), and that more popular domains are more likely to have policy dates as well as more likely to update their policies regularly.
  5. Most organizations rely on relational database(s) for their day-to-day business functions. Data management policies fall under the umbrella of IT Operations, dictated by a combination of internal organizational policies and government regulations. Many privacy laws (such as Europe’s General Data Protection Regulation and California’s Consumer Privacy Act) establish policy requirements for organizations, requiring the preservation or purging of certain customer data across their systems. Organizational disaster recovery policies also mandate backup policies to prevent data loss. Thus, the data in these databases are subject to a range of policies, including data retention and data purging rules, which may come into conflict with the need for regular backups. In this paper, we discuss the trade-offs between different compliance mechanisms to maintain IT Operational policies. We consider the practical availability of data in an active relational database and in a backup, including: 1) supporting data privacy rules with respect to preserving or purging customer data, and 2) the application performance impact caused by the database policy implementation. We first discuss the state of data privacy compliance in database systems. We then look at enforcement of common IT operational policies with regard to database backups. We consider different implementations used to enforce privacy rule compliance, combined with a detailed discussion of how these approaches impact the performance of a database at different phases. We demonstrate that naive compliance implementations will incur a prohibitively high cost and impose onerous restrictions on the backup and restore process, but will not affect daily user query transaction cost. However, we also show that other solutions can achieve far lower backup and restore costs at the price of a small (<5%) overhead to non-SELECT queries.