skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Automated Detection and Analysis of Data Practices Using A Real-World Corpus
Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users.  more » « less
Award ID(s):
2105736 2237574
PAR ID:
10599512
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Page Range / eLocation ID:
4567 to 4574
Format(s):
Medium: X
Location:
Bangkok, Thailand and virtual meeting
Sponsoring Org:
National Science Foundation
More Like this
  1. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks. 
    more » « less
  2. The European General Data Protection Regulation (GDPR) mandates a data controller (e.g., an app developer) to provide all information specified in Articles (Arts.) 13 and 14 to data subjects (e.g., app users) regarding how their data are being processed and what are their rights. While some studies have started to detect the fulfillment of GDPR requirements in a privacy policy, their exploration only focused on a subset of mandatory GDPR requirements. In this paper, our goal is to explore the state of GDPR-completeness violations in mobile apps' privacy policies. To achieve our goal, we design the PolicyChecker framework by taking a rule and semantic role based approach. PolicyChecker automatically detects completeness violations in privacy policies based not only on all mandatory GDPR requirements but also on all if-applicable GDPR requirements that will become mandatory under specific conditions. Using PolicyChecker, we conduct the first large-scale GDPR-completeness violation study on 205,973 privacy policies of Android apps in the UK Google Play store. PolicyChecker identified 163,068 (79.2%) privacy policies containing data collection statements; therefore, such policies are regulated by GDPR requirements. However, the majority (99.3%) of them failed to achieve the GDPR-completeness with at least one unsatisfied requirement; 98.1% of them had at least one unsatisfied mandatory requirement, while 73.0% of them had at least one unsatisfied if-applicable requirement logic chain. We conjecture that controllers' lack of understanding of some GDPR requirements and their poor practices in composing a privacy policy can be the potential major causes behind the GDPR-completeness violations. We further discuss recommendations for app developers to improve the completeness of their apps' privacy policies to provide a more transparent personal data processing environment to users. 
    more » « less
  3. Companies' privacy policies and their contents are being analyzed for many reasons, including to assess the readability, usability, and utility of privacy policies; to extract and analyze data practices of apps and websites; to assess compliance of companies with relevant laws and their own privacy policies, and to develop tools and machine learning models to summarize and read policies. Despite the importance and interest in studying privacy policies from researchers, regulators, and privacy activists, few best practices or approaches have emerged and infrastructure and tool support is scarce or scattered. In order to provide insight into how researchers study privacy policies and the challenges they face when doing so, we conducted 26 interviews with researchers from various disciplines who have conducted research on privacy policies. We provide insights on a range of challenges around policy selection, policy retrieval, and policy content analysis, as well as multiple overarching challenges researchers experienced across the research process. Based on our findings, we discuss opportunities to better facilitate privacy policy research, including research directions for methodologically advancing privacy policy analysis, potential structural changes around privacy policies, and avenues for fostering an interdisciplinary research community and maturing the field. 
    more » « less
  4. Furnell, Steven (Ed.)
    A huge amount of personal and sensitive data is shared on Facebook, which makes it a prime target for attackers. Adversaries can exploit third-party applications connected to a user’s Facebook profile (i.e., Facebook apps) to gain access to this personal information. Users’ lack of knowledge and the varying privacy policies of these apps make them further vulnerable to information leakage. However, little has been done to identify mismatches between users’ perceptions and the privacy policies of Facebook apps. We address this challenge in our work. We conducted a lab study with 31 participants, where we received data on how they share information in Facebook, their Facebook-related security and privacy practices, and their perceptions on the privacy aspects of 65 frequently-used Facebook apps in terms of data collection, sharing, and deletion. We then compared participants’ perceptions with the privacy policy of each reported app. Participants also reported their expectations about the types of information that should not be collected or shared by any Facebook app. Our analysis reveals significant mismatches between users’ privacy perceptions and reality (i.e., privacy policies of Facebook apps), where we identified over-optimism not only in users’ perceptions of information collection, but also on their self-efficacy in protecting their information in Facebook despite experiencing negative incidents in the past. To the best of our knowledge, this is the first study on the gap between users’ privacy perceptions around Facebook apps and the reality. The findings from this study offer directions for future research to address that gap through designing usable, effective, and personalized privacy notices to help users to make informed decisions about using Facebook apps. 
    more » « less
  5. Website privacy policies sometimes provide users the option to opt-out of certain collections and uses of their personal data. Unfortunately, many privacy policies bury these instructions deep in their text, and few web users have the time or skill necessary to discover them. We describe a method for the automated detection of opt-out choices in privacy policy text and their presentation to users through a web browser extension. We describe the creation of two corpora of opt-out choices, which enable the training of classifiers to identify opt-outs in privacy policies. Our overall approach for extracting and classifying opt-out choices combines heuristics to identify commonly found opt-out hyperlinks with supervised machine learning to automatically identify less conspicuous instances. Our approach achieves a precision of 0.93 and a recall of 0.9. We introduce Opt-Out Easy, a web browser extension designed to present available opt-out choices to users as they browse the web. We evaluate the usability of our browser extension with a user study. We also present results of a large-scale analysis of opt-outs found in the text of thousands of the most popular websites. 
    more » « less