skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 8:00 PM ET on Friday, March 21 until 8:00 AM ET on Saturday, March 22 due to maintenance. We apologize for the inconvenience.


Title: Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text
Website privacy policies sometimes provide users the option to opt-out of certain collections and uses of their personal data. Unfortunately, many privacy policies bury these instructions deep in their text, and few web users have the time or skill necessary to discover them. We describe a method for the automated detection of opt-out choices in privacy policy text and their presentation to users through a web browser extension. We describe the creation of two corpora of opt-out choices, which enable the training of classifiers to identify opt-outs in privacy policies. Our overall approach for extracting and classifying opt-out choices combines heuristics to identify commonly found opt-out hyperlinks with supervised machine learning to automatically identify less conspicuous instances. Our approach achieves a precision of 0.93 and a recall of 0.9. We introduce Opt-Out Easy, a web browser extension designed to present available opt-out choices to users as they browse the web. We evaluate the usability of our browser extension with a user study. We also present results of a large-scale analysis of opt-outs found in the text of thousands of the most popular websites.  more » « less
Award ID(s):
1914486
PAR ID:
10169862
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
WWW '20: Proceedings of the Web Conference 2020
Page Range / eLocation ID:
1943 to 1954
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Mazurek, Michelle ; Sher, Micah (Ed.)

    Web tracking by ad networks and other data-driven businesses is often privacy-invasive. Privacy laws, such as the California Consumer Privacy Act, aim to give people more control over their data. In particular, they provide a right to opt out from web tracking via privacy preference signals, notably Global Privacy Control (GPC). GPC holds the promise of enabling people to exercise their opt out rights on the web. Broad adoption of GPC hinges on its usability. In a usability survey we find that 94% of the participants would turn on GPC indicating a need for such efficient and effective opt out mechanism. 81% of the participants in our survey also have a correct understanding of what GPC does ensuring that their intent is accurately represented by their choice. The effectiveness of GPC is dependent on whether websites' GPC compliance can be enforced. A site's GPC compliance can be analyzed based on privacy flags, such as the US Privacy String, which is used on many sites to indicate the opt out status of a web user. Leveraging the US Privacy String for GPC purposes we implement a proof-of-concept browser extension that successfully and correctly analyzes sites' GPC compliance at a rate of 89%. We further implement a web crawler for our browser extension demonstrating that our analysis approach is scalable. We find that many sites do not respect GPC opt out signals despite being legally obligated to do so. Only 54/464 (12%) sites with a US Privacy String opt out users after having received a GPC signal.

     
    more » « less
  2. null (Ed.)
    Increasingly, icons are being proposed to concisely convey privacy-related information and choices to users. However, complex privacy concepts can be difficult to communicate. We investigate which icons effectively signal the presence of privacy choices. In a series of user studies, we designed and evaluated icons and accompanying textual descriptions (link texts) conveying choice, opting-out, and sale of personal information — the latter an opt-out mandated by the California Consumer Privacy Act (CCPA). We identified icon-link text pairings that conveyed the presence of privacy choices without creating misconceptions, with a blue stylized toggle icon paired with “Privacy Options” performing best. The two CCPA-mandated link texts (“Do Not Sell My Personal Information” and “Do Not Sell My Info”) accurately communicated the presence of do-not-sell opt-outs with most icons. Our results provide insights for the design of privacy choice indicators and highlight the necessity of incorporating user testing into policy making. 
    more » « less
  3. null (Ed.)
    Increasingly, icons are being proposed to concisely convey privacyrelated information and choices to users. However, complex privacy concepts can be difcult to communicate. We investigate which icons efectively signal the presence of privacy choices. In a series of user studies, we designed and evaluated icons and accompanying textual descriptions (link texts) conveying choice, opting-out, and sale of personal information — the latter an opt-out mandated by the California Consumer Privacy Act (CCPA). We identifed icon-link text pairings that conveyed the presence of privacy choices without creating misconceptions, with a blue stylized toggle icon paired with “Privacy Options” performing best. The two CCPA-mandated link texts (“Do Not Sell My Personal Information” and “Do Not Sell My Info”) accurately communicated the presence of do-notsell opt-outs with most icons. Our results provide insights for the design of privacy choice indicators and highlight the necessity of incorporating user testing into policy making. 
    more » « less
  4. The California Consumer Privacy Act and other privacy laws give people a right to opt out of the sale and sharing of personal information. In combination with privacy preference signals, especially, Global Privacy Control (GPC), such rights have the potential to empower people to assert control over their data. However, many laws prohibit opt out settings being turned on by default. The resulting usability challenges for people to exercise their rights motivate generalizable active privacy choice --- an interface design principle to make opt out settings usable without defaults. It is based on the idea of generalizing one individual opt out choice towards a larger set of choices. For example, people may apply an opt out choice on one site towards a larger set of sites. We explore generalizable active privacy choice in the context of GPC. We design and implement nine privacy choice schemes in a browser extension and explore them in a usability study with 410 participants. We find that generalizability features tend to decrease opt out utility slightly. However, at the same time, they increase opt out efficiency and make opting out less disruptive, which was more important to most participants. For the least disruptive scheme, selecting website categories to opt out from, 98% of participants expressed not feeling disrupted, a 40% point increase over the baseline schemes. 83% of participants understood the meaning of GPC. They also made their opt out choices with intent and, thus, in a legally relevant manner. To help people exercise their opt out rights via GPC our results support the adoption of a generalizable active privacy choice interface in web browsers.

     
    more » « less
  5. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks. 
    more » « less