Title: Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text
Website privacy policies sometimes give users the option to opt out of certain collections and uses of their personal data. Unfortunately, many privacy policies bury these instructions deep in their text, and few web users have the time or skill necessary to discover them. We describe a method for the automated detection of opt-out choices in privacy policy text and their presentation to users through a web browser extension. We describe the creation of two corpora of opt-out choices, which enable the training of classifiers to identify opt-outs in privacy policies. Our overall approach for extracting and classifying opt-out choices combines heuristics that identify commonly found opt-out hyperlinks with supervised machine learning that automatically identifies less conspicuous instances. Our approach achieves a precision of 0.93 and a recall of 0.9. We introduce Opt-Out Easy, a web browser extension designed to present available opt-out choices to users as they browse the web. We evaluate the usability of our browser extension with a user study. We also present results of a large-scale analysis of opt-outs found in the text of thousands of the most popular websites.
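The heuristic stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword pattern `OPT_OUT_PATTERN`, the `LinkCollector` class, and the `find_opt_out_links` function are hypothetical names chosen for illustration, and the supervised classifier that handles less conspicuous instances is not shown. The sketch assumes that an opt-out hyperlink can be recognized from keywords in its anchor text or URL.

```python
import re
from html.parser import HTMLParser

# Hypothetical keyword pattern; the abstract does not list the actual heuristics.
OPT_OUT_PATTERN = re.compile(r"opt[\s-]?out|unsubscribe|do not sell", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_opt_out_links(html):
    """Return links whose anchor text or URL matches the keyword heuristic."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        (href, text)
        for href, text in collector.links
        if OPT_OUT_PATTERN.search(text) or OPT_OUT_PATTERN.search(href)
    ]


sample = (
    '<p>You may <a href="https://example.com/optout">opt out</a> of ads. '
    'See our <a href="https://example.com/about">About page</a>.</p>'
)
print(find_opt_out_links(sample))
```

A keyword filter of this kind only catches conspicuous links. Per the abstract, the less obvious opt-out statements are left to a trained classifier.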
Award ID(s):
1914486
NSF-PAR ID:
10169862
Journal Name:
WWW '20: Proceedings of the Web Conference 2020
Page Range or eLocation-ID:
1943 to 1954
Sponsoring Org:
National Science Foundation
More Like this
  1. Increasingly, icons are being proposed to concisely convey privacy-related information and choices to users. However, complex privacy concepts can be difficult to communicate. We investigate which icons effectively signal the presence of privacy choices. In a series of user studies, we designed and evaluated icons and accompanying textual descriptions (link texts) conveying choice, opting-out, and sale of personal information, the latter an opt-out mandated by the California Consumer Privacy Act (CCPA). We identified icon-link text pairings that conveyed the presence of privacy choices without creating misconceptions, with a blue stylized toggle icon paired with “Privacy Options” performing best. The two CCPA-mandated link texts (“Do Not Sell My Personal Information” and “Do Not Sell My Info”) accurately communicated the presence of do-not-sell opt-outs with most icons. Our results provide insights for the design of privacy choice indicators and highlight the necessity of incorporating user testing into policy making.
  2. Increasingly, icons are being proposed to concisely convey privacy-related information and choices to users. However, complex privacy concepts can be difficult to communicate. We investigate which icons effectively signal the presence of privacy choices. In a series of user studies, we designed and evaluated icons and accompanying textual descriptions (link texts) conveying choice, opting-out, and sale of personal information, the latter an opt-out mandated by the California Consumer Privacy Act (CCPA). We identified icon-link text pairings that conveyed the presence of privacy choices without creating misconceptions, with a blue stylized toggle icon paired with “Privacy Options” performing best. The two CCPA-mandated link texts (“Do Not Sell My Personal Information” and “Do Not Sell My Info”) accurately communicated the presence of do-not-sell opt-outs with most icons. Our results provide insights for the design of privacy choice indicators and highlight the necessity of incorporating user testing into policy making.
  3. Browser users encounter a broad array of potentially intrusive practices: from behavioral profiling, to crypto-mining, fingerprinting, and more. We study people’s perception, awareness, understanding, and preferences to opt out of those practices. We conducted a mixed-methods study that included qualitative (n=186) and quantitative (n=888) surveys covering 8 neutrally presented practices, equally highlighting both their benefits and risks. Consistent with prior research focusing on specific practices and mitigation techniques, we observe that most people are unaware of how to effectively identify or control the practices we surveyed. However, our user-centered approach reveals diverse views about the perceived risks and benefits, and that the majority of our participants wished to both restrict and be explicitly notified about the surveyed practices. Though prior research shows that meaningful controls are rarely available, we found that many participants mistakenly assume opt-out settings are common but just too difficult to find. However, even if they were hypothetically available on every website, our findings suggest that settings which allow practices by default are more burdensome to users than alternatives which are contextualized to website categories instead. Our results argue for settings which can distinguish among website categories where certain practices are seen as permissible, proactively notify users about their presence, and otherwise deny intrusive practices by default. Standardizing these settings in the browser, rather than leaving them to individual websites, would have the advantage of providing a uniform interface to support notification and control, and could help mitigate dark patterns. We also discuss the regulatory implications of the findings.
  4. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state-of-the-art results on the data practice classification and question answering tasks.
  5. Over half of all visits to websites now take place in a mobile browser, yet the majority of web privacy studies take the vantage point of desktop browsers, use emulated mobile browsers, or focus on just a single mobile browser instead. In this paper, we present a comprehensive web-tracking measurement study on mobile browsers and privacy-focused mobile browsers. Our study leverages a new web measurement infrastructure, OmniCrawl, which we develop to drive browsers on desktop computers and smartphones located on two continents. We capture web tracking measurements using 42 different non-emulated browsers simultaneously. We find that the third-party advertising and tracking ecosystem of mobile browsers is more similar to that of desktop browsers than previous findings suggested. We study privacy-focused browsers and find that their protections differ significantly and are generally weaker for lower-ranked sites. Our findings also show that common methodological choices made by web measurement studies, such as the use of emulated mobile browsers and Selenium, can lead to website behavior that deviates from what actual users experience.