skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A large-scale exploration of terms of service documents on the web
Terms of service documents are a common feature of organizations' websites. Although there is no blanket requirement for organizations to provide these documents, their provision often serves essential legal purposes. Users of a website are expected to agree with the contents of a terms of service document, but users tend to ignore these documents as they are often lengthy and difficult to comprehend. As a step towards understanding the landscape of these documents at a large scale, we present a first-of-its-kind terms of service corpus containing 247,212 English language terms of service documents obtained from company websites sampled from Free Company Dataset. We examine the URLs and contents of the documents and find that some websites that purport to post terms of service actually do not provide them. We analyze reasons for unavailability and determine the overall availability of terms of service in a given set of website domains. We also identify that some websites provide an agreement that combines terms of service with a privacy policy, which is often an obligatory separate document. Using topic modeling, we analyze the themes in these combined documents by comparing them with themes found in separate terms of service and privacy policies. Results suggest that such single-page agreements miss some of the most prevalent topics available in typical privacy policies and terms of service documents and that many disproportionately cover privacy policy topics as compared to terms of service topics.  more » « less
Award ID(s):
2105736
PAR ID:
10333523
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 21st ACM Symposium on Document Engineering
Page Range / eLocation ID:
1 to 4
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks. 
    more » « less
  2. null (Ed.)
    Cloud Legal documents, like Privacy Policies and Terms of Services (ToS), include key terms and rules that enable consumers to continuously monitor the performance of the cloud services used in their organization. To ensure high consumer confidence in the cloud service, it is necessary that these documents are clear and comprehensible to the average consumer. However, in practice, service providers often use legalese and ambiguous language in cloud legal documents resulting in consumers consenting or rejecting the terms without understanding the details. A measure capturing ambiguity in the texts of cloud service documents will enable consumers to decide if they understand what they are agreeing to, and deciding whether that service will meet their organizational requirements. It will also allow them to compare the service policies across various vendors. We have developed a novel model, ViCLOUD, that defines a scoring method based on linguistic cues to measure ambiguity in cloud legal documents and compare them to other peer websites. In this paper, we describe the ViCLOUD model in detail along with the validation results when applying it to 112 privacy policies and 108 Terms of Service documents of 115 cloud service vendors. The score distribution gives us a landscape of current trends in cloud services and a scale of comparison for new documentation. Our model will be very useful to organizations in making judicious decisions when selecting their cloud service. 
    more » « less
  3. Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web and find that privacy policies URLs are only available in 34% of websites. Further, 1.37% of these URLs are broken links and 1.23% of the valid links lead to pages without a policy. Further, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English language privacy policies on the web and the distribution of these documents across top level domains and sectors of commerce. We estimate the lower bound on the number of English language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata consisting of policy URL, length, readability, sector of commerce, and policy crawl date. 
    more » « less
  4. Villata, S. (Ed.)
    The European Union’s General Data Protection Regulation (GDPR) has compelled businesses and other organizations to update their privacy policies to state specific information about their data practices. Simultaneously, researchers in natural language processing (NLP) have developed corpora and annotation schemes for extracting salient information from privacy policies, often independently of specific laws. To connect existing NLP research on privacy policies with the GDPR, we introduce a mapping from GDPR provisions to the OPP-115 annotation scheme, which serves as the basis for a growing number of projects to automatically classify privacy policy text. We show that assumptions made in the annotation scheme about the essential topics for a privacy policy reflect many of the same topics that the GDPR requires in these documents. This suggests that OPP-115 continues to be representative of the anatomy of a legally compliant privacy policy, and that the legal assumptions behind it represent the elements of data processing that ought to be disclosed within a policy for transparency. The correspondences we show between OPP-115 and the GDPR suggest the feasibility of bridging existing computational and legal research on privacy policies, benefiting both areas. 
    more » « less
  5. null (Ed.)
    The European Union’s General Data Protection Regulation (GDPR) has compelled businesses and other organizations to update their privacy policies to state specific information about their data practices. Simultaneously, researchers in natural language processing (NLP) have developed corpora and annotation schemes for extracting salient information from privacy policies, often independently of specific laws. To connect existing NLP research on privacy policies with the GDPR, we introduce a mapping from GDPR provisions to the OPP-115 annotation scheme, which serves as the basis for a growing number of projects to automatically classify privacy policy text. We show that assumptions made in the annotation scheme about the essential topics for a privacy policy reflect many of the same topics that the GDPR requires in these documents. This suggests that OPP-115 continues to be representative of the anatomy of a legally compliant privacy policy, and that the legal assumptions behind it represent the elements of data processing that ought to be disclosed within a policy for transparency. The correspondences we show between OPP-115 and the GDPR suggest the feasibility of bridging existing computational and legal research on privacy policies, benefiting both areas. 
    more » « less