Title: A large-scale exploration of terms of service documents on the web
Terms of service documents are a common feature of organizations' websites. Although there is no blanket requirement for organizations to provide these documents, their provision often serves essential legal purposes. Users of a website are expected to agree to the contents of a terms of service document, but they tend to ignore these documents because they are often lengthy and difficult to comprehend. As a step towards understanding the landscape of these documents at a large scale, we present a first-of-its-kind terms of service corpus containing 247,212 English-language terms of service documents obtained from company websites sampled from the Free Company Dataset. We examine the URLs and contents of the documents and find that some websites that purport to post terms of service actually do not provide them. We analyze reasons for unavailability and determine the overall availability of terms of service in a given set of website domains. We also find that some websites provide an agreement that combines terms of service with a privacy policy, which is often an obligatory separate document. Using topic modeling, we analyze the themes in these combined documents by comparing them with themes found in separate terms of service documents and privacy policies. Results suggest that such single-page agreements miss some of the most prevalent topics found in typical privacy policies and terms of service documents, and that many disproportionately cover privacy policy topics as compared to terms of service topics.
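The topic-modeling comparison described in the abstract can be sketched, in highly simplified form, with latent Dirichlet allocation: fit topics over standalone and combined documents, then inspect the topic mix of a combined agreement. This is an illustration only, not the authors' actual pipeline; the toy corpus and topic count below are invented for demonstration.

```python
# Minimal LDA sketch: compare the topic mix of a combined
# ToS+privacy document against standalone documents.
# Toy data and n_components=2 are assumptions for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

standalone_docs = [
    "we collect personal data cookies tracking analytics",     # privacy-like
    "liability warranty termination governing law arbitration",  # ToS-like
]
combined_docs = [
    "we collect personal data cookies liability termination",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(standalone_docs + combined_docs)

# Fit a small topic model over all documents together.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row is a topic distribution

# A lopsided distribution for the combined document would suggest it
# leans toward one document type's themes.
combined_mix = doc_topics[-1]
print(combined_mix.round(2))
```

In the paper's setting, comparing such per-document topic distributions at corpus scale is what reveals that combined agreements under-cover some prevalent topics.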
Award ID(s):
2105736
NSF-PAR ID:
10333523
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 21st ACM Symposium on Document Engineering
Page Range / eLocation ID:
1 to 4
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it exceeds the number of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state-of-the-art results on the data practice classification and question answering tasks.
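One pipeline stage mentioned above, near-duplicate removal, can be sketched with word-shingle Jaccard similarity. The shingle size and threshold below are assumptions for illustration; the actual corpus pipeline may use a different technique.

```python
# Sketch of near-duplicate detection via word-shingle Jaccard
# similarity. k=3 and the 0.7 threshold are illustrative choices,
# not values taken from the paper.
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc1: str, doc2: str, threshold: float = 0.7) -> bool:
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold

# Two policies differing only in the final word score high.
policy_a = "we may share your data with trusted partners for analytics"
policy_b = "we may share your data with trusted partners for advertising"
print(is_near_duplicate(policy_a, policy_b))  # prints True
```

At the scale of a million documents, pairwise comparison like this is impractical; schemes such as MinHash with locality-sensitive hashing approximate the same Jaccard test efficiently.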
  2. Cloud legal documents, such as privacy policies and terms of service (ToS), include key terms and rules that enable consumers to continuously monitor the performance of the cloud services used in their organization. To ensure high consumer confidence in a cloud service, it is necessary that these documents are clear and comprehensible to the average consumer. However, in practice, service providers often use legalese and ambiguous language in cloud legal documents, resulting in consumers accepting or rejecting the terms without understanding the details. A measure capturing ambiguity in the texts of cloud service documents will enable consumers to decide whether they understand what they are agreeing to and whether the service will meet their organizational requirements. It will also allow them to compare service policies across vendors. We have developed a novel model, ViCLOUD, that defines a scoring method based on linguistic cues to measure ambiguity in cloud legal documents and compare them to those of peer websites. In this paper, we describe the ViCLOUD model in detail, along with validation results from applying it to 112 privacy policies and 108 terms of service documents of 115 cloud service vendors. The score distribution gives us a landscape of current trends in cloud services and a scale of comparison for new documentation. Our model will be very useful to organizations in making judicious decisions when selecting a cloud service.
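A linguistic-cue ambiguity score of the kind ViCLOUD defines can be illustrated with a toy version that counts hedge words. The word list and scoring below are invented for demonstration and do not reproduce the model's actual cues or weights.

```python
# Toy ambiguity score in the spirit of a linguistic-cue model:
# fraction of tokens that are vague modal/hedge words.
# VAGUE_TERMS is an invented illustrative list, not ViCLOUD's.
VAGUE_TERMS = {"may", "might", "reasonable", "appropriate",
               "periodically", "generally", "necessary"}

def ambiguity_score(text: str) -> float:
    """Return the fraction of tokens that are vague hedge words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in VAGUE_TERMS for t in tokens) / len(tokens)

clear = "we delete your data within 30 days of account closure"
vague = "we may retain data for a reasonable period when necessary"
print(ambiguity_score(clear), ambiguity_score(vague))  # 0.0 0.3
```

A real model would weight distinct cue types (modality, vagueness, numeric quantifiers) rather than pooling them into one list.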
  3. With the rapid adoption of web services, the need to protect against various threats has become imperative for organizations operating in cyberspace. Organizations are increasingly opting to get financial cover in the event of losses due to a security incident. This helps them safeguard against the threat posed to third-party services that the organization uses. It is in the organization's interest to understand the insurance requirements and procure all necessary direct and liability coverages. This helps transfer some risks to the insurance providers. However, cyber insurance policies often list details about coverages and exclusions using legalese that can be difficult to comprehend. Currently, it takes significant manual effort to parse and extract actionable rules from these lengthy and complicated policy documents. We have developed a semantically rich, machine-processable framework to automatically analyze cyber insurance policies and populate a knowledge graph that efficiently captures the various inclusion and exclusion terms and rules embedded in each policy. In this paper, we describe this framework, which has been built using technologies from AI, including the Semantic Web, Modal/Deontic Logic, and Natural Language Processing. We have validated our approach using industry standards proposed by the United States Federal Trade Commission (FTC) and applied it against publicly available policies of 7 cyber insurance vendors. Our system will enable cyber insurance seekers to automatically analyze various policy documents and make a well-informed decision by identifying their inclusions and exclusions.
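The knowledge-graph idea above can be sketched as subject-predicate-object triples capturing coverage and exclusion rules. The entity and relation names below are invented for illustration; the paper's framework uses a richer Semantic Web ontology with deontic-logic rules.

```python
# Hedged sketch: coverage/exclusion rules as triples in a tiny
# in-memory store. "ExamplePolicy" and the relation names are
# hypothetical, not taken from the paper's ontology.
triples = set()

def add_rule(policy: str, relation: str, term: str) -> None:
    triples.add((policy, relation, term))

add_rule("ExamplePolicy", "covers", "data_breach_response")
add_rule("ExamplePolicy", "covers", "business_interruption")
add_rule("ExamplePolicy", "excludes", "acts_of_war")

def query(relation: str):
    """Return all terms linked to any policy by `relation`, sorted."""
    return sorted(t for (_, r, t) in triples if r == relation)

print(query("covers"))    # coverage terms
print(query("excludes"))  # exclusion terms
```

Storing rules this way is what lets a seeker's coverage criteria be answered with graph queries rather than manual reading; a production system would use an RDF store and a query language such as SPARQL.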
  4. Website privacy policies sometimes provide users the option to opt-out of certain collections and uses of their personal data. Unfortunately, many privacy policies bury these instructions deep in their text, and few web users have the time or skill necessary to discover them. We describe a method for the automated detection of opt-out choices in privacy policy text and their presentation to users through a web browser extension. We describe the creation of two corpora of opt-out choices, which enable the training of classifiers to identify opt-outs in privacy policies. Our overall approach for extracting and classifying opt-out choices combines heuristics to identify commonly found opt-out hyperlinks with supervised machine learning to automatically identify less conspicuous instances. Our approach achieves a precision of 0.93 and a recall of 0.9. We introduce Opt-Out Easy, a web browser extension designed to present available opt-out choices to users as they browse the web. We evaluate the usability of our browser extension with a user study. We also present results of a large-scale analysis of opt-outs found in the text of thousands of the most popular websites. 
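The heuristic stage of the approach above, identifying commonly found opt-out hyperlinks, can be sketched with keyword matching on anchor text. The keyword list is illustrative, and a real pipeline would parse HTML properly and pair this with the supervised classifier for less conspicuous instances.

```python
# Sketch: flag likely opt-out hyperlinks by keyword-matching the
# anchor text. The keywords and the regex-based HTML scan are
# simplifications for illustration.
import re

OPT_OUT_KEYWORDS = re.compile(
    r"\bopt[- ]?out\b|\bunsubscribe\b|\bdo not sell\b", re.IGNORECASE
)

def find_opt_out_links(html: str):
    """Return (href, anchor_text) pairs whose text looks like an opt-out."""
    anchors = re.findall(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                         html, re.IGNORECASE | re.DOTALL)
    return [(href, text) for href, text in anchors
            if OPT_OUT_KEYWORDS.search(text)]

page = ('<p>See our <a href="/choices">opt-out choices</a> or '
        '<a href="/about">about us</a>.</p>')
print(find_opt_out_links(page))  # [('/choices', 'opt-out choices')]
```

Heuristics like this give high precision on conventional phrasings; the trained classifier handles the buried, unconventionally worded opt-outs that keyword matching misses.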
  5. With the rapid enhancements in technology and the adoption of web services, there has been a significant increase in the cyber threats faced by organizations in cyberspace. Organizations want to purchase adequate cyber insurance to safeguard against the third-party services they use. However, cyber insurance policies describe their coverages and exclusions using legal jargon that can be difficult to comprehend. Parsing these policy documents and extracting the rules embedded in them is currently a very manual, time-consuming process. We have developed a novel framework that automatically extracts the coverage and exclusion key terms and rules embedded in a cyber policy. We have built our framework using Information Retrieval and Artificial Intelligence techniques, specifically the Semantic Web and Modal Logic. We have also developed a web interface where users can find the best-matching cyber insurance policy based on particular coverage criteria. To validate our approach, we used industry standards proposed by the Federal Trade Commission (FTC) and applied them against publicly available policies of seven insurance providers. Our system will allow cyber insurance seekers to explore various policy documents and compare the terms described in those documents while selecting the most relevant policy.