

Title: Automatic Classification of Web and IoT Privacy Policies
Privacy policies, despite the important information they provide about the collection and use of one's data, tend to be skipped over by most Internet users. In this paper, we seek to make privacy policies more accessible by automatically classifying them. We use natural language processing techniques and multiple machine learning models, and compare how effective each model is at the classification task. We also explore the effectiveness of these methods in classifying the privacy policies of Internet of Things (IoT) devices.
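The classification approach the abstract describes can be sketched in miniature. This is an illustrative assumption, not the paper's actual pipeline: a tiny bag-of-words Naive Bayes classifier over invented privacy-policy snippets, standing in for the feature extraction and model comparison the authors perform at scale.

```python
# Hypothetical sketch of text classification for privacy-policy snippets.
# The example texts, labels, and the Naive Bayes model are illustrative
# assumptions; the paper compares multiple (unspecified here) ML models.
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, label). Returns per-class word counts,
    per-class totals, and the shared vocabulary."""
    class_words = defaultdict(list)
    for text, label in docs:
        class_words[label].extend(text.lower().split())
    counts = {c: Counter(words) for c, words in class_words.items()}
    totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
    vocab = {w for words in class_words.values() for w in words}
    return counts, totals, vocab

def predict(text, counts, totals, vocab):
    # Log-likelihood with add-one (Laplace) smoothing; class priors are
    # omitted because the toy training set is balanced.
    scores = {}
    for c in counts:
        score = 0.0
        for w in text.lower().split():
            score += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

docs = [
    ("we share your data with third party advertisers", "shares"),
    ("your information may be sold to marketing partners", "shares"),
    ("we never share your personal data", "no-share"),
    ("data stays on your device and is not transmitted", "no-share"),
]
counts, totals, vocab = train(docs)
print(predict("we sell your data to advertisers", counts, totals, vocab))
# → shares
```

A real system would replace the toy corpus with labeled policy text and swap in stronger models, but the train/predict split shown here is the shape any of the compared classifiers would share.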
Award ID(s):
1950416
NSF-PAR ID:
10376459
Author(s) / Creator(s):
Date Published:
Journal Name:
REUNS 2022: The 8th National Workshop for REU Research in Networking and Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Privacy policies contain important information regarding the collection and use of users’ data. As Internet of Things (IoT) devices have become popular in recent years, these policies have become important for protecting IoT users from unwanted use of the private data collected through them. However, IoT policies tend to be long, which discourages users from reading them. In this paper, we seek to create an automated and annotated corpus for IoT privacy policies through the use of natural language processing techniques. Our method extracts the purpose from privacy policies and allows users to quickly find the important information relevant to their data collection/use.
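The purpose-extraction step described in item 1 can be approximated with cue-phrase matching. This is a hedged sketch, not the paper's method: the cue-phrase list and the example policy text are assumptions made for illustration.

```python
# Hedged sketch: pull out sentences that state a data-use purpose from an
# IoT privacy policy using simple cue phrases. The cues and the sample
# policy are invented; the paper's NLP approach is not specified here.
import re

PURPOSE_CUES = ("in order to", "to provide", "for the purpose of", "so that we can")

def extract_purposes(policy_text):
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", policy_text)
    return [s for s in sentences if any(cue in s.lower() for cue in PURPOSE_CUES)]

policy = (
    "Our smart thermostat records temperature readings. "
    "We collect your location in order to adjust heating schedules. "
    "This policy may change at any time. "
    "Usage statistics are processed to provide energy reports."
)
for sentence in extract_purposes(policy):
    print("-", sentence)
```

A production system would use a trained model rather than fixed cues, but the output shape (the subset of sentences stating a purpose) matches what the annotated corpus aims to surface for users.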
  2. Development of a comprehensive legal privacy framework in the United States should be based on identification of the common deficiencies of privacy policies. We attempt to delineate deficiencies by critically analyzing the privacy policies of mobile apps, application suites, social networks, Internet Service Providers, and Internet-of-Things devices. Whereas many studies have examined readability of privacy policies, few have specifically identified the information that should be provided in privacy policies but is not. Privacy legislation invariably starts with a definition of personally identifiable information. We find that privacy policies’ definitions of personally identifiable information are far too restrictive, excluding information that does not itself identify a person but which can be used to reasonably identify a person, and excluding information paired with a device identifier which can be reasonably linked to a person. Legislation should define personally identifiable information to include such information, and should differentiate between information paired with a name versus information paired with a device identifier. Privacy legislation often excludes anonymous and de-identified information from notice and choice requirements. We find that privacy policies’ descriptions of anonymous and de-identified information are far too broad, including information paired with advertising identifiers. Computer science has repeatedly demonstrated that such information is reasonably linkable. Legislation should define these categories of information to align with technological abilities. Legislation should also not exempt de-identified information from notice requirements, to increase transparency. Privacy legislation relies heavily on notice requirements.
We find that, because privacy policies’ disclosures of the uses of personal information are disconnected from their disclosures about the types of personal information collected, we are often unable to determine which types of information are used for which purposes. Often, we cannot determine whether location or web browsing history is used solely for functional purposes or also for advertising. Legislation should require the disclosure of the purposes for each type of personal information collected. We also find that, because privacy policies’ disclosures of sharing of personal information are disconnected from their disclosures about the types of personal information collected, we are often unable to determine which types of information are shared. Legislation should require the disclosure of the types of personal information shared. Finally, privacy legislation relies heavily on user choice. We find that free services often require the collection and sharing of personal information. As a result, users often have no choices. We find that whereas some paid services afford users a wide variety of choices, paid services in less competitive sectors often afford users few choices over use and sharing of personal information for purposes unrelated to the service. As a result, users are often unable to dictate which types of information they wish to allow to be shared, and which types they wish to allow to be used for advertising. Legislation should differentiate between take-it-or-leave-it, opt-out, and opt-in approaches based on the type of use and on whether the information is shared. Congress should consider whether user choices should be affected by the presence of market power.
  3. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks. 
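The near-duplicate-removal stage of the corpus pipeline in item 3 can be sketched with word shingles and Jaccard similarity, a common technique for this task. The shingle size and similarity threshold below are assumptions for illustration; the actual PrivaSeer pipeline may use a different method.

```python
# Illustrative sketch of near-duplicate removal in a corpus pipeline.
# k (shingle size) and the 0.7 threshold are assumed values, not taken
# from the PrivaSeer paper.
def shingles(text, k=3):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def dedupe(policies, threshold=0.7):
    kept, kept_shingles = [], []
    for text in policies:
        sh = shingles(text)
        # Keep a document only if it is not too similar to any kept one.
        if all(jaccard(sh, prev) < threshold for prev in kept_shingles):
            kept.append(text)
            kept_shingles.append(sh)
    return kept

docs = [
    "we collect your email address to create and manage your account",
    "we collect your email address to create and manage your user account",
    "cookies are used only to keep you logged in during a session",
]
print(len(dedupe(docs)))  # → 2: the near-duplicate pair collapses to one
```

At the scale of a million documents, exact pairwise Jaccard is too slow; pipelines like this typically switch to MinHash or SimHash signatures, which approximate the same similarity test.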
  4. In this paper, we examine private-sector collection and use of metadata and telemetry information and provide three main contributions: First, we lay out the extent to which “non-content”—the hidden parts of Internet communications (aspects the user does not explicitly enter) and telemetry—are highly revelatory of personal behavior. We show that, privacy policies notwithstanding, users rarely know that the metadata and telemetry information is being collected and almost never know the uses to which it is being put. Second, we show that consumers, even if they knew the uses to which this type of personal information were being put, lack effective means to control the use of this type of data. The standard tool of notice-and-choice has well known problems, including the user’s lack of information with which to make a choice; and then, even if the user had sufficient information, doing so is not practical. These are greatly exacerbated by the nature of the interchanges for communications metadata and telemetry information. Each new transmission—each click on an internal link on a webpage, for example—may carry different implications for a user in terms of privacy. The current regimen, notice-and-choice, presents a completely unworkable set of requests for a user, who could well be responding many times a minute regarding whether to allow the use of metadata beyond the purposes of content delivery and display. This is especially the case for telemetry, where the ability to understand both present and future use of the data provided from the sensors requires a deeper understanding of what information these devices can provide than anyone but a trained engineer would know. Third, while there has been academic and industry research on telemetry’s use, there has been little exploration of the policy and legal implications stemming from that use.
We provide this factor, while at the same time addressing the closely related issues raised by industry’s use of communications metadata to track user interests and behavior.
  5. Biometrics have been used increasingly for identity authentication in many critical public services, such as border crossings or security checkpoints. However, traditional biometrics-based identity management systems collect and store personal biometric data in a centralized server or database, and an individual has no control over how her biometrics will be used or for what purpose. Such systems can result in serious security and privacy issues for sensitive personal data. In this paper, we design a novel approach that leverages biometrics and blockchain/smart contracts to enable secure and privacy-preserving identity management. The basic idea is to use the blockchain to store an authority's attestation and the transformed value of an individual's biometrics. The stored data on the blockchain is then controlled by smart contracts which define various access control policies, e.g., access parties, access times, etc. The owner of the biometric data can flexibly change the access control policies, through a white list, a timer, and other methods, for any identity verifiers. We used the well-known Ethereum platform to implement the proposed approach and tested the effectiveness as well as the flexibility of various access control policies.
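The access-control logic item 5 stores in smart contracts (whitelist plus timer guarding a transformed biometric) can be sketched off-chain. This is a hedged, chain-free model: the class and method names, and the SHA-256 commitment standing in for the paper's "transformed value", are illustrative assumptions, not the Ethereum implementation the authors built.

```python
# Hedged, chain-free sketch of whitelist + timer access control over a
# stored biometric commitment. All names and the hash-based transform are
# illustrative assumptions; the paper implements this as Ethereum smart
# contracts.
import hashlib
import time

class BiometricRecord:
    def __init__(self, biometric_bytes, owner):
        # Store only a one-way transform of the biometric, never raw data.
        self.commitment = hashlib.sha256(biometric_bytes).hexdigest()
        self.owner = owner
        self.whitelist = {}  # verifier -> access expiry (unix seconds)

    def grant(self, caller, verifier, duration_s):
        # Only the data owner may change the access policy.
        if caller != self.owner:
            raise PermissionError("only the owner may change the policy")
        self.whitelist[verifier] = time.time() + duration_s

    def verify(self, verifier, candidate_bytes):
        expiry = self.whitelist.get(verifier)
        if expiry is None or time.time() > expiry:
            return False  # not whitelisted, or the timer has lapsed
        return hashlib.sha256(candidate_bytes).hexdigest() == self.commitment

record = BiometricRecord(b"fingerprint-template", owner="alice")
record.grant("alice", "border-agency", duration_s=3600)
print(record.verify("border-agency", b"fingerprint-template"))  # → True
print(record.verify("unknown-party", b"fingerprint-template"))  # → False
```

On-chain, the same checks would run inside contract methods, with the blockchain providing the tamper-evident storage and the owner's key playing the role of the `caller` check.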