skip to main content

Title: Having your Privacy Cake and Eating it Too: Platform-supported Auditing of Social Media Algorithms for Public Interest
Social media platforms curate access to information and opportunities, and so play a critical role in shaping public discourse today. The opaque nature of the algorithms these platforms use to curate content raises societal questions. Prior studies have used black-box methods led by experts or collaborative audits driven by everyday users to show that these algorithms can lead to biased or discriminatory outcomes. However, existing auditing methods face fundamental limitations because they function independent of the platforms. Concerns of potential harmful outcomes have prompted proposal of legislation in both the U.S. and the E.U. to mandate a new form of auditing where vetted external researchers get privileged access to social media platforms. Unfortunately, to date there have been no concrete technical proposals to provide such auditing, because auditing at scale risks disclosure of users' private data and platforms' proprietary algorithms. We propose a new method for platform-supported auditing that can meet the goals of the proposed legislation. The first contribution of our work is to enumerate the challenges and the limitations of existing auditing methods to implement these policies at scale. Second, we suggest that limited, privileged access to relevance estimators is the key to enabling generalizable platform-supported auditing of social media platforms by external researchers. Third, we show platform-supported auditing need not risk user privacy nor disclosure of platforms' business interests by proposing an auditing framework that protects against these risks. For a particular fairness metric, we show that ensuring privacy imposes only a small constant factor increase (6.34x as an upper bound, and 4× for typical parameters) in the number of samples required for accurate auditing. Our technical contributions, combined with ongoing legal and policy efforts, can enable public oversight into how social media platforms affect individuals and society by moving past the privacy-vs-transparency hurdle.  more » « less
Award ID(s):
1943584 1956435 1916153 2333448
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the ACM on Human-Computer Interaction
Page Range / eLocation ID:
1 to 33
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The concern regarding users’ data privacy has risen to its highest level due to the massive increase in communication platforms, social networking sites, and greater users’ participation in online public discourse. An increasing number of people exchange private information via emails, text messages, and social media without being aware of the risks and implications. Researchers in the field of Natural Language Processing (NLP) have concentrated on creating tools and strategies to identify, categorize, and sanitize private information in text data since a substantial amount of data is exchanged in textual form. However, most of the detection methods solely rely on the existence of pre-identified keywords in the text and disregard the inference of underlying meaning of the utterance in a specific context. Hence, in some situations these tools and algorithms fail to detect disclosure, or the produced results are miss classified. In this paper, we propose a multi-input, multi-output hybrid neural network which utilizes transfer-learning, linguistics, and metadata to learn the hidden patterns. Our goal is to better classify disclosure/non-disclosure content in terms of the context of situation. We trained and evaluated our model on a human-annotated ground truth dataset, containing a total of 5,400 tweets. The results show that the proposed model was able to identify privacy disclosure through tweets with an accuracy of 77.4% while classifying the information type of those tweets with an impressive accuracy of 99%, by jointly learning for two separate tasks.

    more » « less
  2. Background Social networks such as Twitter offer the clinical research community a novel opportunity for engaging potential study participants based on user activity data. However, the availability of public social media data has led to new ethical challenges about respecting user privacy and the appropriateness of monitoring social media for clinical trial recruitment. Researchers have voiced the need for involving users’ perspectives in the development of ethical norms and regulations. Objective This study examined the attitudes and level of concern among Twitter users and nonusers about using Twitter for monitoring social media users and their conversations to recruit potential clinical trial participants. Methods We used two online methods for recruiting study participants: the open survey was (1) advertised on Twitter between May 23 and June 8, 2017, and (2) deployed on TurkPrime, a crowdsourcing data acquisition platform, between May 23 and June 8, 2017. Eligible participants were adults, 18 years of age or older, who lived in the United States. People with and without Twitter accounts were included in the study. Results While nearly half the respondents—on Twitter (94/603, 15.6%) and on TurkPrime (509/603, 84.4%)—indicated agreement that social media monitoring constitutes a form of eavesdropping that invades their privacy, over one-third disagreed and nearly 1 in 5 had no opinion. A chi-square test revealed a positive relationship between respondents’ general privacy concern and their average concern about Internet research (P<.005). We found associations between respondents’ Twitter literacy and their concerns about the ability for researchers to monitor their Twitter activity for clinical trial recruitment (P=.001) and whether they consider Twitter monitoring for clinical trial recruitment as eavesdropping (P<.001) and an invasion of privacy (P=.003). As Twitter literacy increased, so did people’s concerns about researchers monitoring Twitter activity. Our data support the previously suggested use of the nonexceptionalist methodology for assessing social media in research, insofar as social media-based recruitment does not need to be considered exceptional and, for most, it is considered preferable to traditional in-person interventions at physical clinics. The expressed attitudes were highly contextual, depending on factors such as the type of disease or health topic (eg, HIV/AIDS vs obesity vs smoking), the entity or person monitoring users on Twitter, and the monitored information. Conclusions The data and findings from this study contribute to the critical dialogue with the public about the use of social media in clinical research. The findings suggest that most users do not think that monitoring Twitter for clinical trial recruitment constitutes inappropriate surveillance or a violation of privacy. However, researchers should remain mindful that some participants might find social media monitoring problematic when connected with certain conditions or health topics. Further research should isolate factors that influence the level of concern among social media users across platforms and populations and inform the development of more clear and consistent guidelines. 
    more » « less
  3. Development of a comprehensive legal privacy framework in the United States should be based on identification of the common deficiencies of privacy policies. We attempt to delineate deficiencies by critically analyzing the privacy policies of mobile apps, application suites, social networks, Internet Service Providers, and Internet-of-Things devices. Whereas many studies have examined readability of privacy policies, few have specifically identified the information that should be provided in privacy policies but is not. Privacy legislation invariably starts a definition of personally identifiable information. We find that privacy policies’ definitions of personally identifiable information are far too restrictive, excluding information that does not itself identify a person but which can be used to reasonably identify a person, and excluding information paired with a device identifier which can be reasonably linked to a person. Legislation should define personally identifiable information to include such information, and should differentiate between information paired with a name versus information paired with a device identifier. Privacy legislation often excludes anonymous and de-identified information from notice and choice requirements. We find that privacy policies’ descriptions of anonymous and de-identified information are far too broad, including information paired with advertising identifiers. Computer science has repeatedly demonstrated that such information is reasonably linkable. Legislation should define these categories of information to align with technological abilities. Legislation should also not exempt de-identified information from notice requirements, to increase transparency. Privacy legislation relies heavily on notice requirements. We find that, because privacy policies’ disclosures of the uses of personal information are disconnected from their disclosures about the types of personal information collected, we are often unable to determine which types of information are used for which purposes. Often, we cannot determine whether location or web browsing history is used solely for functional purposes or also for advertising. Legislation should require the disclosure of the purposes for each type of personal information collected. We also find that, because privacy policies disclosures of sharing of personal information are disconnected from their disclosures about the types of personal information collected, we are often unable to determine which types of information are shared. Legislation should require the disclosure of the types of personal information shared. Finally, privacy legislation relies heavily on user choice. We find that free services often require the collection and sharing of personal information. As a result, users often have no choices. We find that whereas some paid services afford users a wide variety of choices, paid services in less competitive sectors often afford users few choices over use and sharing of personal information for purposes unrelated to the service. As a result, users are often unable to dictate which types of information they wish to allow to be shared, and which types they wish to allow to be used for advertising. Legislation should differentiate between take-it-or-leave it, opt-out, and opt-in approaches based on the type of use and on whether the information is shared. Congress should consider whether user choices should be affected by the presence of market power. 
    more » « less
  4. Illicit drug trafficking via social media sites such as Instagram have become a severe problem, thus drawing a great deal of attention from law enforcement and public health agencies. How to identify illicit drug dealers from social media data has remained a technical challenge for the following reasons. On the one hand, the available data are limited because of privacy concerns with crawling social media sites; on the other hand, the diversity of drug dealing patterns makes it difficult to reliably distinguish drug dealers from common drug users. Unlike existing methods that focus on posting-based detection, we propose to tackle the problem of illicit drug dealer identification by constructing a large-scale multimodal dataset named Identifying Drug Dealers on Instagram (IDDIG). Nearly 4,000 user accounts, of which more than 1,400 are drug dealers, have been collected from Instagram with multiple data sources including post comments, post images, homepage bio, and homepage images. We then design a quadruple-based multimodal fusion method to combine the multiple data sources associated with each user account for drug dealer identification. Experimental results on the constructed IDDIG dataset demonstrate the effectiveness of the proposed method in identifying drug dealers (almost 95% accuracy). Moreover, we have developed a hashtag-based community detection technique for discovering evolving patterns, especially those related to geography and drug types. 
    more » « less
  5. Social media platforms make trade-offs in their design and policy decisions to attract users and stand out from other platforms. These decisions are influenced by a number of considerations, e.g. what kinds of content moderation to deploy or what kinds of resources a platform has access to. Their choices play into broader political tensions; social media platforms are situated within a social context that frames their impact, and they can have politics through their design that enforce power structures and serve existing authorities. We turn to Pillowfort, a small social media platform, to examine these political tensions as a case study. Using a discourse analysis, we examine public discussion posts between staff and users as they negotiate the site's development over a period of two years. Our findings illustrate the tensions in navigating the politics that users bring with them from previous platforms, the difficulty of building a site's unique identity and encouraging commitment, and examples of how design decisions can both foster and break trust with users. Drawing from these findings, we discuss how the success and failure of new social media platforms are impacted by political influences on design and policy decisions. 
    more » « less