There is growing interest in using social media content for Natural Language Processing applications. However, it is not easy to computationally identify the most relevant set of tweets related to a specific event. Challenging semantics, coupled with the varied ways natural language is used on social media, make it difficult to retrieve the most relevant data from any social media outlet. This paper demonstrates a way to represent the changing semantics of Twitter within the context of a crisis event, specifically tweets posted during Hurricane Irma. These methods can be used to identify the most relevant corpus of text for analyzing a specific incident such as a hurricane. Using an implementation of the Word2Vec neural network training method to create word embeddings, this paper will: discuss how the relative meaning of words changes as events unfold; present a mechanism for scoring tweets based upon dynamic, relative context relatedness; and show that similarity between words is not necessarily static. We present different methods for training the Word2Vec vector model to identify the most relevant tweets for any search query. We explored the impact of tuning parameters such as Word Window Size, Minimum Word Frequency, Hidden Layer Dimensionality, and Negative Sampling on model performance. The window containing the local maximum of AU_ROC for each parameter serves as a guide for other studies that apply these methods to social media data analysis.
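The abstract does not spell out the scoring mechanism, so the following is only a minimal sketch of one way tweets could be ranked by relatedness to a query word using word embeddings. The three-dimensional vectors in `EMBEDDINGS`, the function names, and the mean-cosine scoring rule are all assumptions made for illustration; in practice the vectors would come from a trained Word2Vec model, and the paper's own scoring may differ.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# Real vectors would come from a Word2Vec model trained on the tweet corpus.
EMBEDDINGS = {
    "hurricane": [0.9, 0.1, 0.0],
    "flood": [0.8, 0.2, 0.1],
    "evacuate": [0.7, 0.3, 0.0],
    "pizza": [0.0, 0.1, 0.9],
    "movie": [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_tweet(tweet, query, embeddings=EMBEDDINGS):
    """Mean cosine similarity between the query word and each
    in-vocabulary token of the tweet (assumed scoring rule)."""
    query_vec = embeddings[query]
    sims = [
        cosine(embeddings[tok], query_vec)
        for tok in tweet.lower().split()
        if tok in embeddings
    ]
    # Tweets with no in-vocabulary tokens get a neutral score of 0.0.
    return sum(sims) / len(sims) if sims else 0.0


related = score_tweet("flood evacuate now", "hurricane")
unrelated = score_tweet("pizza movie night", "hurricane")
```

Because the embeddings would be retrained as the event unfolds, the same tweet could receive different scores at different times, which is the kind of dynamic relatedness the abstract describes.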
Self-Supervised Euphemism Detection and Identification for Content Moderation
Fringe groups and organizations have a long history of using euphemisms (ordinary-sounding words with a secret meaning) to conceal what they are discussing. Nowadays, one common use of euphemisms is to evade content moderation policies enforced by social media platforms. Existing tools for enforcing policy automatically rely on keyword searches for words on a "ban list", but these are notoriously imprecise: even when limited to swearwords, they can still cause embarrassing false positives. When a commonly used ordinary word acquires a euphemistic meaning, adding it to a keyword-based ban list is hopeless: consider "pot" (storage container or marijuana?) or "heater" (household appliance or firearm?). The current generation of social media companies instead hire staff to check posts manually, but this is expensive, inhumane, and not much more effective. It is usually apparent to a human moderator that a word is being used euphemistically, but they may not know what the secret meaning is, and therefore whether the message violates policy. Also, when a euphemism is banned, the group that used it need only invent another one, leaving moderators one step behind. This paper will demonstrate unsupervised algorithms that, by analyzing words in their sentence-level context, can both detect words being used euphemistically, and identify the secret meaning of each word. Compared to the existing state of the art, which uses context-free word embeddings, our algorithm for detecting euphemisms achieves 30-400% higher detection accuracies of unlabeled euphemisms in a text corpus. Our algorithm for revealing euphemistic meanings of words is the first of its kind, as far as we are aware. In the arms race between content moderators and policy evaders, our algorithms may help shift the balance in the direction of the moderators.
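The weakness of keyword ban lists noted above can be seen in a small sketch. This is not the paper's algorithm; it only illustrates the baseline being criticized. The ban list, the function name, and the sample messages are hypothetical.

```python
import re

# Hypothetical ban list, for illustration only.
BAN_LIST = {"pot", "heater"}


def flag_by_keyword(message: str) -> bool:
    """Return True if any ban-list word appears as a whole token.

    The check ignores context, so it cannot tell a literal use
    of a word from a euphemistic one.
    """
    tokens = re.findall(r"[a-z']+", message.lower())
    return any(tok in BAN_LIST for tok in tokens)


messages = [
    "I left the soup pot on the stove",   # literal sense
    "anyone selling pot near campus",     # euphemistic sense
    "the heater in my flat is broken",    # literal sense
]
flags = [flag_by_keyword(m) for m in messages]
```

All three messages are flagged, although only the second uses a word euphemistically. Separating the two senses requires looking at the surrounding sentence, which is the motivation for the context-aware approach the abstract describes.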
- Award ID(s):
- 1720268
- PAR ID:
- 10292066
- Date Published:
- Journal Name:
- 2021 IEEE Symposium on Security and Privacy (SP)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
As content moderation becomes a central aspect of all social media platforms and online communities, interest has grown in how to make moderation decisions contestable. On social media platforms where individual communities moderate their own activities, the responsibility to address user appeals falls on volunteers from within the community. While there is a growing body of work devoted to understanding and supporting the volunteer moderators' workload, little is known about their practice of handling user appeals. Through a collaborative and iterative design process with Reddit moderators, we found that moderators spend considerable effort in investigating user ban appeals and desired to directly engage with users and retain their agency over each decision. To fulfill their needs, we designed and built AppealMod, a system that induces friction in the appeals process by asking users to provide additional information before their appeals are reviewed by human moderators. In addition to giving moderators more information, we expected the friction in the appeal process would lead to a selection effect among users, with many insincere and toxic appeals being abandoned before getting any attention from human moderators. To evaluate our system, we conducted a randomized field experiment in a Reddit community of over 29 million users that lasted for four months. As a result of the selection effect, moderators viewed only 30% of initial appeals and less than 10% of the toxically worded appeals; yet they granted roughly the same number of appeals when compared with the control group. Overall, our system is effective at reducing moderator workload and minimizing their exposure to toxic content while honoring their preference for direct engagement and agency in appeals.
-
Research suggests that marginalized social media users face disproportionate content moderation and removal. However, when content is removed or accounts are suspended, the processes governing content moderation are largely invisible, making it difficult to assess content moderation bias. To study this bias, we conducted a digital ethnography of marginalized users on Reddit’s /r/FTM subreddit and Twitch’s “Just Chatting” and “Pools, Hot Tubs, and Beaches” categories, observing content moderation visibility in real time. We found that on Reddit, a text-based platform, platform tools make content moderation practices invisible to users, but moderators make their practices visible through communication with users. Yet on Twitch, a live chat and streaming platform, content moderation practices are visible in channel live chats, “unban appeal” streams, and “back from my ban” streams. Our ethnography shows how content moderation visibility differs in important ways between social media platforms, at times harming those who must see offensive content and at other times allowing for increased platform accountability.
-
Political and social scientists have been relying extensively on keywords such as hashtags to mine social movement data from social media sites, particularly Twitter. Yet, prior work demonstrates that unrepresentative keyword sets can lead to flawed research conclusions. Numerous keyword expansion methods have been proposed to increase the comprehensiveness of keywords, but systematic evaluations of these methods have been lacking. Our paper fills this gap. We evaluate five diverse keyword expansion techniques (or pipelines) on five representative social movements across two distinct activity levels. Our results guide researchers who aim to use social media keyword searches to mine data. For instance, we show that word embedding-based methods significantly outperform other, newer and more complex approaches when movements are in normal activity periods. These methods are also less computationally intensive. More importantly, we also observe that no single pipeline can identify much more than half of all movement-related tweets when these movements are at their peak offline mobilization period. However, coverage can increase significantly when more than one pipeline is used. This is true even when the pipelines are selected at random.
-
It is well-known that children rapidly learn words, following a range of heuristics. What is less well appreciated is that – because most words are polysemous and have multiple meanings (e.g., ‘glass’ can label a material and drinking vessel) – children will often be learning a new meaning for a known word, rather than an entirely new word. Across four experiments we show that children flexibly adapt a well-known heuristic – the shape bias – when learning polysemous words. Consistent with previous studies, we find that children and adults preferentially extend a new object label to other objects of the same shape. But we also find that when a new word for an object (‘a gup’) has previously been used to label the material composing that object (‘some gup’), children and adults override the shape bias, and are more likely to extend the object label by material (Experiments 1 and 3). Further, we find that, just as an older meaning of a polysemous word constrains interpretations of a new word meaning, encountering a new word meaning leads learners to update their interpretations of an older meaning (Experiment 2). Finally, we find that these effects only arise when learners can perceive that a word’s meanings are related, not when they are arbitrarily paired (Experiment 4). Together, these findings show that children can exploit cues from polysemy to infer how new word meanings should be extended, suggesting that polysemy may facilitate word learning and invite children to construe categories in new ways.

