Title: Raiders of the lost kek: 3.5 years of augmented 4chan posts from the politically incorrect board
This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016–November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.
Award ID(s):
1942610
PAR ID:
10212018
Journal Name:
Proceedings of the International AAAI Conference on Weblogs and Social Media
ISSN:
2334-0770
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Crises such as the COVID-19 pandemic continuously threaten our world and emotionally affect billions of people worldwide in distinct ways. Understanding the triggers leading to people’s emotions is of crucial importance. Social media posts can be a good source of such analysis, yet these texts tend to be charged with multiple emotions, with triggers scattered across multiple sentences. This paper takes a novel angle, namely, emotion detection and trigger summarization, aiming to both detect perceived emotions in text, and summarize events and their appraisals that trigger each emotion. To support this goal, we introduce CovidET (Emotions and their Triggers during Covid-19), a dataset of ~1,900 English Reddit posts related to COVID-19, which contains manual annotations of perceived emotions and abstractive summaries of their triggers described in the post. We develop strong baselines to jointly detect emotions and summarize emotion triggers. Our analyses show that CovidET presents new challenges in emotion-specific summarization, as well as multi-emotion detection in long social media posts.
  2. Social media platforms provide users with various ways of interacting with each other, such as commenting, reacting to posts, sharing content, and uploading pictures. Facebook is one of the most popular platforms, and its users frequently share and reshare posts, including research articles. Moreover, the reactions feature on Facebook allows users to express their feelings towards the content they view, providing valuable data for analysis. This study aims to predict the emotional impact of Facebook posts relating to research articles. We collected data on Facebook posts related to various scientific research domains, including Health Sciences, Social Sciences, Dentistry, Arts, and Humanities. We observed Facebook users’ reactions towards research articles and posts and found that ‘Like’ reactions were the most common. We also noticed that research articles from the Dentistry research domain received a lot of ‘Haha’ reactions. We used machine learning models to predict the sentiment of Facebook posts related to research articles. We used features such as the research article’s title sentiment, abstract sentiment, abstract length, author count, and research domain to build the models. We used five classifiers: Random Forest, Decision Tree, K-Nearest Neighbors, Logistic Regression, and Naïve Bayes. The models were evaluated using accuracy, precision, recall, and F-1 score metrics. The Random Forest classifier was the best model for two- and three-class labels, achieving accuracy measures of 86% and 66%, respectively. We also evaluated the feature importance for the Random Forest model and found that the sentiment of the research article’s title is crucial in predicting the sentiment of the Facebook post. This study has substantial implications for public engagement in science-related messages. 
     The emotional reactions of Facebook users towards research articles and posts can provide valuable insights into public engagement in science, and predicting the emotional impact of Facebook posts related to research articles can help researchers understand how the public perceives scientific research. The findings of the study can aid researchers in effectively communicating their research and engaging the public in scientific discourse.
  3. Lynch, Collin F.; Merceron, Agathe; Desmarais, Michel; Nkambou, Roger (Ed.)
    Discussion forums are the primary channel for social interaction and knowledge sharing in Massive Open Online Courses (MOOCs). Many researchers have analyzed social connections on MOOC discussion forums. However, to the best of our knowledge, there is little research that distinguishes between the types of connections students make based upon the content of their forum posts. We analyze this effect by distinguishing on- and off-topic posts and comparing their respective social networks. We then analyze how these types of posts and their social connections can be used to predict the students’ final course performance. Pursuant to this work we developed a binary classifier to identify on- and off-topic posts and applied our analysis with the hand-coded and predicted labels. We conclude that the post type does affect the relationship between students' learning outcomes and those of their closest neighbors or the members of their clustered communities.
  4. Budak, Ceren; Cha, Meeyoung; Quercia, Daniele; Xie, Lexing (Ed.)
    Despite the influence that image-based communication has on online discourse, the role played by images in disinformation is still not well understood. In this paper, we present the first large-scale study of fauxtography, analyzing the use of manipulated or misleading images in news discussion on online communities. First, we develop a computational pipeline geared to detect fauxtography, and identify over 61k instances of fauxtography discussed on Twitter, 4chan, and Reddit. Then, we study how posting fauxtography affects engagement of posts on social media, finding that posts containing it receive more interactions in the form of re-shares, likes, and comments. Finally, we show that fauxtography images are often turned into memes by Web communities. Our findings show that effective mitigation against disinformation needs to take images into account, and highlight a number of challenges in dealing with image-based disinformation.
  5. The public interest in accurate scientific communication, underscored by recent public health crises, highlights how content often loses critical pieces of information as it spreads online. However, multi-platform analyses of this phenomenon remain limited due to challenges in data collection. Collecting mentions of research tracked by Altmetric LLC, we examine information retention in over 4 million online posts referencing 9,765 of the most-mentioned scientific articles across blog sites, Facebook, news sites, Twitter, and Wikipedia. To do so, we present a burst-based framework for examining online discussions about science over time and across different platforms. To measure information retention, we develop a keyword-based computational measure comparing an online post to the scientific article’s abstract. We evaluate our measure using ground truth data labeled by within-field experts. We highlight three main findings: first, we find a strong tendency towards low levels of information retention, following a distinct trajectory of loss except when bursts of attention begin on social media. Second, platforms show significant differences in information retention. Third, sequences involving more platforms tend to be associated with higher information retention. These findings highlight a strong tendency towards information loss over time—posing a critical concern for researchers, policymakers, and citizens alike—but suggest that multi-platform discussions may improve information retention overall. 