Title: Neural Network Prediction of Censorable Language
Internet censorship imposes restrictions on what information can be publicized or viewed on the Internet. According to Freedom House’s annual Freedom on the Net report, more than half the world’s Internet users now live in a place where the Internet is censored or restricted. China has built the world’s most extensive and sophisticated online censorship system. In this paper, we describe a new corpus of censored and uncensored social media tweets from a Chinese microblogging website, Sina Weibo, collected by tracking posts that mention ‘sensitive’ topics or are authored by ‘sensitive’ users. We use this corpus to build a neural network classifier to predict censorship. Our model achieves an accuracy of 88.50% using only linguistic features. We discuss these features in detail and hypothesize that they could potentially be used for censorship circumvention.
Award ID(s):
1704113
PAR ID:
10463459
Journal Name:
Proceedings of the 3rd Workshop on NLP and Computational Social Science (NLP+CSS) held in conjunction with NAACL 2019
Sponsoring Org:
National Science Foundation
More Like this
  1. This study provides preliminary insights into the linguistic features that contribute to Internet censorship in mainland China. We collected a corpus of 344 censored and uncensored microblog posts that were published on Sina Weibo and built a Naive Bayes classifier based on linguistic, topic-independent features. The classifier achieves 79.34% accuracy in predicting whether a blog post would be censored on Sina Weibo.
  2. This paper investigates the relationship between demographics and the frequency of censored posts (weibos) on Sina Weibo. Our results indicate that demographics such as location, gender, and paid-for features do not provide a good degree of predictive power but help explain how censorship is applied on social media. Using a dataset of 226 million weibos collected in 2012, we apply a binomial regression model to evaluate how well user demographics identify candidates that may be targeted for censorship. Our results suggest that male users who are verified (that is, who pay for mobile and security features) are more likely to be censored than female users or users who are not verified. In addition, users from provinces such as Hong Kong, Macao, and Beijing are more heavily censored than users from any other province in China over the same period.
  3. This paper investigates censorship from a linguistic perspective. We collect a corpus of censored and uncensored posts on a number of topics and build a classifier that predicts censorship decisions independent of discussion topic. Our investigation reveals that the strongest linguistic indicator of censored content in our corpus is its readability.
  4. Internet censorship is pervasive, with significant effort dedicated to understanding what is censored, and where. Prior censorship measurements, however, have identified significant inconsistencies in their results; experiments show unexplained non-deterministic behaviors thought to be caused by censor load, end-host geographic diversity, or incomplete censorship. These inconsistencies impede reliable, repeatable, and correct understanding of global censorship. In this work we investigate the extent to which Equal-cost Multi-path (ECMP) routing is the cause of these inconsistencies, developing methods to measure and compensate for them. We find that ECMP routing significantly changes observed censorship across protocols, censor mechanisms, and in 18 countries. We identify that previously observed non-determinism or regional variations are attributable to measurements between fixed end-hosts taking different routes based on Flow-ID; i.e., the choice of intra-subnet source IP or ephemeral source port leads to differences in observed censorship. To achieve this, we develop new route-stable censorship measurement methods that allow consistent measurement of DNS, HTTP, and HTTPS censorship. We find that ECMP routing yields censorship changes across 42% of IPs and 51% of ASes, but that the impact is not uniform. We develop an application-level traceroute tool to construct network paths using specific censored packets, leading us to identify numerous causes of the behavior, ranging from likely failed infrastructure to routes to the same end-host taking geographically diverse paths that experience differences in censorship en route. Finally, we compare our results to prior global measurements, demonstrating that prior studies were possibly impacted by this phenomenon and that specific results are explainable by ECMP routing. Our work points to methods for improving future studies, reducing inconsistencies, and increasing repeatability.
  5. This paper studies how the linguistic components of blogposts collected from Sina Weibo, a Chinese microblogging platform, might affect the blogposts’ likelihood of being censored. Our results are consistent with the Collective Action Potential (CAP) theory of King et al. (2013), which states that a blogpost’s potential to cause a riot or assembly in real life is the key determinant of whether it gets censored. Although there is no definitive measure of this construct, the linguistic features that we identify as discriminative are consistent with the CAP theory. We build a classifier that significantly outperforms non-expert humans in predicting whether a blogpost will be censored. The crowdsourcing results suggest that while humans tend to see censored blogposts as more controversial and more likely to trigger action in real life than their uncensored counterparts, they generally cannot make a better guess than our model when it comes to ‘reading the mind’ of the censors in deciding whether a blogpost should be censored. We do not claim that censorship is determined only by linguistic features; many other factors contribute to censorship decisions. The focus of the present paper is on the linguistic form of blogposts. Our work suggests that it is possible to use linguistic properties of social media posts to automatically predict whether they are going to be censored.