This study provides preliminary insights into the linguistic features that contribute to Internet censorship in mainland China. We collected a corpus of 344 censored and uncensored microblog posts that were published on Sina Weibo and built a Naive Bayes classifier based on linguistic, topic-independent features. The classifier achieves 79.34% accuracy in predicting whether a blog post would be censored on Sina Weibo.
                            Linguistic Fingerprints of Internet Censorship: the Case of Sina Weibo
                        
                    
    
            This paper studies how the linguistic components of blogposts collected from Sina Weibo, a Chinese microblogging platform, might affect the blogposts’ likelihood of being censored. Our results are consistent with the Collective Action Potential (CAP) theory of King et al. (2013), which states that a blogpost’s potential to cause riot or assembly in real life is the key determinant of whether it gets censored. Although there is no definitive measure of this construct, the linguistic features that we identify as discriminative are consistent with the CAP theory. We build a classifier that significantly outperforms non-expert humans in predicting whether a blogpost will be censored. The crowdsourcing results suggest that while humans tend to see censored blogposts as more controversial and more likely to trigger action in real life than their uncensored counterparts, they generally cannot guess better than our model when it comes to ‘reading the mind’ of the censors in deciding whether a blogpost should be censored. We do not claim that censorship is determined only by linguistic features; many other factors contribute to censorship decisions. The focus of the present paper is on the linguistic form of blogposts. Our work suggests that it is possible to use linguistic properties of social media posts to automatically predict whether they will be censored.
- Award ID(s):
- 1704113
- PAR ID:
- 10463470
- Date Published:
- Journal Name:
- Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            Internet censorship imposes restrictions on what information can be publicized or viewed on the Internet. According to Freedom House’s annual Freedom on the Net report, more than half the world’s Internet users now live in a place where the Internet is censored or restricted. China has built the world’s most extensive and sophisticated online censorship system. In this paper, we describe a new corpus of censored and uncensored social media posts from a Chinese microblogging website, Sina Weibo, collected by tracking posts that mention ‘sensitive’ topics or are authored by ‘sensitive’ users. We use this corpus to build a neural network classifier to predict censorship. Our model achieves 88.50% accuracy using only linguistic features. We discuss these features in detail and hypothesize that they could potentially be used for censorship circumvention.
- 
            This paper investigates censorship from a linguistic perspective. We collect a corpus of censored and uncensored posts on a number of topics and build a classifier that predicts censorship decisions independent of discussion topics. Our investigation reveals that the strongest linguistic indicator of censored content in our corpus is its readability.
- 
            This paper investigates the relationship between demographics and the frequency of censored posts (weibos) on Sina Weibo. Our results indicate that demographics such as location, gender, and paid-for features do not provide a good degree of predictive power but help explain how censorship is applied on social media. Using a dataset of 226 million weibos collected in 2012, we apply a binomial regression model to evaluate the predictive quality of user demographics in identifying candidates that may be targeted for censorship. Our results suggest that male users who are verified (that is, who pay for mobile and security features) are more likely to be censored than females or users who are not verified. In addition, users from provinces such as Hong Kong, Macao, and Beijing are more heavily censored than users from any other province in China over the same period.
- 
            Information-centric network (ICN) designs are susceptible to censorship, especially to packet filtering based on content names. Previous works on censorship circumvention in ICN either have high processing times or use proxies that can be blocked easily by the censoring agents. We design a new censorship circumvention approach for ICN using router redirection that enables a client in a censored region to retrieve blocked content from a censored destination without the censoring agent detecting the use of a censorship circumvention tool. We conduct ndnSIM-based simulation experiments showing that our approach is practical, with only a modest end-to-end delay overhead.
 An official website of the United States government