Creators/Authors contains: "Huang, Xiaolei"

  1. Guidi, Barbara (Ed.)
    The COVID-19 pandemic brought widespread attention to an “infodemic” of potential health misinformation. This claim has not been assessed against evidence. We evaluated whether health misinformation became more common during the pandemic. We gathered about 325 million posts sharing URLs from Twitter and Facebook during the beginning of the pandemic (March 8-May 1, 2020) and compared them with posts from the same period in 2019. We relied on source credibility as an accepted proxy for misinformation across this database. Human annotators also coded a subsample of 3000 posts with URLs for misinformation. Posts about COVID-19 were 0.37 times as likely to link to “not credible” sources and 1.13 times more likely to link to “more credible” sources than prior to the pandemic. Posts linking to “not credible” sources were 3.67 times more likely to include misinformation compared to posts from “more credible” sources. Thus, during the earliest stages of the pandemic, when claims of an infodemic emerged, social media contained proportionally less misinformation than expected based on the prior year. Our results suggest that widespread health misinformation is not unique to COVID-19. Rather, it is a systemic feature of online health communication that can adversely impact public health behaviors and must therefore be addressed.
  2. Language use varies across different demographic factors, such as gender, age, and geographic location. However, most existing document classification methods ignore demographic variability. In this study, we examine empirically how text data can vary across four demographic factors: gender, age, country, and region. We propose a multitask neural model to account for demographic variations via adversarial training. In experiments on four English-language social media datasets, we find that classification performance improves when adapting for user factors. 
  3. Language usage can change across periods of time, but document classifiers are usually trained and tested on corpora spanning multiple years without considering temporal variations. This paper describes two complementary ways to adapt classifiers to shifts across time. First, we show that diachronic word embeddings, which were originally developed to study language change, can also improve document classification, and we show a simple method for constructing this type of embedding. Second, we propose a time-driven neural classification model inspired by methods for domain adaptation. Experiments on six corpora show how these methods can make classifiers more robust over time. 
  4. Many corpora span broad periods of time. Language processing models trained during one time period may not work well in future time periods, and the best model may depend on specific times of year (e.g., people might describe hotels differently in reviews during the winter versus the summer). This study investigates how document classifiers trained on documents from certain time intervals perform on documents from other time intervals, considering both seasonal intervals (intervals that repeat across years, e.g., winter) and non-seasonal intervals (e.g., specific years). We show experimentally that classification performance varies over time, and that performance can be improved by using a standard domain adaptation approach to adjust for changes in time. 
  5. Introduction: The Centers for Disease Control and Prevention (CDC) spends significant time and resources to track influenza vaccination coverage each influenza season using national surveys. Emerging data from social media provide an alternative solution for surveillance of influenza vaccination coverage at both national and local levels in near real time. Objectives: This study aimed to characterise and analyse the vaccinated population from temporal, demographical and geographical perspectives using automatic classification of vaccination-related Twitter data. Methods: In this cross-sectional study, we continuously collected tweets containing both influenza-related terms and vaccine-related terms covering four consecutive influenza seasons from 2013 to 2017. We created a machine learning classifier to identify relevant tweets, then evaluated the approach by comparing to data from the CDC’s FluVaxView. We limited our analysis to tweets geolocated within the USA. Results: We assessed 1 124 839 tweets. We found a strong correlation of 0.799 between monthly Twitter estimates and CDC data, with correlations as high as 0.950 in individual influenza seasons. We also found that our approach obtained geographical correlations of 0.387 at the US state level and 0.467 at the regional level. Finally, we found a higher level of influenza vaccine tweets among female users than male users, also consistent with the results of CDC surveys on vaccine uptake. Conclusion: Significant correlations between Twitter data and CDC data show the potential of using social media for vaccination surveillance. Temporal variability is captured better than geographical and demographical variability. We discuss potential paths forward for leveraging this approach.
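
The fourth abstract above mentions adjusting for changes in time with "a standard domain adaptation approach" but does not name it. As a minimal sketch, assuming the approach resembles feature augmentation (each feature gets one shared copy and one copy specific to its time interval), the idea could look like the following. The function names `season_of` and `augment_features` and the month-to-season mapping are hypothetical and are not taken from the paper.

```python
from collections import Counter


def season_of(month):
    """Map a month number (1-12) to a seasonal interval (assumed mapping)."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


def augment_features(tokens, interval):
    """Return bag-of-words counts duplicated into a shared ("general")
    copy and a copy tagged with the document's time interval."""
    augmented = {}
    for token, count in Counter(tokens).items():
        # Shared copy: lets a classifier learn weights that hold in all intervals.
        augmented[f"general:{token}"] = count
        # Interval copy: lets a classifier learn interval-specific deviations.
        augmented[f"{interval}:{token}"] = count
    return augmented


# Example: a hotel review written in July (illustrative tokens only).
features = augment_features(["warm", "pool", "warm"], season_of(7))
```

The resulting dictionary can be passed to any linear classifier that accepts sparse feature dictionaries; the shared weights capture what is stable over time, while the interval-tagged weights absorb seasonal or yearly shifts.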