Title: Comparison of Social Media, Syndromic Surveillance, and Microbiologic Acute Respiratory Infection Data: Observational Study
Background Internet data can be used to improve infectious disease models. However, the representativeness and individual-level validity of internet-derived measures are largely unexplored, as studying them requires ground truth data.
Objective This study sought to identify relationships between Web-based behaviors and/or conversation topics and health status using a ground truth, survey-based dataset.
Methods This study leveraged a unique dataset of self-reported surveys, microbiological laboratory tests, and social media data from the same individuals to assess the validity of individual-level constructs pertaining to influenza-like illness in social media data. Logistic regression models were used to identify illness in Twitter posts using user posting behaviors and topic model features extracted from users’ tweets.
Results Of 396 original study participants, only 81 met the inclusion criteria for this study. Of these participants’ tweets, we identified only two instances that were related to health and occurred within 2 weeks (before or after) of a survey indicating symptoms. It was not possible to predict when participants reported symptoms using features derived from topic models (area under the curve [AUC]=0.51; P=.38), though it was possible using behavior features, albeit with a very small effect size (AUC=0.53; P≤.001). Individual symptoms were generally not predictable either. The study sample and a random sample from Twitter were predictably different on held-out data (AUC=0.67; P≤.001), meaning that the content posted by people who participated in this study was predictably different from that posted by random Twitter users. Individuals in the random sample and the study (GoViral) sample used Twitter with similar frequencies (similar @ mentions, number of tweets, and number of retweets; AUC=0.50; P=.19).
Conclusions To our knowledge, this is the first instance of an attempt to use a ground truth dataset to validate infectious disease observations in social media data. The lack of signal, the lack of predictability among behaviors or topics, and the demonstrated volunteer bias in the study population are important findings for the large and growing body of disease surveillance using internet-sourced data.
Award ID(s):
1643576
NSF-PAR ID:
10146253
Journal Name:
JMIR Public Health and Surveillance
Volume:
6
Issue:
2
ISSN:
2369-2960
Page Range / eLocation ID:
e14986
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background Social networks such as Twitter offer the clinical research community a novel opportunity for engaging potential study participants based on user activity data. However, the availability of public social media data has led to new ethical challenges about respecting user privacy and the appropriateness of monitoring social media for clinical trial recruitment. Researchers have voiced the need for involving users’ perspectives in the development of ethical norms and regulations. Objective This study examined the attitudes and level of concern among Twitter users and nonusers about using Twitter for monitoring social media users and their conversations to recruit potential clinical trial participants. Methods We used two online methods for recruiting study participants: the open survey was (1) advertised on Twitter between May 23 and June 8, 2017, and (2) deployed on TurkPrime, a crowdsourcing data acquisition platform, between May 23 and June 8, 2017. Eligible participants were adults, 18 years of age or older, who lived in the United States. People with and without Twitter accounts were included in the study. Results While nearly half of the respondents, recruited on Twitter (94/603, 15.6%) and on TurkPrime (509/603, 84.4%), agreed that social media monitoring constitutes a form of eavesdropping that invades their privacy, over one-third disagreed and nearly 1 in 5 had no opinion. A chi-square test revealed a positive relationship between respondents’ general privacy concern and their average concern about Internet research (P<.005). We found associations between respondents’ Twitter literacy and their concerns about the ability of researchers to monitor their Twitter activity for clinical trial recruitment (P=.001) and whether they consider Twitter monitoring for clinical trial recruitment as eavesdropping (P<.001) and an invasion of privacy (P=.003).
As Twitter literacy increased, so did people’s concerns about researchers monitoring Twitter activity. Our data support the previously suggested use of the nonexceptionalist methodology for assessing social media in research, insofar as social media-based recruitment does not need to be considered exceptional and, for most, it is considered preferable to traditional in-person interventions at physical clinics. The expressed attitudes were highly contextual, depending on factors such as the type of disease or health topic (eg, HIV/AIDS vs obesity vs smoking), the entity or person monitoring users on Twitter, and the monitored information. Conclusions The data and findings from this study contribute to the critical dialogue with the public about the use of social media in clinical research. The findings suggest that most users do not think that monitoring Twitter for clinical trial recruitment constitutes inappropriate surveillance or a violation of privacy. However, researchers should remain mindful that some participants might find social media monitoring problematic when connected with certain conditions or health topics. Further research should isolate factors that influence the level of concern among social media users across platforms and populations and inform the development of clearer and more consistent guidelines.
  2. Risk perception and risk averting behaviors of public agencies in the emergence and spread of COVID-19 can be retrieved through online social media (Twitter), and such interactions can be echoed in other information outlets. This study collected time-sensitive online social media data and analyzed patterns of health risk communication of public health and emergency agencies in the emergence and spread of novel coronavirus using data-driven methods. The major focus is on understanding how policy-making agencies communicate risk and response information through social media during a pandemic and influence community response (ie, timing of lockdown, timing of reopening, etc.) and disease outbreak indicators (ie, number of confirmed cases and number of deaths). Twitter data from six major public organizations (1,000-4,500 tweets per organization) were collected from February 21, 2020 to June 6, 2020. Several machine learning algorithms, including dynamic topic modeling and sentiment analysis, were applied over time to identify the topic dynamics over the specific timeline of the pandemic. Organizations emphasized various topics at different points in the timeline (eg, importance of wearing face masks, home quarantine, understanding the symptoms, social distancing and contact tracing, emerging community transmission, lack of personal protective equipment, COVID-19 testing and medical supplies, effect of tobacco, pandemic stress management, increasing hospitalization rate, upcoming hurricane season, use of convalescent plasma for COVID-19 treatment, maintaining hygiene, and the role of healthcare podcasts). The findings can help emergency management, policymakers, and public health agencies identify targeted information dissemination policies for a public with diverse needs, based on how local, federal, and international agencies reacted to COVID-19.
  3. Recent studies have documented increases in anti-Asian hate throughout the COVID-19 pandemic. Yet relatively little is known about how anti-Asian content on social media, as well as positive messages to combat the hate, have varied over time. In this study, we investigated temporal changes in the frequency of anti-Asian and counter-hate messages on Twitter during the first 16 months of the COVID-19 pandemic. Using the Twitter Data Collection Application Programming Interface, we queried all tweets from January 30, 2020 to April 30, 2021 that contained specific anti-Asian (e.g., #chinavirus, #kungflu) and counter-hate (e.g., #hateisavirus) keywords. From this initial data set, we extracted a random subset of 1,000 Twitter users who had used one or more anti-Asian or counter-hate keywords. For each of these users, we calculated the total number of anti-Asian and counter-hate keywords posted each month. Latent growth curve analysis revealed that the frequency of anti-Asian keywords fluctuated over time in a curvilinear pattern, increasing steadily in the early months and then decreasing in the later months of our data collection. In contrast, the frequency of counter-hate keywords remained low for several months and then increased in a linear manner. Significant between-user variability in both anti-Asian and counter-hate content was observed, highlighting individual differences in the generation of hate and counter-hate messages within our sample. Together, these findings begin to shed light on longitudinal patterns of hate and counter-hate on social media during the COVID-19 pandemic. 
  4. The objective of this paper is to propose and test a system analytics framework based on social sensing and text mining to detect topic evolution associated with the performance of infrastructure systems in disasters. Social media, like Twitter, as active channels of communication and information dissemination, provide insights into real-time information and first-hand experience from affected areas in mass emergencies. While existing studies show the importance of social sensing in improving situational awareness and emergency response in disasters, the use of social sensing for detection and analysis of infrastructure systems and their resilience performance has been rather limited. This limitation is due to the lack of frameworks to model the evolution of events and topics (e.g., grid interruption and road closure) associated with infrastructure systems (e.g., power, highway, airport, and oil) in times of disasters. The proposed framework detects infrastructure-related topics of the tweets posted in disasters and their evolution by integrating relevant keyword searching, text lemmatization, Part-of-Speech (POS) tagging, TF-IDF vectorization, topic modeling using Latent Dirichlet Allocation (LDA), and K-Means clustering. The application of the proposed framework was demonstrated in a study of infrastructure systems in Houston during Hurricane Harvey. In this case study, more than sixty thousand tweets were retrieved from a 150-mile radius in Houston over 39 days. The analysis of topic detection and evolution from user-generated data was conducted, and the clusters of tweets pertaining to certain topics were mapped in networks over time. The results show that the proposed framework makes it possible to summarize topics and track the movement of situations in different disaster phases.
The analytics elements of the proposed framework can improve the recognition of infrastructure performance through text-based representation and provide evidence for decision-makers to take actionable measures.
  5. During COVID-19, social media has played an important role in helping public health agencies and government stakeholders (i.e., actors) disseminate information about situations, risks, and personal protective actions that inhibit disease spread. However, communications regarding the pandemic and its risks have been notably insufficient, incongruent, and inconsistent, especially at the early stages of the outbreak. Sufficiency, congruence, and consistency in health risk communication have important implications for effective health safety instruction as well as critical content interpretability and recall. They also affect individual- and community-level responses to information. This research employs text mining techniques and dynamic network analysis to investigate the actors’ risk and crisis communication on Twitter regarding message types, communication sufficiency, timeliness, congruence, consistency, and coordination. We studied 13,598 pandemic-relevant tweets posted from January to April by 67 federal and state-level agencies and stakeholders in the U.S. The study annotates 16 categories of message types and analyzes their appearance and evolution. The research then identifies inconsistencies and incongruencies on four critical topics and examines spatial disparities, timeliness, and sufficiency across actors and message types in communicating COVID-19. The network analysis also reveals increased communication coordination over time. The findings provide unprecedented insight into COVID-19 information dissemination on Twitter, which may help inform the future risk and crisis communication strategies of public health agencies and governmental stakeholders for global hazards in digital environments.