Title: Statewise: Human Identity Investigator for the United States
Self-reported biographical strings on social media profiles provide a powerful tool to study personal identity. We present Statewise, a dataset based on 50 million unique Twitter user profiles, observed over a 12-year period and identified as located in the United States. Users within this dataset can be accurately partitioned into 52 states/territories at each observation, allowing queries into state-specific language choices over time. We report on the major design decisions underlying Statewise, including the methodology behind the location detection system and measurements of user/state transitions across time. We demonstrate the power of Statewise to study the relative prevalences of different token groups, showing clear and consistent regional differences in language usage. We analyze emoji usage by comparing inclusion rates against external state-level statistics, finding that emoji inclusion shares a significant correlation with state unemployment and poverty rates. Finally, we use Gini coefficients as a measure of token usage inequality across all observed territories and demonstrate a clear stratification based on token content.
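As a rough illustration of the Gini-based inequality measure mentioned in the abstract, the sketch below computes a Gini coefficient over per-territory usage rates for a single token. The abstract does not specify how the authors compute it, so this uses the standard sorted-values formula as an assumption; the function name `gini` and the sample rates are hypothetical and are not taken from the dataset.

```python
def gini(values):
    """Gini coefficient of non-negative values.

    0.0 means usage is spread perfectly evenly; values approaching 1.0
    mean usage is concentrated in a few territories.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted data with ranks i = 1..n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-territory inclusion rates for two tokens (illustrative only).
evenly_used_token = [0.020, 0.021, 0.019, 0.020]
regional_token = [0.000, 0.001, 0.000, 0.045]

print(gini(evenly_used_token))
print(gini(regional_token))
```

Under this definition, a token used at similar rates everywhere scores near 0, while a token concentrated in one territory scores close to the maximum of (n - 1) / n for n territories.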
Award ID(s):
2208664
PAR ID:
10657606
Author(s) / Creator(s):
; ;
Publisher / Repository:
Proceedings of the International AAAI Conference on Web and Social Media
Date Published:
Journal Name:
Proceedings of the International AAAI Conference on Web and Social Media
Volume:
19
ISSN:
2162-3449
Page Range / eLocation ID:
2465 to 2476
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Self-reported biographical strings on social media profiles provide a powerful tool to study self-identity. We present HINENI, a dataset of 420 million Twitter user profiles collected over a 12-year period, partitioned into 32 distinct national cohorts, which we believe is the largest publicly available data resource for identity research. We report on the major design decisions underlying HINENI, including a new notion of sampling (k-persistence) which spans the divide between traditional cross-sectional and longitudinal approaches. We demonstrate the power of HINENI to study the relative survival rate (half-life) of different tokens, and the use of emoji analysis across national cohorts to study the effects of gender, national, and sports identities.
  2. Emoji are commonly used in social media to convey affects, emotions, and attitudes. While popular in social media, their use in educational contexts has been sparsely studied even though emoji can be a natural way for students to express what they are feeling about the learning material. This paper studies how students use instructor-selected emoji when relating to and engaging with educational content. We use an online platform for collaborative annotations where discussions are embedded into the readings for the course. We also make it possible for students to use 11 unique emoji-hashtag pairings to express their thoughts and feelings about the readings and the ongoing discussion. We provide an empirical analysis of the usage of these emoji-hashtag pairs by over 1,800 students enrolled in different offerings of an introductory biology course from multiple academic terms. We also introduce a heat map, which allows the instructional team to visualize the distribution and types of emoji used by students in different parts of the reading material. To evaluate the heat map, we conducted a user study with five instructors/TAs. We found that instructors/TAs use the heat map as a tool for identifying textbook sections that students find difficult and/or interesting and plan to use it to help them design the online content for future classes. Finally, we introduce a computational analysis for predicting emoji/hashtag pairs based on the content of a given student post. We use pre-trained deep learning language models (BERT) to predict the emoji attached to a student’s post and then study the extent to which this model generated in an introductory biology course can be generalized to predict student emoji usage in other courses.
  3. Mitrovic, A.; Bosch, N. (Ed.)
    Emoji are commonly used in social media to convey attitudes and emotions. While popular, their use in educational contexts has been sparsely studied. This paper reports on students’ use of emoji in an online course forum in which students annotate and discuss course material in the margins of the online textbook. For this study, instructors created 11 custom emoji-hashtag pairs that enabled students to quickly communicate affects and reactions in the forum that they experienced while interacting with the course material. Examples include inviting discussion about a topic, declaring a topic interesting, or requesting assistance with a topic. We analyze emoji usage by over 1,800 students enrolled in multiple offerings of the same course across multiple academic terms. The data show that some emoji frequently appear together in posts associated with the same paragraphs, suggesting that students combine emoji in this way to communicate complex affective states. We explore the use of computational models for predicting emoji at the post level, even when posts are lacking emoji. This capability can allow instructors to infer information about students’ affective states during their “at home” interactions with course readings. Finally, we show that partitioning the emoji into distinct groups, rather than trying to predict individual emoji, can be both of pedagogical value to instructors and improve the predictive performance of our approach using the BERT language model. Our procedure can be generalized to other courses and for the benefit of other instructors.
  4. Background: Stay-at-home orders were one of the controversial interventions to curb the spread of COVID-19 in the United States. The stay-at-home orders, implemented in 51 states and territories between March 7 and June 30, 2020, impacted the lives of individuals and communities and accelerated the heavy usage of web-based social networking sites. Twitter sentiment analysis can provide valuable insight into public health emergency response measures and allow for better formulation and timing of public health measures released in response to future public health emergencies. Objective: This study evaluated how stay-at-home orders affect Twitter sentiment in the United States. Furthermore, this study aimed to understand the feedback on stay-at-home orders from groups with different circumstances and backgrounds. In addition, we particularly focused on vulnerable groups, including older people, groups with underlying medical conditions, small and medium enterprises, and low-income groups. Methods: We constructed a multiperiod difference-in-differences regression model based on the Twitter sentiment geographical index, quantified from 7.4 billion geo-tagged tweets, to analyze the dynamics of sentiment feedback on stay-at-home orders across the United States. In addition, we used moderated effects analysis to assess differential feedback from vulnerable groups. Results: We combed through the implementation of stay-at-home orders, the Twitter sentiment geographical index, and the number of confirmed cases and deaths in 51 US states and territories. We identified trend changes in public sentiment before and after the stay-at-home orders. Regression results showed that stay-at-home orders generated a positive response, contributing to a recovery in Twitter sentiment. However, vulnerable groups faced greater shocks and hardships during the COVID-19 pandemic. In addition, economic and demographic characteristics had a significant moderating effect.
    Conclusions: This study showed a clear positive shift in public opinion about COVID-19, with this positive impact occurring primarily after stay-at-home orders. However, this positive sentiment is time-limited: after 14 days, people are more influenced by the status quo and trends, so feedback on the stay-at-home orders is no longer positively significant. In particular, negative sentiment is more likely to be generated in states with a large proportion of vulnerable groups, and the policy plays a limited role. The pandemic hit older people, those with underlying diseases, and small and medium enterprises directly but hurt states with cross-cutting economic situations and more complex demographics over time. Based on large-scale Twitter data, this sociological perspective allows us to monitor the evolution of public opinion more directly, assess the impact of social events on public opinion, and understand the heterogeneity in the face of pandemic shocks.
  5. Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high-quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and to promote robust and reliable samples whose neighbors also show high quality with fewer local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at https://github.com/Dynamite321/T-SHIRT.