Self-reported biographical strings on social media profiles provide a powerful tool to study self-identity. We present HINENI, a dataset of 420 million Twitter user profiles collected over a 12 year period, partitioned into 32 distinct national cohorts, which we believe is the largest publicly available data resource for identity research. We report on the major design decisions underlying HINENI, including a new notion of sampling (k-persistence) which spans the divide between traditional cross-sectional and longitudinal approaches. We demonstrate the power of HINENI to study the relative survival rate (half-life) of different tokens, and the use of emoji analysis across national cohorts to study the effects of gender, national, and sports identities.
more »
« less
This content will become publicly available on June 7, 2026
Statewise: Human Identity Investigator for the United States
Self-reported biographical strings on social media profiles provide a powerful tool to study personal identity. We present Statewise, a dataset based on 50 million unique Twitter user profiles over a 12 year period identified to be in the United States. Users within this dataset can be accurately partitioned into 52 states/territories at each observation, allowing queries into state-specific language choices over time. We report on the major design decisions underlying Statewise, including the methodology behind the location detection system and measurements of user/state transitions across time. We demonstrate the power of Statewise to study the relative prevalences of different token groups, showing clear and consistent regional differences in language usage. We analyze emoji usage by comparing inclusion rates against external state-level statistics, finding that emoji inclusion shares a significant correlation with state unemployment and poverty rates. Finally, we use Gini coefficients as a measure of token usage inequality across all observed territories and demonstrate a clear stratification based on token content.
more »
« less
- Award ID(s):
- 2208664
- PAR ID:
- 10657606
- Publisher / Repository:
- Proceedings of the International AAAI Conference on Web and Social Media
- Date Published:
- Journal Name:
- Proceedings of the International AAAI Conference on Web and Social Media
- Volume:
- 19
- ISSN:
- 2162-3449
- Page Range / eLocation ID:
- 2465 to 2476
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Mitrovic, A.; Bosch, N. (Ed.)Emoji are commonly used in social media to convey attitudes and emotions. While popular, their use in educational contexts has been sparsely studied. This paper reports on the students’ use of emoji in an online course forum in which students annotate and discuss course material in the margins of the online textbook. For this study, instructors created 11 custom emoji-hashtag pairs that enabled students to quickly communicate affects and reactions in the forum that they experienced while interacting with the course material. Example reporting includes, inviting discussion about a topic, declaring a topic as interesting, or requesting assistance about a topic. We analyze emoji usage by over 1,800 students enrolled in multiple offerings of the same course across multiple academic terms. The data show that some emoji frequently appear together in posts associated with the same paragraphs, suggesting that students use the emoji in this way to communicating complex affective states. We explore the use of computational models for predicting emoji at the post level, even when posts are lacking emoji. This capability can allow instructors to infer information about students’ affective states during their ”at home” interactions with course readings. Finally, we show that partitioning the emoji into distinct groups, rather than trying to predict individual emoji, can be both of pedagogical value to instructors and improve the predictive performance of our approach using the BERT language model. Our procedure can be generalized to other courses and for the benefit of other instructors.more » « less
-
Abstract Emoji are commonly used in social media to convey affects, emotions, and attitudes. While popular in social media, their use in educational contexts has been sparsely studied even though emoji can be a natural way for students to express what they are feeling about the learning material. This paper studies how students use instructor-selected emoji when relating to and engaging with educational content. We use an online platform for collaborative annotations where discussions are embedded into the readings for the course. We also make it possible for students to use 11 unique emoji-hashtag pairings to express their thoughts and feelings about the readings and the ongoing discussion. We provide an empirical analysis of the usage of these emoji-hashtag pairs by over 1,800 students enrolled in different offerings of an introductory biology course from multiple academic terms. We also introduce a heat map, which allows the instructional team to visualize the distribution and types of emoji used by students in different parts of the reading material. To evaluate the heat map, we conducted a user study with five instructors/TAs. We found that instructors/TAs use the heat map as a tool for identifying textbook sections that students find difficult and/or interesting and plan to use it to help them design the online content for future classes. Finally, we introduce a computational analysis for predicting emoji/hashtag pairs based on the content of a given student post. We use pre-trained deep learning language models (BERT) to predict the emoji attached to a student’s post and then study the extent to which this model generated in an introductory biology course can be generalized to predict student emoji usage in other courses.more » « less
-
Instruction tuning is essential for Large Language Models (LLMs) to effectively follow user instructions. To improve training efficiency and reduce data redundancy, recent works use LLM-based scoring functions, e.g., Instruction-Following Difficulty (IFD), to select high–quality instruction-tuning data with scores above a threshold. While these data selection methods often lead to models that can match or even exceed the performance of models trained on the full datasets, we identify two key limitations: (i) they assess quality at the sample level, ignoring token-level informativeness; and (ii) they overlook the robustness of the scoring method, often selecting a sample due to superficial lexical features instead of its true quality. In this work, we propose Token-Selective HIeRarchical Data Selection for Instruction Tuning (T-SHIRT), a novel data selection framework that introduces a new scoring method to include only informative tokens in quality evaluation and also promote robust and reliable samples whose neighbors also show high quality with less local inconsistencies. We demonstrate that models instruction-tuned on a curated dataset (only 5% of the original size) using T-SHIRT can outperform those trained on the entire large-scale dataset by up to 5.48 points on average across eight benchmarks. Across various LLMs and training set scales, our method consistently surpasses existing state-of-the-art data selection techniques, while also remaining both cost-effective and highly efficient. For instance, by using GPT-2 for score computation, we are able to process a dataset of 52k samples in 40 minutes on a single GPU. Our code is available at https://github.com/Dynamite321/T-SHIRT.more » « less
-
This dataset includes information regarding 55 State Hazard Mitigation Plans (SHMPs) for all 50 U.S. states and the 5 inhabited U.S. territories. Washington DC’s SHMP is not publicly available and can only be accessed through signing a Non-Disclosure Agreement. As such, it is not included in our dataset. These plans were approved by FEMA between 2016-2021. This data publication includes: (1) the State Hazard Mitigation Plan dataset; (2) a data dictionary with description of each variable output; and (3) variable definitions for the population groups included in State Hazard Mitigation Plans. The dataset was generated through reviewing publicly available plans and coding for socially vulnerable populations. This dataset allows for quantitative analysis of the inclusion (or exclusion) of vulnerable populations in SHMPs. The envisioned audience for this data and information includes government officials, researchers, and others who are interested in understanding patterns in the inclusion or exclusion of socially vulnerable populations in SHMPs.more » « less
An official website of the United States government
