skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 8:00 PM ET on Friday, March 21 until 8:00 AM ET on Saturday, March 22 due to maintenance. We apologize for the inconvenience.


Search for: All records

Creators/Authors contains: "Hemphill, Libby"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Harmful textual content is pervasive on social media, poisoning online communities and negatively impacting participation. A common approach to this issue is developing detection models that rely on human annotations. However, the tasks required to build such models expose annotators to harmful and offensive content and may require significant time and cost to complete. Generative AI models have the potential to understand and detect harmful textual content. We used ChatGPT to investigate this potential and compared its performance with MTurker annotations for three frequently discussed concepts related to harmful textual content on social media: Hateful, Offensive, and Toxic (HOT). We designed five prompts to interact with ChatGPT and conducted four experiments eliciting HOT classifications. Our results show that ChatGPT can achieve an accuracy of approximately 80% when compared to MTurker annotations. Specifically, the model displays a more consistent classification for non-HOT comments than HOT comments compared to human annotations. Our findings also suggest that ChatGPT classifications align with the provided HOT definitions. However, ChatGPT classifies “hateful” and “offensive” as subsets of “toxic.” Moreover, the choice of prompts used to interact with ChatGPT impacts its performance. Based on these insights, our study provides several meaningful implications for employing ChatGPT to detect HOT content, particularly regarding the reliability and consistency of its performance, its understanding and reasoning of the HOT concept, and the impact of prompts on its performance. Overall, our study provides guidance on the potential of using generative AI models for moderating large volumes of user-generated textual content on social media.

     
    more » « less
    Free, publicly-accessible full text available May 31, 2025
  2. As content moderation becomes a central aspect of all social media platforms and online communities, interest has grown in how to make moderation decisions contestable. On social media platforms where individual communities moderate their own activities, the responsibility to address user appeals falls on volunteers from within the community. While there is a growing body of work devoted to understanding and supporting the volunteer moderators' workload, little is known about their practice of handling user appeals. Through a collaborative and iterative design process with Reddit moderators, we found that moderators spend considerable effort in investigating user ban appeals and desired to directly engage with users and retain their agency over each decision. To fulfill their needs, we designed and built AppealMod, a system that induces friction in the appeals process by asking users to provide additional information before their appeals are reviewed by human moderators. In addition to giving moderators more information, we expected the friction in the appeal process would lead to a selection effect among users, with many insincere and toxic appeals being abandoned before getting any attention from human moderators. To evaluate our system, we conducted a randomized field experiment in a Reddit community of over 29 million users that lasted for four months. As a result of the selection effect, moderators viewed only 30% of initial appeals and less than 10% of the toxically worded appeals; yet they granted roughly the same number of appeals when compared with the control group. Overall, our system is effective at reducing moderator workload and minimizing their exposure to toxic content while honoring their preference for direct engagement and agency in appeals.

     
    more » « less
    Free, publicly-accessible full text available April 17, 2025
  3. Researchers need to be able to find, access, and use data to participate in open science. To understand how users search for research data, we analyzed textual queries issued at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). We collected unique user queries from 988,475 user search sessions over four years (2012-16). Overall, we found that only 30% of site visitors entered search terms into the ICPSR website. We analyzed search strategies within these sessions by extending existing dataset search taxonomies to classify a subset of the 1,554 most popular queries. We identified five categories of commonly-issued queries: keyword-based (e.g., date, place, topic); name (e.g., study, series); identifier (e.g., study, series); author (e.g., institutional, individual); and type (e.g., file, format). While the dominant search strategy used short keywords to explore topics, directed searches for known items using study and series names were also common. We further distinguished exploratory browsing from directed search queries based on their page views, refinements, search depth, duration, and length. Directed queries were longer (i.e., they had more words), while sessions with exploratory queries had more refinements and associated page views. By comparing search interactions at ICPSR to other natural language interactions in similar web search contexts, we conclude that dataset search at ICPSR is underutilized. We envision how alternative search paradigms, such as those enabled by recommender systems, can enhance dataset search.

     
    more » « less
    Free, publicly-accessible full text available March 28, 2025
  4. This dataset contains trace data describing user interactions with the Inter-university Consortium for Political and Social Research website (ICPSR). We gathered site usage data from Google Analytics. We focused our analysis on user sessions, which are groups of interactions with resources (e.g., website pages) and events initiated by users. ICPSR tracks a subset of user interactions (i.e., other than page views) through event triggers. We analyzed sequences of interactions with resources, including the ICPSR data catalog, variable index, data citations collected in the ICPSR Bibliography of Data-related Literature, and topical information about project archives. As part of our analysis, we calculated the total number of unique sessions and page views in the study period. Data in our study period fell between September 1, 2012, and 2016. ICPSR's website was updated and relaunched in September 2012 with new search functionality, including a Social Science Variables Database (SSVD) tool. ICPSR then reorganized its website and changed its analytics collection procedures in 2016, marking this as the cutoff date for our analysis. Data are relevant for two reasons. First, updates to the ICPSR website during the study period focused only on front-end design rather than the website's search functionality. Second, the core features of the website over the period we examined (e.g., faceted and variable search, standardized metadata, the use of controlled vocabularies, and restricted data applications) are shared with other major data archives, making it likely that the trends in user behavior we report are generalizable.
     
    more » « less
  5. Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter‐university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine‐readability of data and its documentation. There are opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot‐based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data. 
    more » « less
  6. Social scientists increasingly share data so others can evaluate, replicate, and extend their research. To understand the process of data discovery as a precursor to data use, we study prospective users’ interactions with archived data. We gathered data for 98,000 user sessions initiated at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). Our data reflect four years (2012-16) of users’ interactions with archival resources, including a data catalog, study-level metadata, variables, and publications that cite nearly 10,000 datasets. We constructed a network of user interactions linking website landing (e.g., site entrances) to exit pages, from which we identified three types of paths that users take through the research data archive: direct, orienting, and scenic. We also interpreted points of failure (e.g., drop-offs) and recurring behaviors (e.g., sensemaking) that support or impede data discovery along search paths. We articulate strategies that users adopt as they navigate data search and suggest ways to enhance the accessibility of data, metadata, and the systems that organize each. 
    more » « less
  7. Abstract

    Social media data offer a rich resource for researchers interested in public health, labor economics, politics, social behaviors, and other topics. However, scale and anonymity mean that researchers often cannot directly get permission from users to collect and analyze their social media data. This article applies the basic ethical principle of respect for persons to consider individuals’ perceptions of acceptable uses of data. We compare individuals’ perceptions of acceptable uses of other types of sensitive data, such as health records and individual identifiers, with their perceptions of acceptable uses of social media data. Our survey of 1018 people shows that individuals think of their social media data as moderately sensitive and agree that it should be protected. Respondents are generally okay with researchers using their data in social research but prefer that researchers clearly articulate benefits and seek explicit consent before conducting research. We argue that researchers must ensure that their research provides social benefits worthy of individual risks and that they must address those risks throughout the research process.

     
    more » « less
  8. Abstract

    Data archives are an important source of high-quality data in many fields, making them ideal sites to study data reuse. By studying data reuse through citation networks, we are able to learn how hidden research communities—those that use the same scientific data sets—are organized. This paper analyzes the community structure of an authoritative network of data sets cited in academic publications, which have been collected by a large, social science data archive: the Interuniversity Consortium for Political and Social Research (ICPSR). Through network analysis, we identified communities of social science data sets and fields of research connected through shared data use. We argue that communities of exclusive data reuse form “subdivisions” that contain valuable disciplinary resources, while data sets at a “crossroads” broadly connect research communities. Our research reveals the hidden structure of data reuse and demonstrates how interdisciplinary research communities organize around data sets as shared scientific inputs. These findings contribute new ways of describing scientific communities to understand the impacts of research data reuse.

     
    more » « less