Title: SCStory: Self-supervised and Continual Online Story Discovery
We present SCStory, a framework for online story discovery that helps people digest rapidly published news article streams in real time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them based on representation similarity. However, these methods yield noisy and inaccurate story discovery results because the generic article embeddings do not effectively reflect the story-indicative semantics in an article and cannot adapt to the rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies story-relevant information in news articles and uses it to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed by two unique techniques, confidence-aware memory replay and prioritized augmentation, which address the label absence and data scarcity problems. Thorough experiments on real and recent news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.
Award ID(s):
1956151 1741317 1704532
Corporate Creator(s):
Proc. 2023 The Web Conf. 
Publisher / Repository:
Date Published:
Edition / Version:
Page Range / eLocation ID:
1853 to 1864
Subject(s) / Keyword(s):
text mining, Self-supervised and Continual Online Story Discovery, stream mining
Austin TX USA
Sponsoring Org:
National Science Foundation
  1. Proc. 2023 ACM SIGIR Int. Conf. on Research and Development in Information Retrieval (Ed.)
    Unsupervised discovery of stories with correlated news articles in real time helps people digest massive news streams without expensive human annotations. A common approach in existing studies of unsupervised online story discovery is to represent news articles with symbolic- or graph-based embeddings and incrementally cluster them into stories. Recent large language models are expected to improve the embeddings further, but a straightforward adoption of the models that indiscriminately encodes all information in articles is ineffective for text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework, USTORY, is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings.
  2. In March 2020, the global COVID-19 pandemic forced universities across the United States to immediately stop face-to-face activities and transition to virtual instruction. While this transition was not easy for anyone, the shift to online learning was especially difficult for STEM courses, particularly engineering, which has a strong practical/laboratory component. Additionally, underrepresented students (URMs) in engineering experienced a range of difficulties during this transition. The purpose of this paper is to highlight underrepresented engineering students’ experiences as a result of COVID-19. In particular, we aim to highlight stories shared by participants who indicated a desire to share their experience with their instructor. In order to better understand these experiences, research participants were asked to share a story, using the novel data collection platform SenseMaker, based on the following prompt: Imagine you are chatting with a friend or family member about the evolving COVID-19 crisis. Tell them about something you have experienced recently as an engineering student. Conducting a SenseMaker study involves four iterative steps: 1) Initiation is the process of designing signifiers, testing, and deploying the instrument; 2) Story Collection is the process of collecting data through narratives; 3) Sense-making is the process of exploring and analyzing patterns of the collection of narratives; and 4) Response is the process of amplifying positive stories and dampening negative stories to nudge the system to an adjacent possible (Van der Merwe et al. 2019). Unlike traditional surveys or other qualitative data collection methods, SenseMaker encourages participants to think more critically about the stories they share by inviting them to make sense of their story using a series of triads and dyads. 
After completing their narrative, participants were asked a series of triadic, dyadic, and sentiment-based multiple-choice questions (MCQs) relevant to their story. One MCQ in particular that participants were required to answer was “If you could do so without fear of judgment or retaliation, who would you share this story with?”, with the following options: 1) Family 2) Instructor 3) Peers 4) Prefer not to answer 5) Other. A third of the participants indicated that they would share their story with their instructor. Therefore, we further explored this particular question. Additionally, this paper aims to highlight this subset of students whose primary motivation for their actions was based on Necessity. High-level qualitative findings from the data show that students valued Grit and Perseverance, recent experiences influenced their Sense of Purpose, and their decisions were largely made based on Intuition. Chi-squared tests showed no significant differences between race and the desire to share with their instructor; however, there were significant differences when factoring in gender, suggesting that gender has a large impact on the complexity of navigating school during this time. Lastly, ~50% of participants reported feeling negative or extremely negative about their experiences, ~30% reported feeling neutral, and ~20% reported feeling positive or extremely positive about their experiences. In the study, a total of 500 micro-narratives from underrepresented engineering students were collected from June to July 2020. Undergraduate and graduate students were recruited for participation through the researchers’ personal networks, social media, and organizations like NSBE. Participants had the option to indicate who is able to read their stories: 1) Everyone, 2) Researchers Only, or 3) No one. This work presents qualitative stories of those who granted permission for everyone to read.
  3. The ability to quickly learn fundamentals about a new infectious disease, such as how it is transmitted, the incubation period, and related symptoms, is crucial in any novel pandemic. For instance, rapid identification of symptoms can enable interventions for dampening the spread of the disease. Traditionally, symptoms are learned from research publications associated with clinical studies. However, clinical studies are often slow and time intensive, and the resulting delays can have dire consequences in a rapidly spreading pandemic, as we have seen with COVID-19. In this article, we introduce SymptomID, a modular artificial intelligence–based framework for rapid identification of symptoms associated with novel pandemics using publicly available news reports. SymptomID is built on a state-of-the-art natural language processing model (Bidirectional Encoder Representations from Transformers) to extract symptoms from publicly available news reports and to cluster related symptoms together to remove redundancy. Our proposed framework requires minimal training data because it builds on a pre-trained language model. In this study, we present a case study of SymptomID using news articles about the current COVID-19 pandemic. Our COVID-19 symptom extraction module, trained on 225 articles, achieves an F1 score of over 0.8. SymptomID can correctly identify well-established symptoms (e.g., “fever” and “cough”) and less-prevalent symptoms (e.g., “rashes,” “hair loss,” and “brain fog”) associated with the novel coronavirus. We believe this framework can be extended and easily adapted in future pandemics to quickly learn relevant insights that are fundamental for understanding and combating a new infectious disease.
  4. This study analyzes and compares how the digital semantic infrastructure of U.S.-based digital news varies according to certain characteristics of the media outlet, including the community it serves, the content management system (CMS) it uses, and its institutional affiliation (or lack thereof). Through a multi-stage analysis of the actual markup found on news outlets’ online text articles, we reveal how multiple factors may be limiting the discoverability and reach of online media organizations focused on serving specific communities. Conceptually, we identify markup and metadata as aspects of the semantic infrastructure underpinning platforms’ mechanisms of distributing online news. Given the significant role that these platforms play in shaping the broader visibility of news content, we further contend that this markup constitutes a kind of infrastructure of visibility by which news sources and voices are rendered accessible or, conversely, invisible in the wider platform economy of journalism. We accomplish our analysis by first identifying key forms of digital markup whose structured data is designed to make online news articles more readily discoverable by search engines and social media platforms. We then analyze 2,226 digital news stories gathered from the main pages of 742 national, local, Black, and other identity-based news organizations in mid-2021, checking each for the presence of specific tags reflecting the, OpenGraph, and Twitter metadata structures. We then evaluate the relationship between audience focus and the robustness of this digital semantic infrastructure. While we find only a weak relationship between the markup and the community served, additional analysis revealed a much stronger association between these metadata tags and the content management system (CMS): 80% of the attributes appearing on an article were the same for a given CMS, regardless of publisher, market, or audience focus.
Based on this finding, we identify the organizational characteristics that may influence the specific CMS used for digital publishing, and, therefore, the robustness of the digital semantic infrastructure deployed by the organization. Finally, we reflect on the potential implications of the highly disparate tag use we observe, particularly with respect to the broader visibility of online news designed to serve particular US communities. 
  5. Abstract

    What if we used the stories that researchers and practitioners tell each other as tools to advance interdisciplinary disaster research? This article hypothesizes that doing so could foster a new mode of collaborative learning and discovery. People, including researchers, regularly tell stories to relate “what happened” based on their experience, often in ways that augment or contradict existing understandings. These stories provide naturalistic descriptions of context, complexity, and dynamic relationships in ways that formal theories, static data, and interpretations of findings can miss. They often do so memorably and engagingly, which makes them beneficial to researchers across disciplines and allows them to be integrated into their own work. Seeking out, actively inviting, sharing, and discussing these stories in interdisciplinary teams that have developed a strong sense of trust can therefore provide partial escape from discipline‐specific reasoning and frameworks that are so often unconsciously employed. To develop and test this possibility, this article argues that the diverse and rapidly growing hazards and disaster field needs to incorporate a basic theoretical understanding of stories, building from folkloristics and other sources. It would also need strategies to draw out and build from stories in suitable interdisciplinary research forums and, in turn, to find ways to incorporate the discussions that emanate from stories into ongoing analyses, interpretations, and future lines of interdisciplinary inquiry.
