Title: WikiConv: A Corpus of the Complete Conversational History of a Large Online Collaborative Community
We present a corpus that encompasses the complete history of conversations between contributors to Wikipedia, one of the largest online collaborative communities. By recording the intermediate states of conversations (including not only comments and replies, but also their modifications, deletions and restorations), this data offers an unprecedented view of online conversation. This level of detail supports new research questions pertaining to the process (and challenges) of large-scale online collaboration. We illustrate the corpus’ potential with two case studies that highlight new perspectives on earlier work. First, we explore how a person’s conversational behavior depends on how they relate to the discussion’s venue. Second, we show that community moderation of toxic behavior happens at a higher rate than previously estimated. Finally, the reconstruction framework is designed to be language agnostic, and we show that it can extract high-quality conversational data in both Chinese and English.
Award ID(s):
1750615 1741441
NSF-PAR ID:
10098199
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the Conference on Empirical Methods in Natural Language Processing
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Online conversations can go in many directions: some turn out poorly due to antisocial behavior, while others turn out positively to the benefit of all. Research on improving online spaces has focused primarily on detecting and reducing antisocial behavior. Yet we know little about positive outcomes in online conversations and how to increase them—is a prosocial outcome simply the lack of antisocial behavior or something more? Here, we examine how conversational features lead to prosocial outcomes within online discussions. We introduce a series of new theory-inspired metrics to define prosocial outcomes such as mentoring and esteem enhancement. Using a corpus of 26M Reddit conversations, we show that these outcomes can be forecasted from the initial comment of an online conversation, with the best model providing a relative 24% improvement over human forecasting performance at ranking conversations for predicted outcome. Our results indicate that platforms can use these early cues in their algorithmic ranking of early conversations to prioritize better outcomes. 
  2. This study investigates the presence of dynamical patterns of interpersonal coordination in extended deceptive conversations across multimodal channels of behavior. Using a novel "devil’s advocate" paradigm, we experimentally elicited deception and truth across topics in which conversational partners either agreed or disagreed, and where one partner was surreptitiously asked to argue an opinion opposite of what he or she really believed. We focus on interpersonal coordination as an emergent behavioral signal that captures interdependencies between conversational partners, both as the coupling of head movements over the span of milliseconds, measured via a windowed lagged cross correlation (WLCC) technique, and more global temporal dependencies across speech rate, using cross recurrence quantification analysis (CRQA). Moreover, we considered how interpersonal coordination might be shaped by strategic, adaptive conversational goals associated with deception. We found that deceptive conversations displayed more structured speech rate and higher head movement coordination, the latter with a peak in deceptive disagreement conversations. Together the results allow us to posit an adaptive account, whereby interpersonal coordination is not beholden to any single functional explanation, but can strategically adapt to diverse conversational demands. 
  3. Understanding and characterizing how people interact in information-seeking conversations will be a crucial component in developing effective conversational search systems. In this paper, we introduce a new dataset designed for this purpose and use it to analyze information-seeking conversations by user intent distribution, co-occurrence, and flow patterns. The MSDialog dataset is a labeled conversation dataset of question answering (QA) interactions between information seekers and providers from an online forum on Microsoft products. The dataset contains more than 2,000 multi-turn QA dialogs with 10,000 utterances that are annotated with user intents on the utterance level. Annotations were done using crowdsourcing. With MSDialog, we find some highly recurring patterns in user intent during an information-seeking process. They could be useful for designing conversational search systems. We will make our dataset freely available to encourage exploration of information-seeking conversation models. 
  4. Advancing speech emotion recognition (SER) depends highly on the source used to train the model, i.e., the emotional speech corpora. By permuting different design parameters, researchers have released versions of corpora that attempt to provide a better-quality source for training SER. In this work, we focus on studying communication modes of collection. In particular, we analyze the patterns of emotional speech collected during interpersonal conversations or monologues. While it is well known that conversation provides a better protocol for eliciting authentic emotion expressions, there is a lack of systematic analyses to determine whether conversational speech provides a “better-quality” source. Specifically, we examine this research question from three perspectives: perceptual differences, acoustic variability and SER model learning. Our analyses on the MSP-Podcast corpus show that: 1) raters’ consistency for conversation recordings is higher when evaluating categorical emotions, 2) the perceptions and acoustic patterns observed on conversations have properties that are better aligned with expected trends discussed in emotion literature, and 3) a more robust SER model can be trained from conversational data. This work brings initial evidence that samples of conversations may provide a better-quality source than samples from monologues for building a SER model.
  5. This article articulates conceptual and methodological strategies for studying the dynamic structure of dyadic interaction revealed by the turn-to-turn exchange of messages between partners. Using dyadic time series data that capture partners’ back-and-forth contributions to conversations, dynamic dyadic systems analysis illuminates how individuals act and react to each other as they jointly construct conversations. Five layers of inquiry are offered, each of which yields theoretically relevant information: (a) identifying the individual moves and dyadic spaces that set the stage for dyadic interaction; (b) summarizing conversational units and sequences; (c) examining between-dyad differences in overall conversational structure; (d) describing the temporal evolution of conversational units and sequences; and (e) mapping within-dyad dynamics of conversations and between-dyad differences in those dynamics. Each layer of analysis is illustrated using examples from research on supportive conversations, and the application of dynamic dyadic systems analysis to a range of interpersonal communication phenomena is discussed.