We model coordination and coregulation patterns in 33 triads engaged in collaboratively solving a challenging computer programming task for approximately 20 minutes. Our goal is to prospectively model speech rate (words/sec) – an important signal of turn taking and active participation – of one teammate (A, B, or C) from time-lagged nonverbal signals (speech rate and acoustic-prosodic features) of the other two (i.e., A + B → C; A + C → B; B + C → A) and task-related context features. We trained feed-forward neural networks (FFNNs) and long short-term memory recurrent neural networks (LSTMs) using group-level nested cross-validation. LSTMs outperformed FFNNs and a chance baseline and could predict speech rate up to 6 s into the future. A multimodal combination of speech rate, acoustic-prosodic, and task context features outperformed unimodal and bimodal signals. The extent to which the models could predict an individual’s speech rate was positively related to that individual’s scores on a subsequent posttest, suggesting a link between coordination/coregulation and collaborative learning outcomes. We discuss applications of the models for real-time systems that monitor the collaborative process and intervene to promote positive collaborative outcomes.
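As a rough illustration of the kind of sequence model described above, the sketch below shows a minimal PyTorch LSTM that maps a window of time-lagged features from two teammates (speech rate plus acoustic-prosodic features) and task-context features to the third teammate's future speech rate. The feature dimensions, layer sizes, and prediction horizon framing are illustrative assumptions, not the authors' configuration.

```python
# Minimal illustrative sketch (assumptions throughout, not the authors' implementation):
# an LSTM that reads a window of time-lagged signals from teammates B and C plus
# task-context features and regresses teammate A's future speech rate (words/sec).
import torch
import torch.nn as nn

class SpeechRateLSTM(nn.Module):
    def __init__(self, n_features: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)  # predicted words/sec

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_steps, n_features), i.e. lagged B+C signals and task context
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1]).squeeze(-1)

# Toy usage: 8 windows of 20 one-second steps with 12 features each.
model = SpeechRateLSTM(n_features=12)
x = torch.randn(8, 20, 12)
y_hat = model(x)                          # predicted future speech rate, shape (8,)
loss = nn.MSELoss()(y_hat, torch.rand(8))
loss.backward()
```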
Propositional Extraction from Collaborative Naturalistic Dialogues
In the realm of collaborative learning, extracting the beliefs shared within a group is a critical capability for navigating complex tasks. Inherent in this problem is the fact that in naturalistic collaborative discourse, the same propositional content may be expressed in radically different ways. This difficulty is exacerbated when speech overlaps and other communicative modalities are used, as would be the case in a co-situated collaborative task. In this paper, we conduct a comparative methodological analysis of extraction techniques for task-relevant propositions from natural speech dialogues in a challenging shared task setting where participants collaboratively determine the weights of five blocks using only a balance scale. We encode utterances and candidate propositions through language models and compare a cross-encoder method, adapted from coreference research, to a vector similarity baseline. Our cross-encoder approach outperforms both a cosine similarity baseline and zero-shot inference by the GPT-4 and LLaMA 2 language models, and we establish a novel baseline on this challenging task on two collaborative task datasets, the Weights Task and DeliData, showing the generalizability of our approach. Furthermore, we explore the use of state-of-the-art large language models for data augmentation to enhance performance, extend our examination to transcripts generated by Google's Automatic Speech Recognition system to assess the potential for automating the propositional extraction process in real time, and introduce a framework for live propositional extraction from natural speech and multimodal signals. This study not only demonstrates the feasibility of detecting collaboration-relevant content in unstructured interactions but also lays the groundwork for employing AI to enhance collaborative problem-solving in classrooms and other collaborative settings, such as the workforce. Our code may be found at https://github.com/csu-signal/PropositionExtraction.
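For readers unfamiliar with the two retrieval styles being compared, the sketch below contrasts a bi-encoder cosine-similarity baseline with a cross-encoder that scores each (utterance, candidate proposition) pair jointly. The checkpoints, example utterance, and candidate propositions are illustrative assumptions, not the configuration or data used in the paper.

```python
# Minimal sketch (assumed off-the-shelf checkpoints, not the paper's models):
# rank candidate task-relevant propositions against an utterance with
# (1) a cosine-similarity baseline over sentence embeddings and
# (2) a cross-encoder that reads each (utterance, proposition) pair jointly.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

utterance = "so the red one has to be heavier than the blue one"
candidates = [
    "red block > blue block",
    "blue block > red block",
    "red block = 10 grams",
]

# (1) Bi-encoder baseline: embed utterance and candidates separately, rank by cosine.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb_u = bi_encoder.encode(utterance, convert_to_tensor=True)
emb_c = bi_encoder.encode(candidates, convert_to_tensor=True)
cosine_scores = util.cos_sim(emb_u, emb_c)[0]

# (2) Cross-encoder: jointly encode each pair and output a relevance score.
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
pair_scores = cross_encoder.predict([(utterance, c) for c in candidates])

for cand, cos, ce in zip(candidates, cosine_scores.tolist(), pair_scores):
    print(f"{cand!r:30}  cosine={cos:.3f}  cross-encoder={ce:.3f}")
```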
- Award ID(s): 2019805
- PAR ID: 10588539
- Publisher / Repository: Zenodo
- Date Published:
- Journal Name: Journal of Educational Data Mining
- ISSN: 2157-2100
- Subject(s) / Keyword(s): collaborative problem solving; propositional extraction; natural speech; natural language processing; dialogue analysis
- Format(s): Medium: X
- Right(s): Creative Commons Attribution 4.0 International
- Sponsoring Org: National Science Foundation
More Like this
- Podcasts have become daily companions for half a billion users. Given the enormous amount of podcast content available, highlights provide a valuable signal that helps viewers get the gist of an episode and decide if they want to invest in listening to it in its entirety. However, identifying highlights automatically is challenging due to the unstructured and long-form nature of the content. We introduce Rhapsody, a dataset of 13K podcast episodes paired with segment-level highlight scores derived from YouTube's 'most replayed' feature. We frame podcast highlight detection as a segment-level binary classification task. We explore various baseline approaches, including zero-shot prompting of language models and lightweight finetuned language models using segment-level classification heads. Our experimental results indicate that even state-of-the-art language models like GPT-4o and Gemini struggle with this task, while models finetuned with in-domain data significantly outperform their zero-shot performance. The finetuned model benefits from leveraging both speech signal features and transcripts. These findings highlight the challenges for fine-grained information access in long-form spoken media.
- Decoding human speech from neural signals is essential for brain–computer interface (BCI) technologies that aim to restore speech in populations with neurological deficits. However, it remains a highly challenging task, compounded by the scarce availability of neural signals with corresponding speech, data complexity, and high dimensionality. Here we present a novel deep learning-based neural speech decoding framework that includes an ECoG decoder that translates electrocorticographic (ECoG) signals from the cortex into interpretable speech parameters and a novel differentiable speech synthesizer that maps speech parameters to spectrograms. We have developed a companion speech-to-speech auto-encoder consisting of a speech encoder and the same speech synthesizer to generate reference speech parameters to facilitate the ECoG decoder training. This framework generates natural-sounding speech and is highly reproducible across a cohort of 48 participants. Our experimental results show that our models can decode speech with high correlation, even when limited to only causal operations, which is necessary for adoption by real-time neural prostheses. Finally, we successfully decode speech in participants with either left or right hemisphere coverage, which could lead to speech prostheses in patients with deficits resulting from left hemisphere damage.
- We investigated the generalizability of language-based analytics models across two collaborative problem solving (CPS) tasks: an educational physics game and a block programming challenge. We analyzed a dataset of 95 triads (N=285) who used videoconferencing to collaborate on both tasks for an hour. We trained supervised natural language processing classifiers on automatic speech recognition transcripts to predict the human-coded CPS facets (skills) of constructing shared knowledge, negotiation / coordination, and maintaining team function. We tested three methods for representing collaborative discourse: (1) deep transfer learning (using BERT), (2) n-grams (counts of words/phrases), and (3) word categories (using the Linguistic Inquiry and Word Count [LIWC] dictionary). We found that the BERT and LIWC methods generalized across tasks with only a small degradation in performance (Transfer Ratio of .93, with 1 indicating perfect transfer), while the n-grams had limited generalizability (Transfer Ratio of .86), suggesting overfitting to task-specific language. We discuss the implications of our findings for deploying language-based collaboration analytics in authentic educational environments.
- Within Dialogue Modeling research in AI and NLP, considerable attention has been spent on “dialogue state tracking” (DST), which is the ability to update the representations of the speaker’s needs at each turn in the dialogue by taking into account the past dialogue moves and history. Less studied but just as important to dialogue modeling, however, is “common ground tracking” (CGT), which identifies the shared belief space held by all of the participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and “questions under discussion” (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.
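To make the pipeline in the last item above more concrete, the sketch below shows one way such a system could be assembled: a small classifier fuses per-utterance multimodal features to predict a common-ground move, and a toy closure rule promotes accepted propositions into the shared belief set. The feature dimensions, move labels, and rule are illustrative assumptions, not the paper's model or belief axioms.

```python
# Minimal illustrative sketch (assumptions throughout, not the paper's model):
# fuse per-utterance multimodal features (text embedding, prosody, gesture) into a
# classifier over common-ground "moves", then apply a toy closure rule that adds a
# proposition to the common ground once it has been stated and subsequently accepted.
import torch
import torch.nn as nn

MOVES = ["STATEMENT", "ACCEPT", "DOUBT"]  # assumed label set for illustration

class MoveClassifier(nn.Module):
    def __init__(self, text_dim=768, prosody_dim=16, gesture_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + prosody_dim + gesture_dim, 128),
            nn.ReLU(),
            nn.Linear(128, len(MOVES)),
        )

    def forward(self, text_emb, prosody, gesture):
        return self.net(torch.cat([text_emb, prosody, gesture], dim=-1))

def update_common_ground(common_ground, stated, move, proposition):
    """Toy closure rule: a proposition enters the common ground when it has been
    stated and is then accepted; a doubt leaves it out of the shared belief set."""
    if move == "STATEMENT":
        stated.add(proposition)
    elif move == "ACCEPT" and proposition in stated:
        common_ground.add(proposition)
    return common_ground, stated

# Toy usage with random features standing in for one utterance.
clf = MoveClassifier()
logits = clf(torch.randn(1, 768), torch.randn(1, 16), torch.randn(1, 8))
move = MOVES[logits.argmax(dim=-1).item()]
cg, stated = update_common_ground(set(), set(), move, "red block > blue block")
```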