Title: A deep dive into microphone hardware for recording collaborative group work
Classroom environments are challenging for artificially intelligent agents primarily because classroom noise dilutes the interpretability and usefulness of gathered data. This problem is exacerbated when groups of students participate in collaborative problem solving (CPS). Here, we examine how well six popular microphones capture audio from individual groups. Because a primary use of audio data is automatic speech recognition (ASR), we evaluate our recordings by examining the accuracy of downstream ASR using the Google Cloud Platform. We simultaneously captured audio from all microphones for 11 unique groups of three participants, who first read a prepared script and then took part in a collaborative problem solving exercise. We varied participants, noise conditions, and speech contexts. Transcribed speech was evaluated using word error rate (WER). We find that scripted speech is transcribed with a surprisingly high degree of accuracy across groups (average WER = 0.114, SD = 0.044). However, the CPS task was much more difficult (average WER = 0.570, SD = 0.143). We found that most microphones were robust to background noise below a certain threshold, but the AT-Cardioid and ProCon microphones were more robust to higher noise levels. Finally, an analysis of errors revealed that most errors were due to the ASR missing words or phrases, rather than mistranscribing them. We conclude with recommendations based on our observations.
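As a rough illustration of the metric named in the abstract, the sketch below computes word error rate as (substitutions + deletions + insertions) divided by the number of reference words, using a standard word-level edit distance. It is not the authors' evaluation pipeline: the function name `wer_breakdown`, the lowercase and whitespace tokenization, and the example sentences are all assumptions made for illustration. The breakdown into error types mirrors the abstract's distinction between missed words (deletions) and mistranscribed words (substitutions).

```python
from typing import Tuple


def wer_breakdown(reference: str, hypothesis: str) -> Tuple[float, int, int, int]:
    """Return (WER, substitutions, deletions, insertions).

    Hypothetical helper: tokenizes by lowercasing and splitting on whitespace,
    then aligns the two word sequences with a Levenshtein dynamic program.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = (edits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match, no new error
                continue
            e, s, d, n = cost[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Keep the cheapest path; ties are broken in the order listed.
            cost[i][j] = min(substitution, deletion, insertion, key=lambda t: t[0])

    edits, subs, dels, ins = cost[-1][-1]
    return edits / len(ref), subs, dels, ins


if __name__ == "__main__":
    # Made-up example: the recognizer drops two words and changes none.
    print(wer_breakdown("we should move the block to the left",
                        "we should move block to left"))
    # (0.25, 0, 2, 0)
```

In this made-up example both errors are deletions, which is the pattern the abstract reports as dominant. Note that WER can exceed 1.0 when the hypothesis contains many insertions.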
Award ID(s):
2019805
NSF-PAR ID:
10497216
Editor(s):
Mitrovic, Antonija; Bosch, Nigel
Publisher / Repository:
Zenodo
Journal Name:
Educational Data Mining Conference
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It remains an open question how to build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy in the video modality remained relatively constant at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.
  2. Abstract  
  3. We investigated the feasibility of using automatic speech recognition (ASR) and natural language processing (NLP) to classify collaborative problem solving (CPS) skills from recorded speech in noisy environments. We analyzed data from 44 dyads of middle and high school students who used videoconferencing to collaboratively solve physics and math problems (35 and 9 dyads in school and lab environments, respectively). Trained coders identified seven cognitive and social CPS skills (e.g., sharing information) in 8,660 utterances. We used a state-of-the-art deep transfer learning approach for NLP, Bidirectional Encoder Representations from Transformers (BERT), with a special input representation enabling the model to analyze adjacent utterances for contextual cues. We achieved a micro-average AUROC score (across seven CPS skills) of .80 using ASR transcripts, compared to .91 for human transcripts, indicating a decrease in performance attributable to ASR error. We found that the noisy school setting introduced additional ASR error, which reduced model performance (micro-average AUROC of .78) compared to the lab (AUROC = .83). We discuss implications for real-time CPS assessment and support in schools.
  4. INTRODUCTION: Apollo-11 (A-11) was the first manned space mission to successfully bring astronauts to the moon and return them safely. Effective team-based communication is required for mission specialists to work collaboratively to learn, engage, and solve complex problems. As part of NASA’s goal in assessing team and mission success, all vital speech communications between these personnel were recorded using the multi-track SoundScriber system onto analog tapes, preserving their contribution to the success of one of the greatest achievements in human history. More than 400 personnel served as mission specialists/support who communicated across 30 audio loops, resulting in more than 9,000 hours of data for A-11. To ensure the success of this mission, it was necessary for teams to communicate, learn, and address problems in a timely manner. Previous research has found that compatibility of individual personalities within teams is important for effective team collaboration. Hence, it is essential to identify each speaker’s role during an Apollo mission and analyze group communications for knowledge exchange and problem solving to achieve a common goal. Assessing and analyzing speaker roles during the mission can allow for exploring engagement analysis for multi-party speaker situations. METHOD: The UTDallas Fearless Steps Apollo data comprises 19,000 hours (A-11, A-13, A-1) and poses multiple unique challenges, as it is characterized by severe noise and degradation as well as overlap instances over the 30 channels. For our study, we have selected a subset of 100 hours manually transcribed by professional annotators for speaker labels. The 100 hours are obtained from three mission-critical events: (1) Lift-Off (25 hours), (2) Lunar-Landing (50 hours), and (3) Lunar-Walking (25 hours).
Of the 30 channels, the five with the most speech activity were selected as channels of interest; the primary speakers operating these five channels are the command/owners of these channels. For our analysis, we select five speaker roles: Flight Director (FD), Capsule Communicator (CAPCOM), Guidance, Navigation, and Control (GNC), Electrical, Environmental, and Consumables Manager (EECOM), and Network (NTWK). To track and tag individual speakers across our Fearless Steps audio dataset, we use the concept of ‘where’s Waldo’ to identify all instances of our speakers-of-interest across a cluster of other speakers. Also, to understand the roles of our speakers-of-interest, we use the speaking duration of primary versus secondary speakers and speaker turns as our metrics to determine the role of each speaker and to understand their responsibility during the three critical phases of the mission. This enables a content linking capability and provides a pathway to analyzing group engagement, group dynamics of people working together in an enclosed space, psychological effects, and cognitive analysis in such individuals. IMPACT: NASA’s Apollo Program stands as one of the most significant contributions to humankind. This collection opens new research options for recognizing team communication, group dynamics, and human engagement/psychology for future deep space missions. Analyzing team communications to achieve such goals would allow for the formulation of educational and training technologies for assessment of STEM knowledge, task learning, and educational feedback. Also, identifying these personnel can help pay tribute and yield personal recognition to the hundreds of notable engineers and scientists who made this feat possible. ILLUSTRATION: In this work, we propose to illustrate how a pre-trained speech/language network can be used to obtain powerful speaker embeddings needed for speaker diarization.
This framework is used to build these learned embeddings to label unique speakers over sustained audio streams. To train and test our system, we will make use of the Fearless Steps Apollo corpus, allowing us to effectively leverage a limited label information resource (100 hours of labeled data out of more than 9,000 hours). Furthermore, we use the concept of 'Finding Waldo' to identify key speakers of interest (SOI) throughout the Apollo-11 mission audio across multiple channel audio streams.
  5. Speech and language development in children is crucial for ensuring effective skills in their long-term learning ability. A child’s vocabulary size at the time of entry into kindergarten is an early indicator of their ability to learn to read and of their potential long-term success in school. The preschool classroom is thus a promising venue for assessing growth in young children by measuring their interactions with teachers as well as classmates. However, to date, few studies have explored such naturalistic audio communications. Automatic Speech Recognition (ASR) technologies provide an opportunity for early childhood researchers to obtain knowledge through automatic analysis of naturalistic classroom recordings in measuring such interactions. For this purpose, 208 hours of audio recordings across 48 daylong sessions were collected in a childcare learning center in the United States using Language Environment Analysis (LENA) devices worn by the preschool children. Approximately 29 hours of adult speech and 26 hours of child speech were segmented using manual transcriptions provided by the CRSS transcription team. Traditional as well as end-to-end ASR models were trained on the adult/child speech data subset. A Factorized Time Delay Neural Network provides the best word error rate (WER) of 35.05% on the adult subset of the test set. End-to-end transformer models achieve 63.5% WER on the child subset of the test data. Next, bar plots demonstrating the frequency of WH-question words in Science vs. Reading activity areas of the preschool are presented for sessions in the test set. It is suggested that learning spaces could be configured to encourage greater adult-child conversational engagement given such speech/audio assessment strategies.