<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>How Noisy is Too Noisy? The Impact of Data Noise on Multimodal Recognition of Confusion and Conflict During Collaborative Learning</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>10/09/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10496733</idno>
					<idno type="doi">10.1145/3577190.3614127</idno>
					<title level='j'>Proceedings of the 25th International Conference on Multimodal Interaction</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yingbo Ma</author><author>Mehmet Celepkolu</author><author>Kristy Elizabeth Boyer</author><author>Collin F. Lynch</author><author>Eric Wiebe</author><author>Maya Israel</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question of how we can build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy remained relatively constant in the video modality at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Collaborative learning refers to two or more learners working together to solve a problem, complete a task, or create a shared product <ref type="bibr">[23]</ref>. During collaborative activities, learners face cognitive and social challenges and require timely support <ref type="bibr">[33]</ref>. One solution for providing such support is to develop intelligent systems for collaborative learning, which can scaffold productive learning and support effective collaboration among learners <ref type="bibr">[27]</ref>. These systems rely on accurate modeling of learners' collaborative interactions using their behavioral data from multiple modalities, such as language, audio, and video <ref type="bibr">[5]</ref>. Each modality provides unique insights into the collaborative learning process, and combining them may lead to improved accuracy in modeling collaborative learning <ref type="bibr">[32]</ref>.</p><p>In a traditional classroom setting, learners engage in conversation via verbal communication (e.g., verbal affirmations) and through para-verbal cues (e.g., change in speech prosody, facial expressions) <ref type="bibr">[20]</ref>. However, the noise that often accompanies these behavioral data makes it challenging for intelligent systems to analyze and understand learners' dialogues <ref type="bibr">[22,</ref><ref type="bibr">36]</ref>. For example, it is difficult to separate learners' speech from background audio noise given the presence of ambient sound or chatter in the surroundings <ref type="bibr">[26]</ref>. Furthermore, even when learners' speech can be successfully separated from background audio noise, language errors can still arise from imperfect speech recognition caused by dropped audio, misrecognition of words due to slang or dialect, and young learners' distinct vocal characteristics <ref type="bibr">[4,</ref><ref type="bibr">46]</ref>. 
In addition, learners' faces may be mistracked when they are not directly facing the camera or in the case of occlusion <ref type="bibr">[14,</ref><ref type="bibr">48]</ref>. When intelligent systems integrate these noisy data, they may not accurately model the complex dynamics of learners' collaboration processes. By understanding the extent to which these kinds of data noise affect modeling accuracy, we can obtain valuable insights for implementing intelligent systems that model collaborative behavior in actual classroom settings.</p><p>This paper takes a first step toward addressing the challenge of modeling collaborative learning behavior from error-prone data streams by investigating the impact of language error, audio noise, and facial mistracking on detecting two important moments during collaborative learning: confusion and conflict. We chose to model confusion and conflict because they represent critical moments when learners face cognitive and social challenges during collaborative learning. Confusion is an important cognitive-affective state that may emerge when learners face cognitive challenges during collaborative learning <ref type="bibr">[38]</ref>; conflict is another, which captures a state of disagreement or opposition between two or more learners <ref type="bibr">[1]</ref>. By automatically detecting conflict and confusion during collaboration, intelligent systems can offer timely assistance to help learners work through the confusion and conflict and improve their learning experience.</p><p>Our dataset includes the audio and video recordings of 25 paired elementary school learners working on a series of coding tasks. Specifically, we investigate two research questions (RQs):</p><p>&#8226; RQ 1: To what extent do language errors, audio noise, and facial mistracking impact the accuracy of modeling confusion and conflict during collaborative learning? 
&#8226; RQ 2: To what extent does integrating language, audio, and facial behaviors compensate for the negative impact of unimodal data noise?</p><p>We measure language errors with word error rate (WER) <ref type="bibr">[6]</ref>, audio noise with posterior speech-to-noise ratio (SNR) <ref type="bibr">[47]</ref>, and video errors with the number of mistracked faces. To answer RQ1, we compared model performance on text language, audio, and facial behaviors across levels of the error metrics described above. We found a WER threshold of around 20% in text language, beyond which the model accuracy for detecting confusion and conflict started to decrease drastically (overall accuracy from 0.84 to 0.73), with F-1 scores dropping from 0.55 to 0.40 for confusion and from 0.53 to 0.37 for conflict.</p><p>For audio, the model's accuracy decreased sharply from 0.79 to 0.61 when the signal-to-noise ratio (SNR) dropped below 5 dB, with F-1 scores dropping from 0.55 to 0.40 for confusion and from 0.53 to 0.37 for conflict. For facial behaviors, compared to when both learners' faces were tracked, where the accuracy was 0.79, the model's accuracy slightly decreased to 0.71 when only the listener's face was tracked, with F-1 scores dropping from 0.51 to 0.44 for confusion and from 0.41 to 0.33 for conflict. To answer RQ2, we compared the performance of several multimodal models trained with model-agnostic and model-based fusion methods, and we found that combining multimodal data from language, audio, and facial features can effectively compensate for the negative impact of unimodal data noise, ultimately leading to improved accuracy in recognizing confusion and conflict.</p><p>To the best of our knowledge, this work is the first to investigate the impact of data noise on the multimodal modeling of collaborative dialogues. 
The contributions of this paper are twofold: First, we provide extensive experimental results demonstrating the impact of language errors, audio noise, and facial mistracking errors on modeling collaborative dialogues. Second, we provide empirical evidence for the combination of multimodal data as an effective means of compensating for the negative impact of data noise present in a single modality. These findings provide valuable insights for the future implementation of these models in actual classroom settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data Noise</head><p>Noisy data may occur due to dynamic environmental conditions, faulty detectors, or other unavoidable quality degradation in the measurements <ref type="bibr">[13]</ref>. The impact of data noise has been studied in various domains, such as audio-visual speech recognition <ref type="bibr">[36]</ref>, risk prediction <ref type="bibr">[15]</ref>, object tracking <ref type="bibr">[16]</ref>, and medical diagnosis <ref type="bibr">[59]</ref>. For example, in the domain of audio-visual speech recognition, Papandreou et al. <ref type="bibr">[36]</ref> studied the impact of data noise under challenging conditions, such as when the visual front-end momentarily mistracks the speaker's face or in noisy acoustic environments. In the field of risk prediction, Heo et al. <ref type="bibr">[15]</ref> investigated the effects of data noise on the clinical risk prediction of electronic health records, which frequently suffer from varying degrees of noise and missing entry problems. Medical diagnosis is yet another domain that is prone to data noise, as highlighted by Reyes-Garcia et al. <ref type="bibr">[40]</ref>, who evaluated the impact of missing values of vital biometrics, such as arterial blood pressure and heart rate, on the early prediction of patients' physiological deterioration. The results of these studies suggest that data noise can negatively impact data analysis and lead to incorrect conclusions, unreliable predictions, or flawed decision-making. It is therefore important to develop effective methods for handling data noise to ensure the accuracy and reliability of the results obtained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Multimodal Modeling of Collaborative Learning</head><p>Prior research on multimodal modeling of collaborative learning has analyzed learners' interactions across multiple modalities, with the combination of speech, facial expressions, body gestures, and physiological data <ref type="bibr">[25,</ref><ref type="bibr">51,</ref><ref type="bibr">58]</ref>. Analyzing collaborative learning processes by combining multiple modalities of data has shown great promise in building more accurate models. Olsen et al. <ref type="bibr">[34]</ref> combined learners' audio, eye gaze, and tutor logs to predict collaborative learning outcomes. Vrzakova et al. <ref type="bibr">[51]</ref> analyzed multimodal data, including screen capture, speech, and body movements as triads engaged in a collaborative programming task. Moulder et al. <ref type="bibr">[30]</ref> modeled how students' multimodal dynamics (e.g., emotional, verbal communication, physiological) were influenced by each other while engaged in collaborative problem-solving. Eloy et al. <ref type="bibr">[9]</ref> used speech rate, body movement, and galvanic skin response to model triads' emotional valence and task performance while collaborating on solving physics games. Although the above-mentioned literature has highlighted the improved accuracy of multimodal modeling over unimodal modeling, the impact of noise and missing values in different modalities, and their collective impact when multimodal data is integrated, has not been investigated in research on modeling collaborative learning. Our study aims to fill this gap.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DATASET</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Participants and Collaborative Learning Tasks</head><p>Our dataset consists of audio and video data from 25 pairs in fourth-grade classrooms in an elementary school in the southeastern United States. The dataset was collected in the spring of 2022. These learners had an average age of 10, with 21 of them reporting their gender as girls, 12 as boys, and five preferring not to answer. Learners collaborated on a series of coding activities in which they learned fundamental CS concepts such as variables, conditionals, and loops using a block-based learning environment built upon Snap! <ref type="bibr">[45]</ref>. The learners followed the pair programming paradigm, in which each pair (or dyad) shared one computer and switched roles between "driver" and "navigator" during the science-simulation coding activity. The driver is responsible for writing the code and implementing the solution, while the navigator provides support by catching mistakes and providing feedback on the in-progress solution <ref type="bibr">[7]</ref> (see Fig. <ref type="figure">1</ref>, top). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Manual Annotation of Confusion and Conflict Dialogues</head><p>In line with prior work on analyzing confusion and conflict dialogues during collaborative learning <ref type="bibr">[42,</ref><ref type="bibr">50]</ref>, the process of annotating confusion and conflict was based on textual transcripts, with video used in rare cases of unresolvable ambiguity within the transcripts. The dialogue act taxonomy draws upon a closely related dialogue act taxonomy by Zakaria et al. <ref type="bibr">[57]</ref> that was designed for elementary school learners' classroom dialogues. Table <ref type="table">1</ref> shows example excerpts and descriptions for confusion and conflict.</p><p>To establish the reliability of the dialogue act labeling, two annotators first engaged in a training phase where they collaboratively applied the dialogue act taxonomy and discussed any disagreements. Once training was complete, they independently tagged an overlapping 20% of the data, reaching a Cohen's kappa score of 0.816, indicating strong agreement. They then proceeded to divide and tag the remaining data independently. Among a total of 9,943 transcribed utterances, 467 (4.7%) were labeled as confusion, 924 (9.3%) as conflict, and 8,552 (86.0%) as other.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DATA NOISE MEASUREMENT</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Errors in Text Language</head><p>We measured noise in language by word error rate (WER) <ref type="bibr">[43]</ref>. WER is given by WER = (S + D + I) / N, where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the number of words in the human transcript.</p><p>For each human-transcribed utterance, additive noise was manually generated by randomly substituting, deleting, or inserting words, to create five different noisy transcripts with a WER of 0.1, 0.2, 0.3, 0.4, and 0.5. In this study, we initially tested several transcription engines in an effort to obtain WERs low enough for our study. We experimented with both commercial engines and open-source engines (e.g., Google Speech-to-Text <ref type="bibr">[49]</ref> and OpenAI Whisper <ref type="bibr">[55]</ref>), and these models all generated overall high WERs (above 0.70). Consequently, in order to investigate data with lower WER, we decided to manually introduce a controlled amount of textual noise. The manual perturbation steps for each utterance are: (1) randomly select an action (i.e., substitution, deletion, or insertion); (2) randomly select a word within the given utterance; (3) if the action is deletion, delete the word selected in the last step; if the action is substitution, randomly select another word in the given utterance, then substitute the word with the word selected in the last step; if the action is insertion, randomly select a position in the given utterance, then insert the word selected in the last step at the selected position.</p></div>
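<div xmlns="http://www.tei-c.org/ns/1.0"><p>The WER definition and the perturbation steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the whitespace tokenization, the seeding, and the choice of round(target WER times N) edits per utterance are all assumptions. In particular, the substitution step is read here as replacing the selected word with another word drawn from the same utterance.</p>

```python
import random


def word_error_rate(reference, hypothesis):
    """WER = (S + D + I) / N, computed with word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)


def perturb_once(words, rng):
    """Apply one random substitution, deletion, or insertion (steps 1 to 3)."""
    words = list(words)
    if not words:
        return words
    action = rng.choice(["substitution", "deletion", "insertion"])  # step 1
    index = rng.randrange(len(words))                               # step 2
    if action == "deletion":
        del words[index]
    elif action == "substitution":
        # Assumed reading: replace the selected word with another word
        # taken from the same utterance.
        others = [w for k, w in enumerate(words) if k != index]
        if others:
            words[index] = rng.choice(others)
    else:
        # Insert a copy of the selected word at a random position.
        words.insert(rng.randrange(len(words) + 1), words[index])
    return words


def perturb_to_target(utterance, target_wer, seed=0):
    """Apply round(target_wer * N) random edits (a hypothetical stopping rule)."""
    rng = random.Random(seed)
    words = utterance.split()
    for _ in range(round(target_wer * len(words))):
        words = perturb_once(words, rng)
    return " ".join(words)
```

<p>Because random edits can cancel each other out or leave the text unchanged, the measured WER of a perturbed utterance can fall slightly below the target, so in practice one would likely re-check it with word_error_rate and resample if needed.</p></div>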
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Noise in Audio</head><p>We measured noise in the audio modality by using the posterior signal-to-noise ratio (SNR) <ref type="bibr">[47]</ref>, which is the ratio of signal power to noise power, often expressed in decibels (dB). An SNR value of 0 dB indicates the same power of signal and noise. We calculated SNR(t) = 10 log10(E_y(t) / E_n(t)), where E_y(t) is the energy of noisy speech of audio frame t, and E_n(t) is the energy of noise of frame t. We estimated E_n(t) by averaging the E_y(t) of each silent audio frame (identified by Silero <ref type="bibr">[44]</ref>, a pre-trained voice activity detection model), then used the averaged E_n(t) to calculate SNR(t) for each speech audio segment (an SNR of +15 dB or above indicates good speech quality). The overall average SNR of our dataset is +1.3 dB. Table <ref type="table">2</ref> shows the number of audio segments in the dataset that fall into different SNR classes. </p></div>
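<div xmlns="http://www.tei-c.org/ns/1.0"><p>The SNR estimate described above can be illustrated with a short Python sketch. It assumes that frame energy is the sum of squared samples, that the voice activity detector has already separated silent frames from speech frames, and that a segment's SNR is taken from the mean energy of its frames. The function names are hypothetical and the paper does not specify these details.</p>

```python
import math


def frame_energy(samples):
    """Energy of one audio frame, taken here as the sum of squared samples."""
    return sum(s * s for s in samples)


def segment_snr_db(speech_frame_energies, silent_frame_energies):
    """Posterior SNR in dB for one speech segment.

    The noise energy is estimated by averaging the energy of the silent
    frames; the noisy-speech energy is the mean energy of the segment's
    frames (an assumption about how frames are pooled).
    """
    noise_energy = sum(silent_frame_energies) / len(silent_frame_energies)
    speech_energy = sum(speech_frame_energies) / len(speech_frame_energies)
    return 10.0 * math.log10(speech_energy / noise_energy)
```

<p>Under this sketch, a segment whose frames carry ten times the energy of the silent frames comes out at +10 dB, and equal energies give 0 dB, matching the interpretation of 0 dB given in the text.</p></div>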
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Mistracked Faces in Video</head><p>We partitioned the face recognition data into segments and classified the segments into one of four cases: (1) Both learners' faces were tracked. This is the optimal condition; at the beginning of each dyad's learning session, their laptop was placed between them to track both their faces. (2) Only the speaker's face was tracked. This error may happen when learners adjust the direction of the laptop or their body positions during the collaboration process. (3) Only the listener's face was tracked. This error may happen for the same reason as condition 2. (4) Neither of the learners' faces was tracked. This error may happen when both learners are out of the camera's view (e.g., disengaged and talking to other classmates) or in the presence of occlusion. We used the OpenFace 2.0 facial behavior analysis toolkit <ref type="bibr">[35]</ref> to count how many times each of the four conditions occurs in our dataset. OpenFace supports automatic facial recognition by generating the number of detected face_ids as well as the location of the head on the horizontal axis, pose_Tx, for every video frame. For condition 1, the number of face_ids is 2; for condition 2 and condition 3, the number of face_ids is 1; for condition 4, the number of face_ids is 0. We then used pose_Tx to differentiate condition 2 from condition 3. Table <ref type="table">3</ref> shows the number of video segments in the dataset that fall into each face-tracking condition. </p></div>
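<div xmlns="http://www.tei-c.org/ns/1.0"><p>The four face-tracking conditions can be expressed as a small classification rule. The sketch below is illustrative only. It assumes each frame is summarized as a list of pose_Tx values, one per detected face, that the speaker's seat side is known from another source, and that the sign of pose_Tx indicates on which side of the camera a face sits. The paper states only that pose_Tx was used to separate conditions 2 and 3, so the sign rule and the names here are hypothetical.</p>

```python
from collections import Counter


def tracking_condition(pose_tx_values, speaker_side):
    """Map one frame to face-tracking condition 1, 2, 3, or 4.

    pose_tx_values: horizontal head positions, one per detected face.
    speaker_side: "left" or "right" (assumed to be known for the frame).
    """
    n_faces = len(pose_tx_values)
    if n_faces >= 2:
        return 1  # both learners' faces tracked
    if n_faces == 0:
        return 4  # neither face tracked
    # Hypothetical rule: non-negative pose_Tx means the right-hand seat.
    detected_side = "right" if pose_tx_values[0] >= 0 else "left"
    return 2 if detected_side == speaker_side else 3


def count_conditions(frames, speaker_side):
    """Count how often each condition occurs over a sequence of frames."""
    return Counter(tracking_condition(f, speaker_side) for f in frames)
```

<p>For example, a frame with a single face at a negative pose_Tx would be counted as condition 2 when the speaker sits on the left and as condition 3 when the speaker sits on the right.</p></div>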
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">FEATURE EXTRACTION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Language Features</head><p>To extract linguistic features, we represented each spoken utterance with TF-IDF, BERT, and RoBERTa. Given that signal words or phrases (e.g., "confuse", "do not", "not know") appear frequently when learners express their confusion or conflict, we generated TF-IDF embeddings. We also used BERT and RoBERTa to generate the semantic embedding of each utterance. RoBERTa is a variant of BERT trained on longer sequences. RoBERTa dynamically changes the masking pattern applied to the training data and has outperformed BERT on a series of language processing tasks, such as machine translation and question answering <ref type="bibr">[24]</ref>. We used the Hugging Face bert-base-uncased <ref type="bibr">[3]</ref> and xlm-roberta-base <ref type="bibr">[41]</ref> models to generate a 768-dimensional language embedding for each utterance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Audio Features</head><p>We used openSMILE, an open-source automatic acoustic feature extraction toolkit <ref type="bibr">[11]</ref>, to extract acoustic-prosodic indicators with a 20 ms frame and a window shift of 10 ms. For each 20 ms of audio, 25 eGeMAPS <ref type="bibr">[10]</ref> low-level descriptors (LLDs) were generated, including loudness (1 feature), pitch (10 features), and mel frequency cepstral coefficients (MFCCs) (4 features). Following an established downsampling strategy <ref type="bibr">[37]</ref>, we averaged LLDs every 2 consecutive frames. Apart from acoustic-prosodic LLDs, we also experimented with Wav2Vec, a transformer-based audio embedding network that has shown state-of-the-art performance on a series of speech-related tasks, such as speech recognition <ref type="bibr">[52]</ref> and speech emotion recognition <ref type="bibr">[37]</ref>. We used a Wav2Vec-based model <ref type="bibr">[54]</ref> to generate a 768-dimensional audio embedding for each audio segment. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Video Features</head><p>In this paper, we used OpenFace, which supports accurate facial landmark detection, head pose estimation, eye-gaze direction estimation, and facial action unit (AU) recognition for videos containing a single face or multiple faces <ref type="bibr">[29]</ref>. We used the multiple faces mode to extract three visual features from the video modality: eye gaze, head pose, and facial action units (AUs). For each detected face in each video frame, OpenFace generated a 120-dimensional eye gaze vector (112 eye landmarks, 6 eye direction vectors, 2 eye directions in radians), a 6-dimensional head position vector which represents the location of the head with respect to the camera, and a 35-dimensional facial AU vector, including 17 facial AU intensity (0 to 5) features and 18 facial AU presence (0-absence or 1-presence) features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">EXPERIMENTS AND RESULTS</head><p>Given that the feature space (relative to the dataset used in this study) is large, we first reduced the feature space by identifying and removing weakly relevant or irrelevant features. To do this, we provided every feature extracted from the language, audio, and video modalities as the input to a multilayer-perceptron (MLP) classifier with an embedding size of 128 for two linear layers; two dropout layers with a rate of 0.5 were added to each linear layer to alleviate over-fitting. The Softmax activation function was used in the last output layer to output the classification results. We used the synthetic minority oversampling technique (SMOTE) <ref type="bibr">[8]</ref> to mitigate the negative influence of the imbalanced label distribution of confusion (4.7%) and conflict (9.3%) within our dataset. SMOTE was only performed on the training set, and class distributions for the validation and testing sets were left unchanged. We conducted five-fold cross-validation to train and validate the models. We used an Adam optimizer <ref type="bibr">[21]</ref> with a learning rate of 1 &#215; 10^-3 to train the classification model for up to 100 epochs. We evaluated the trained model using an F-1 score <ref type="bibr">[12]</ref> computed from precision and recall for each of the three classes. Table <ref type="table">4</ref> shows the confusion and conflict classification performance of models trained on each of the single features. From the results in the table, we identified the best-performing unimodal features in each modality: fine-tuned RoBERTa, Wav2Vec, and facial AUs in the language, audio, and video modality, respectively.</p></div>
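<div xmlns="http://www.tei-c.org/ns/1.0"><p>The per-class F-1 evaluation mentioned above can be illustrated with a few lines of Python. This is a generic sketch of the standard definition, F-1 = 2PR / (P + R), with class labels and the function name chosen for illustration. It also shows why per-class scores are informative under the label imbalance reported for this dataset: a model can reach high overall accuracy on the majority class while scoring poorly on confusion or conflict.</p>

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F-1} where F-1 = 2 * P * R / (P + R) for each class."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy example: 3 of 4 predictions are correct, yet conflict is never detected.
truth = ["confusion", "conflict", "other", "other"]
preds = ["confusion", "other", "other", "other"]
f1 = per_class_f1(truth, preds, ["confusion", "conflict", "other"])
```

<p>In the toy example the overall accuracy is 0.75, while the F-1 score for conflict is 0 because the single conflict utterance was predicted as other.</p></div>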
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Investigating the Impact of Unimodal Data Noise</head><p>We first consider RQ1: To what extent do language errors, audio noise, and facial mistracking impact the accuracy of modeling confusion and conflict during collaborative learning? For different data noise levels within each modality, we built and compared the performance of several supervised unimodal models trained on the best-performing feature identified in Table <ref type="table">4</ref>; we used the same experimental setup and training strategies as described above.</p><p>6.1.1 Impact of Word Error Rate. Figure <ref type="figure">2</ref> shows the performance of unimodal models using language-derived features under different WER levels. In the figure, as WER increased from 0 to 0.2, the overall accuracy, as well as the F-1 scores for both confusion and conflict, remained relatively stable. When WER increased from 0.2 to 0.3, the performance suffered a drastic degradation, with the overall accuracy decreasing from 0.84 to 0.73, and the F-1 scores decreasing from 0.55 to 0.40 for confusion and from 0.53 to 0.37 for conflict.</p><p>Figure <ref type="figure">2</ref>: Unimodal models using language-derived features</p><p>6.1.2 Impact of Audio Noise. Figure <ref type="figure">3</ref> shows the performance of unimodal models both trained and tested using audio-derived features under different SNR levels. As shown in the figure, the performance was very sensitive to noise, with performance decreasing sharply as the noise level increased. From 5 &lt; SNR to 0 &lt; SNR &lt; 5, overall accuracy decreased from 0.79 to 0.61; the F-1 scores for confusion decreased from 0.39 to 0.19, and for conflict decreased from 0.48 to 0.36. Performance continued to decrease drastically as SNR declined.</p><p>Figure <ref type="figure">4</ref> shows the performance of unimodal models both trained and tested using video-derived features under different facial tracking conditions. 
We considered four conditions: (1) both learners' faces were tracked; (2) only the speaker's face was tracked; (3) only the listener's face was tracked; and (4) neither of the learners' faces was tracked. As shown in the figure, the model yielded the highest accuracy of 0.79 when both learners' faces were successfully tracked (condition 1). Surprisingly, the model performed comparably even when only the listener's face was tracked (condition 3): accuracy degraded to 0.71, and F-1 scores declined from 0.51 to 0.44 for confusion and from 0.41 to 0.33 for conflict between condition 1 and condition 3. We did not investigate condition 4 because there were no facial features generated by OpenFace when both learners' faces were missing. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Examining the Performance of Multimodal Modeling</head><p>We next consider RQ2: To what extent does integrating language, audio, and facial behaviors compensate for the negative impact of unimodal data noise? To answer this question, we built several supervised multimodal models trained on the fusion of best-performing features in the language, audio, and video modalities. Figure <ref type="figure">5</ref> shows the overview of the multimodal model architecture. The other multimodal models followed the same structure with a subset of the language, audio, and video modalities.</p><p>In this study, we experimented with two types of feature fusion methods <ref type="bibr">[2]</ref>: model-agnostic and model-based. For model-agnostic methods, we used early and late fusion. Early fusion concatenated unimodal features after applying z-score normalization. Late fusion trained separate models with separate unimodal features and calculated the numerical average of their outputs. For model-based methods, we experimented with two neural-network-based fusion approaches: tensor fusion <ref type="bibr">[56]</ref> and cross-attention fusion <ref type="bibr">[31]</ref>. Tensor fusion transforms multimodal features into a 3D feature tensor, while cross-attention fusion uses a shared transformer encoder to attend to different modalities.</p><p>We then set noise thresholds for selecting subsets of the data where the accuracy of each unimodal model decreased drastically in association with noise. We selected audio segments with a signal-to-noise ratio (SNR) of less than +5 dB, as the model's accuracy drastically decreased beyond this point. Similarly, we used the video segments that failed to track at least one learner's face, as model accuracy was the lowest in this face-tracking condition. 
Then, we introduced extra modalities to these baseline unimodal models to examine whether multimodal models outperform the unimodal baselines trained with noisy data from each single modality.</p><p>First, we tested the impact of integrating multimodal data on compensating for noisy language using a baseline model with a WER of 0.3. The baseline model's accuracy was 0.73, with F-1 scores of 0.40 for confusion and 0.37 for conflict. We then constructed several multimodal models (as described above) with additional audio and video data. All except the late fusion model performed better than the baseline, with higher overall accuracy. Cross-attention fusion achieved the best performance by integrating language, audio, and video data together, with the highest accuracy of 0.80, an F-1 score of 0.46 for confusion, and an F-1 score of 0.48 for conflict. Second, we tested the impact of integrating multimodal data on compensating for noisy audio. A baseline model with an SNR &lt; +5 dB achieved an accuracy of 0.61, with F-1 scores of 0.19 for confusion and 0.38 for conflict. We then constructed several multimodal models with additional language and video data, where all multimodal models performed better than the baseline, with higher overall accuracy. Finally, we tested the impact of integrating multimodal data on compensating for incomplete video by experimenting with a baseline model using video segments that mistracked at least one learner. The baseline model's accuracy was 0.73, with F-1 scores of 0.46 for confusion and 0.32 for conflict. We then constructed several multimodal models with additional language and audio data, where all multimodal models performed better than the baseline, with higher overall accuracy. Table <ref type="table">5</ref> shows the performance of multimodal models when single modalities involve data noise.</p></div>
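<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two model-agnostic fusion methods used in this section can be sketched with the Python standard library. The sketch assumes each modality is a list of per-sample feature rows and each unimodal model outputs one list of class probabilities per sample. The function names and toy data are illustrative and are not taken from the authors' code. Tensor fusion and cross-attention fusion require a neural network and are not shown.</p>

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Z-score normalize each feature column of a list of feature rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def early_fusion(*modalities):
    """Normalize each modality, then concatenate features per sample."""
    normalized = [zscore_columns(m) for m in modalities]
    return [sum(parts, []) for parts in zip(*normalized)]


def late_fusion(*model_outputs):
    """Average the class probabilities produced by separate unimodal models."""
    return [
        [mean(class_probs) for class_probs in zip(*sample)]
        for sample in zip(*model_outputs)
    ]


# Toy data: two samples, a 2-d language feature and a 1-d audio feature.
language = [[1.0, 2.0], [3.0, 4.0]]
audio = [[10.0], [20.0]]
fused = early_fusion(language, audio)

# Toy outputs of two unimodal models for one sample and two classes.
averaged = late_fusion([[0.8, 0.2]], [[0.4, 0.6]])
```

<p>Early fusion hands the concatenated vector to a single classifier, so that classifier can learn interactions between modalities. Late fusion only combines final outputs, which keeps the unimodal models independent but means a noisy modality contributes to the average with the same weight as a clean one.</p></div>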
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">DISCUSSION</head><p>This study has investigated the impact of data noise on multimodal recognition of confusion and conflict moments during collaborative learning by dyads of elementary school learners.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">The Impact of Unimodal Data Noise</head><p>7.1.1 Noisy Language. The experimental results showed that unimodal models using language-derived features trained on noisy transcripts up to a WER of 0.2 could perform comparably to models trained with clean transcripts manually generated by humans. Transcripts with a WER of 0.3 were too noisy, and the model performance suffered a drastic degradation. In another study, Southwell et al. <ref type="bibr">[46]</ref> conducted extensive experiments to compare three widely adopted ASR engines (Google, Rev.ai <ref type="bibr">[39]</ref>, and IBM Watson) on transcribing audio recordings of middle-school students engaged in small group work. The authors found that it was extremely difficult to obtain serviceable transcripts with current (2023) ASR engines, which generated overall high WER on this task (0.84 - 0.95). These findings highlight the main challenge of deploying intelligent systems to support collaboration in real-world classroom environments: obtaining serviceable transcriptions of student discourse. Indeed, on our corpus, cloud-based ASR performed similarly poorly, with transcripts generated by Google Speech-to-Text <ref type="bibr">[49]</ref> having an average WER of 0.78 (SD = 0.54), and those generated by IBM Watson <ref type="bibr">[53]</ref> having an average WER of 0.89 (SD = 0.46). This finding led to our decision to perturb manual transcripts for the purposes of experimentation.</p><p>7.1.2 Noisy Audio. Overall, the audio recordings in our dataset are noisy, with an average SNR of +1.3 dB. 
The experimental results showed that the performance of unimodal models using audio-derived features was very sensitive to noise and showed a steep degradation as soon as the audio SNR decreased below +5 dB, where the overall accuracy decreased drastically from 0.79 to 0.61, the F-1 score of confusion from 0.40 to 0.19, and the F-1 score of conflict from 0.48 to 0.39. These results suggest that audio data with an SNR of +5 dB could potentially be considered acceptable. It is challenging to collect audio of this quality in real classroom environments, where SNR usually ranges from -7 dB to +5 dB <ref type="bibr">[17]</ref>. Quality can be improved when learners use headsets and wear noise-canceling microphones close to their mouths, but a trade-off is that the headsets detract from the fluid interplay of individual, small group, and whole class discourse.</p><p>7.1.3 Incomplete Video. The experimental results showed that unimodal models using video-derived features trained with video segments tracking at least one learner's face in the pair could still perform comparably to unimodal models trained with video segments tracking both learners' faces (an accuracy of 0.71 versus 0.79). It is not surprising that when a learner expresses confusion or conflict, it can be detected through the speaker's face. However, our results found that reasonable classification accuracy can also be achieved even when only one learner's face is tracked. This finding is consistent with a recent study by J&#228;rvenoja et al. <ref type="bibr">[18]</ref> where the authors investigated how socially shared emotions emerged during collaborative learning activities. Taken together, it appears that tracking at least a sub-group of learners' faces is sufficient for recognizing a speaker's confusion and/or conflict, but a high-resolution wide-angle camera is still recommended.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">The Effect of Multimodal Modeling</head><p>This study trained several multimodal models and found that when data noise is present in a single modality, fusing information from other modalities can effectively compensate for the negative impact of unimodal data noise, ultimately leading to better accuracy in recognizing a speaker's confusion and conflict. Specifically, for noisy language, introducing additional audio and facial information could enhance model accuracy from 0.73 to 0.80. This improvement indicates that audio and video data also offer valuable information for detecting confusion and conflict. Audio and video data can reveal non-verbal cues, such as tone and facial expressions, which are often absent from text-based data. These cues provide additional context, helping the model to better understand the situation and make more accurate predictions. Similarly, for noisy audio, fusing additional language and facial information could enhance model accuracy from 0.61 to 0.88; for incomplete video, fusing additional language and audio information could improve model accuracy from 0.73 to 0.92. However, in a real data collection environment, a higher level of noise in the audio data will negatively impact both speech-to-text transcription and prosody analysis. For speech-to-text transcription, a higher level of audio noise can make it more difficult for ASR engines to accurately transcribe the speech, which can lead to a higher WER in the resulting transcript <ref type="bibr">[19]</ref>. Similarly, a higher level of audio noise can make it more difficult to accurately identify audio features, such as patterns of pitch <ref type="bibr">[28]</ref>. Hence, the setup of intelligent systems in classrooms should aim to capture clean speech, if possible, either by using advanced microphones or by separating learner groups to avoid ambient sound. 
To improve performance in transcribing noisy speech, recent pre-trained ASR models can be fine-tuned.</p><p>In our comparison of multimodal models trained with various fusion techniques in the presence of noise, the results showed that neural-network-based fusion approaches generally achieved higher confusion and conflict recognition accuracy than traditional early and late fusion approaches. The main advantage of neural-network-based fusion approaches over model-agnostic fusion approaches lies in their ability to exploit the underlying relationships and mutual information among modalities <ref type="bibr">[2,</ref><ref type="bibr">22]</ref>. Hence, the experimental results suggest that at the same data noise level, model-based fusion approaches could perform more robustly in modeling collaborative learning than model-agnostic fusion approaches.</p></div>
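<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the distinction between the two model-agnostic baselines in Section 7.2 concrete, the following minimal sketch shows early fusion as concatenation of per-modality feature vectors and late fusion as a weighted average of per-modality class probabilities. The function names, the dictionary layout, and the optional weights are illustrative assumptions and are not taken from the models evaluated in this study; the neural-network-based fusion approaches are not shown.</p>
<p>
```python
from typing import Dict, List, Optional


def early_fusion(features_by_modality: Dict[str, List[float]]) -> List[float]:
    """Concatenate per-modality feature vectors into one input vector."""
    fused: List[float] = []
    # Iterating in sorted name order keeps the fused layout stable.
    for name in sorted(features_by_modality):
        fused.extend(features_by_modality[name])
    return fused


def late_fusion(probs_by_modality: Dict[str, List[float]],
                weights: Optional[Dict[str, float]] = None) -> List[float]:
    """Weighted average of per-modality class-probability vectors."""
    names = sorted(probs_by_modality)
    if not names:
        raise ValueError("at least one modality is required")
    if weights is None:
        # Hypothetical default: every modality contributes equally.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if not total > 0:
        raise ValueError("weights must sum to a positive value")
    n_classes = len(probs_by_modality[names[0]])
    fused = [0.0] * n_classes
    for name in names:
        for k, prob in enumerate(probs_by_modality[name]):
            fused[k] += weights[name] * prob / total
    return fused
```
</p>
<p>With equal weights, late_fusion averages the unimodal predictions. Passing a smaller weight for a modality that is known to be noisy reduces its influence on the fused prediction, which is one simple way a noisier modality could be down-weighted.</p></div>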
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Limitations and Future Work</head><p>The current study has important limitations. First, our random perturbation approach to generating text errors may not fully simulate ASR errors, because ASR output tends to contain errors that are phonetically similar to the reference text. Second, the dataset was relatively small, consisting of recordings from just 25 learner dyads, so the acceptable noise levels identified here may not generalize to learners in other age groups or to other learning environments, such as online learning. In addition, the noise levels examined in this study were not fine-grained enough: we generated noisy language data at a WER granularity of 0.1 and split noisy audio data at an SNR granularity of 5 dB. Using smaller granularities in future studies will provide more practical implications for deploying intelligent systems in actual classroom environments. Last, this study stopped short of attempting to improve the performance of the state-of-the-art multimodal fusion models tested. Toward this goal, we are currently developing an intelligent system that adopts an adaptive fusion strategy, in which information from different modalities is dynamically integrated based on estimates of their noise levels, so that more informative modalities are prioritized to improve multimodal modeling performance over time.</p></div>
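<div xmlns="http://www.tei-c.org/ns/1.0"><p>The random perturbation mentioned in the first limitation can be sketched as follows. This is a simplified illustration under our own assumptions, not the procedure used in the study: it applies substitutions only (no insertions or deletions), draws replacements uniformly from a caller-supplied vocabulary, and the names perturb_transcript, target_wer, and vocabulary are hypothetical. Because replacements are chosen at random rather than by phonetic similarity, the sketch shares the limitation described above.</p>
<p>
```python
import random
from typing import List, Optional, Sequence


def perturb_transcript(words: Sequence[str], target_wer: float,
                       vocabulary: Sequence[str],
                       rng: Optional[random.Random] = None) -> List[str]:
    """Substitute round(target_wer * len(words)) randomly chosen words."""
    if target_wer > 1.0 or 0.0 > target_wer:
        raise ValueError("target_wer must lie between 0 and 1")
    rng = rng or random.Random()
    n_errors = round(target_wer * len(words))
    # Choose distinct positions so each error hits a different word.
    positions = rng.sample(range(len(words)), n_errors)
    noisy = list(words)
    for pos in positions:
        # Only accept replacements that differ from the original word.
        candidates = [w for w in vocabulary if w != words[pos]]
        if not candidates:
            raise ValueError("vocabulary needs a word that differs from the original")
        noisy[pos] = rng.choice(candidates)
    return noisy
```
</p>
<p>When none of the vocabulary words occur in the original transcript, each substituted position counts as exactly one word error, so a ten-word utterance perturbed with target_wer set to 0.2 differs from the original in exactly two words.</p></div>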
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSION</head><p>Intelligent systems hold great promise to support collaborative learning, but the noise that accompanies the data poses great challenges to analyzing and understanding learners' dialogues. It is important to develop effective methods for intelligent systems to perform robustly in the face of substantial noise. This paper takes a first step toward addressing this challenge by examining the impact of noisy language, noisy audio, and incomplete video on modeling learners' interactions during collaborative activities.</p><p>The results of extensive experiments showed that in the language modality, the model's accuracy for detecting confusion and conflict decreased drastically when the WER exceeded 20%. In the audio modality, the model's accuracy decreased sharply when the SNR dropped below 5 dB. In the video modality, the model's accuracy remained relatively constant at a comparable level as long as at least one learner's face was successfully tracked. To further investigate the effect of integrating multimodal data in the presence of unimodal data noise, we trained several multimodal models. The results showed that combining data from other modalities could effectively compensate for the negative effect of noise in unimodal data, ultimately leading to improved modeling accuracy.</p></div></body>
		</text>
</TEI>
