NSF PAR Search | NSF Public Access Repository

Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?

Guan, Y; Trinh, VA; Voleti, V; Whitehill, J (June 2025, IEEE International Conference on Multimedia & Expo 2025)

Free, publicly-accessible full text available June 30, 2026

Classroom Observation: Evaluating Instructional Support Automatically in Classroom for Young Children

Wang, J; Hankour, K; Zhang, Y; LoCasale-Crouch, J; Whitehill, J (March 2025, AAAI Workshop: iRAISE Innovation and Responsibility in AI-Supported Education)

Free, publicly-accessible full text available March 3, 2026

Optimizing Speaker Diarization for the Classroom: Applications in Timing Student Speech and Distinguishing Teachers from Children

https://doi.org/10.5281/zenodo.14871875

Wang, Jiani; Dudy, Shiran; He, Xinlu; Wang, Zhiyong; Southwell, Rosy; Whitehill, Jacob (January 2025, Journal of Educational Data Mining)

An important dimension of classroom group dynamics and collaboration is how much each person contributes to the discussion. With the goal of distinguishing teachers' speech from children's speech and measuring how much each student speaks, we have investigated how automatic speaker diarization can be built to handle real-world classroom group discussions. We examined key design considerations such as the level of granularity of speaker assignment, speech enhancement techniques, voice activity detection, and embedding assignment methods to find an effective configuration. The best speaker diarization system we found was based on the ECAPA-TDNN speaker embedding model and used Whisper automatic speech recognition to identify speech segments. The diarization error rate (DER) in challenging noisy spontaneous classroom data was around 34%, and the correlations of estimated vs. human annotations of how much each student spoke reached 0.62. The accuracy of distinguishing teachers' speech from children's speech was 69.17%. We evaluated the system for potential accuracy bias across people of different skin tones and genders and found that the accuracy did not show statistically significant differences across either dimension. Thus, the presented diarization system has potential to benefit educational research and to provide teachers and students with useful feedback to better understand their classroom dynamics.
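For readers unfamiliar with the diarization error rate (DER) reported in the abstract above, the following is a minimal frame-level sketch of the metric: missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech. It is not the authors' evaluation code. It assumes one speaker label per frame, assumes the hypothesis labels have already been mapped onto the reference labels, and ignores the collars and overlap handling that standard scoring tools apply. The function name `frame_der` and the toy label sequences are hypothetical.

```python
def frame_der(reference, hypothesis):
    """Frame-level DER for equal-length label sequences.

    Each element is a speaker label, or None for non-speech.
    Assumes hypothesis labels are already mapped to reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    missed = false_alarm = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            ref_speech += 1
            if hyp is None:
                missed += 1          # speech in reference, silence in hypothesis
            elif hyp != ref:
                confusion += 1       # speech attributed to the wrong speaker
        elif hyp is not None:
            false_alarm += 1         # silence in reference, speech in hypothesis

    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / ref_speech


# Hypothetical toy data: "T" = teacher, "S1"/"S2" = students, None = non-speech.
reference = ["T", "T", None, "S1", "S1", "S2", None, "S2"]
hypothesis = ["T", "S1", None, "S1", None, "S2", "S2", "S2"]
der = frame_der(reference, hypothesis)
print(f"DER = {der:.2f}")
```

In this toy example there is one confusion, one miss, and one false alarm against six reference speech frames.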
Free, publicly-accessible full text available January 1, 2026

Speaker Diarization in the Classroom: How Much Does Each Student Speak in Group Discussions?

Wang, J.; Dudy, S.; He, X.; Wang, Z.; Southwell, R.; Whitehill, J. (July 2024, Educational Data Mining)

Full Text Available

Tracking Classroom Movement Patterns with Person Re-ID

He, X.; Wang, J.; Trinh, V.A.; McReynolds, A.; Whitehill, J. (July 2024, Educational Data Mining)

Full Text Available

Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback

Whitehill, Jacob; LoCasale-Crouch, Jennifer (June 2024, Journal of Educational Data Mining)

With the aim of providing teachers with more specific, frequent, and actionable feedback about their teaching, we explore how Large Language Models (LLMs) can be used to estimate "Instructional Support" domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol. We design a machine learning architecture that uses zero-shot prompting of Meta's Llama2 and/or a classic Bag of Words (BoW) model to classify individual utterances of teachers' speech (transcribed automatically using OpenAI's Whisper) for the presence of Instructional Support. Then, these utterance-level judgments are aggregated over a 15-min observation session to estimate a global CLASS score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate that (1) automatic CLASS Instructional Support estimation accuracy using the proposed method (Pearson R up to 0.48) approaches human inter-rater reliability (up to R=0.55); (2) LLMs generally yield slightly greater accuracy than BoW for this task, though the best models often combined features extracted from both LLM and BoW; and (3) for classifying individual utterances, there is still room for improvement of automated methods compared to human-level judgments. Finally, (4) we illustrate how the model's outputs can be visualized at the utterance level to provide teachers with explainable feedback on which utterances were most positively or negatively correlated with specific CLASS dimensions.
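The aggregation step described in the abstract above (utterance-level judgments pooled into one score per session, then compared with human scores via Pearson R) can be illustrated with a small sketch. This is not the authors' architecture: the abstract does not say how judgments are aggregated, so a simple mean is assumed here, and the function names `session_score` and `pearson_r` and all numbers are hypothetical.

```python
import math


def session_score(utterance_judgments):
    """Aggregate utterance-level judgments (e.g. probabilities in [0, 1])
    into one session-level estimate. A plain mean is an assumption."""
    if not utterance_judgments:
        raise ValueError("session has no utterances")
    return sum(utterance_judgments) / len(utterance_judgments)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical utterance-level judgments for four observation sessions.
sessions = [
    [0.1, 0.2, 0.0, 0.3],
    [0.6, 0.4, 0.7],
    [0.2, 0.5, 0.3, 0.4, 0.1],
    [0.9, 0.8, 0.6],
]
# Hypothetical human-coded scores for the same sessions.
human_scores = [2.0, 4.5, 3.0, 6.0]

predicted = [session_score(s) for s in sessions]
r = pearson_r(predicted, human_scores)
print(f"Pearson R between predicted and human scores: {r:.2f}")
```

A mean keeps the session estimate on the same scale as the utterance judgments; a learned regressor on top of the pooled features would be another reasonable choice.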
Full Text Available

Automatic Speech Recognition Tuned for Child Speech in the Classroom

https://doi.org/10.1109/ICASSP48485.2024.10447428

Southwell, Rosy; Ward, Wayne; Trinh, Viet Anh; Clevenger, Charis; Clevenger, Clay; Watts, Emily; Reitman, Jason; D’Mello, Sidney; Whitehill, Jacob (April 2024, IEEE)

Full Text Available

Compositional clustering: Applications to multi-label object recognition and speaker identification

https://doi.org/10.1016/j.patcog.2023.109829

Li, Zeqian; He, Xinlu; Whitehill, Jacob (July 2023, Pattern Recognition)

Full Text Available

In Search of Negative Moments: Multi-Modal Analysis of Teacher Negativity in Classroom Observation Videos

Dai, Z.; McReynolds, A.; Whitehill, J. (January 2023, Educational Data Mining)

Full Text Available

Can the Mathematical Correctness of Object Configurations Affect the Accuracy of Their Perception?

Jiang, H.; Li, Z.; Whitehill, J. (June 2022, CVPR Workshop: 1st Workshop on Vision Datasets Understanding)

Full Text Available