skip to main content


Title: Speech and language processing for assessing child–adult interaction based on diarization and location
Understanding and assessing child verbal communication patterns is critical in facilitating effective language development. Typically speaker diarization is performed to explore children’s verbal engagement. Understanding which activity areas stimulate verbal communication can help promote more efficient language development. In this study, we present a two stage children vocal engagement prediction system that consists of (1) a near to real-time, noise robust system that measures the duration of child-to-adult and child-to-child conversations, and tracks the number of conversational turn-takings, (2) a novel child location tracking strategy, that determines in which activity areas a child spends most/least of their time. A proposed child–adult turn-taking solution relies exclusively on vocal cues observed during the interaction between a child and other children, and/or classroom teachers. By employing a threshold optimized speech activity detection using a linear combination of voicing measures, it is possible to achieve effective speech/non-speech segment detection prior to conversion assessment. This TO-COMBO-SAD reduces classification error rates for adult-child audio by 21.34% and 27.3% compared to a baseline i-Vector and standard Bayesian Information Criterion diarization systems, respectively. In addition, this study presents a unique location tracking system adult-child that helps determine the quantity of child–adult communication in specific activity areas, and which activities stimulate voice communication engagement in a child–adult education space. We observe that our proposed location tracking solution offers unique opportunities to assess speech and language interaction for children, and quantify the location context which would contribute to improve verbal communication.  more » « less
Award ID(s):
1918012
NSF-PAR ID:
10179276
Author(s) / Creator(s):
Date Published:
Journal Name:
International journal of speech technology
ISSN:
1572-8110
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing naturalistic vs. controlled lab recordings to measure both quality and quantity of such interactions. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in such daylong audio streams, automated speaker diarization technology would need to be advanced to address this challenging domain for segmenting audio as well as information extraction. This study investigates an alternate Deep Learning-based diarization solution for segmenting classroom interactions of 3-5 year old children with teachers. In this context, the focus on speech-type diarization which classifies speech segments as being either from adults or children partitioned across multiple classrooms. Our proposed ResNet model achieves a best F1-score of ∼78.0% on data from two classrooms, based on dev and test sets of each classroom. It is utilized with Automatic Speech Recognition-based resegmentation modules to perform child-adult diarization. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs. child), which provide knowledge for educators on child engagement through naturalistic communications. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both security and privacy of all children and adults. The resulting child communication metrics have been used for broad-based feedback for teachers with the help of visualizations. 
    more » « less
  2. Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing naturalistic vs. controlled lab recordings to measure both quality and quantity of such interactions. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in such daylong audio streams, automated speaker diarization technology would need to be advanced to address this challenging domain for segmenting audio as well as information extraction. This study investigates an alternate Deep Learning-based diarization solution for segmenting classroom interactions of 3-5 year old children with teachers. In this context, the focus on speech-type diarization which classifies speech segments as being either from adults or children partitioned across multiple classrooms. Our proposed ResNet model achieves a best F1-score of ∼71.0% on data from two classrooms, based on dev and test sets of each classroom. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs. child), which provide knowledge for educators on child engagement through naturalistic communications. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both security and privacy of all children and adults. The resulting child communication metrics have been used for broad-based feedback for teachers with the help of visualizations. 
    more » « less
  3. Young children’s friendships fuel essential developmental outcomes (e.g., social-emotional competence) and are thought to provide even greater benefits to children with or at-risk for disabilities. Teacher and parent report and sociometric measures are commonly used to measure friendships, and ecobehavioral assessment has been used to capture its features on a momentary basis. In this proof-of-concept study, we use Ubisense, the Language ENvironmental Analysis (LENA) recorder, and advanced speech processing algorithms to capture features of friendship –child-peer speech and proximity within activity areas . We collected 12,332 1-second speech and location data points. Our preliminary results indicate the focal child at-risk for a disability and each playmate spent time vocalizing near one another across 4 activity areas. Additionally, compared to the Blocks activity area, the children had significantly lower odds of talking while in proximity during Manipulatives and Science. This suggests that the activity areas children occupy may affect their engagement with peers and, in turn, the friendships they development. The proposed approach is a groundbreaking advance to understanding and supporting children’s friendships. 
    more » « less
  4. Assessing child growth in terms of speech and language is a crucial indicator of long term learning ability and life-long progress. Since the preschool classroom provides a potent opportunity for monitoring growth in young children’s interactions, analyzing such data has come into prominence for early childhood researchers. The foremost task of any analysis of such naturalistic recordings would involve parsing and tagging the interactions between adults and young children. An automated tagging system will provide child interaction metrics and would be important for any further processing. This study investigates the language environment of 3-5 year old children using a CRSS based diarization strategy employing an i-vector-based baseline that captures adult-to-child or childto- child rapid conversational turns in a naturalistic noisy early childhood setting. We provide analysis of various loss functions and learning algorithms using Deep Neural Networks to separate child speech from adult speech. Performance is measured in terms of diarization error rate, Jaccard error rate and shows good results for tagging adult vs children’s speech. Distinction between primary and secondary child would be useful for monitoring a given child and analysis is provided for the same. Our diarization system provides insights into the direction for preprocessing and analyzing challenging naturalistic daylong child speech recordings. 
    more » « less
  5. Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with both teachers and classmates. Early childhood researchers recognize the importance in analyzing naturalistic vs. controlled lab recordings to measure both quality and quantity of child interactions. Recently, large language model-based speech technologies have performed well on conversational speech recognition. In this regard, we assess performance of such models on the wide dynamic scenario of early childhood classroom settings. This study investigates an alternate Deep Learning-based Teacher-Student learning solution for recognizing adult speech within preschool interactions. Our proposed adapted model achieves the best F1-score for recognizing most frequent 400 words on test sets for both classrooms. Additionally, F1-scores for alternate word groups provides a breakdown of performance across relevant language-based word-categories. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both security and privacy of all children and adults. The resulting child communication metrics from this study can also be used for broad-based feedback for teachers. 
    more » « less