Understanding and assessing child verbal communication patterns is critical in facilitating effective language development. Typically, speaker diarization is performed to explore children’s verbal engagement. Understanding which activity areas stimulate verbal communication can help promote more efficient language development. In this study, we present a two-stage child vocal engagement prediction system that consists of (1) a near-real-time, noise-robust system that measures the duration of child-to-adult and child-to-child conversations and tracks the number of conversational turns, and (2) a novel child location tracking strategy that determines in which activity areas a child spends the most and least time. The proposed child–adult turn-taking solution relies exclusively on vocal cues observed during the interaction between a child and other children and/or classroom teachers. By employing threshold-optimized speech activity detection using a linear combination of voicing measures (TO-COMBO-SAD), it is possible to achieve effective speech/non-speech segment detection prior to conversation assessment. TO-COMBO-SAD reduces classification error rates for adult-child audio by 21.34% and 27.3% compared to baseline i-Vector and standard Bayesian Information Criterion diarization systems, respectively. In addition, this study presents a unique adult-child location tracking system that helps determine the quantity of child–adult communication in specific activity areas, and which…
Assessing Child Communication Engagement via Speech Recognition in Naturalistic Active Learning Spaces
The ability to assess children’s conversational interaction is critical in determining language and cognitive proficiency for typically developing and at-risk children. The earlier an at-risk child is identified, the earlier support can be provided to reduce the social impact of the speech disorder. To date, limited research has been performed on speech recognition for young children in classroom settings. This study addresses speech recognition research with naturalistic children’s speech, where age varies from 2.5 to 5 years. Data augmentation is relatively underexplored for child speech. Therefore, we investigate the effectiveness of data augmentation techniques to improve both language and acoustic models. We explore alternative text augmentation approaches using adult data, Web data, and text generated by recurrent neural networks. We also compare several acoustic augmentation techniques: speed perturbation, tempo perturbation, and adult data. Finally, we comment on child word count rates to assess child speech development.
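As a rough illustration of one of the acoustic augmentation techniques named above, speed perturbation can be sketched as resampling the waveform by a constant factor, which changes both duration and pitch (tempo perturbation, by contrast, changes duration while aiming to preserve pitch). The function name `speed_perturb`, the use of linear interpolation, and the factors 0.9 and 1.1 are assumptions made for this sketch; the abstract does not state which implementation or factors the authors used.

```python
def speed_perturb(samples, factor):
    """Resample `samples` so the audio plays `factor` times faster.

    Hypothetical helper for illustration only. Uses linear interpolation,
    so both duration and pitch change (unlike tempo perturbation).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    out_len = max(1, int(round(len(samples) / factor)))
    out = []
    for i in range(out_len):
        pos = i * factor              # fractional index into the input
        lo = min(int(pos), last)      # sample at or before pos
        hi = min(lo + 1, last)        # next sample, clamped at the end
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# Assumed factors: 0.9 (slower, longer) and 1.1 (faster, shorter).
waveform = [float(i) for i in range(10)]
augmented = {f: speed_perturb(waveform, f) for f in (0.9, 1.0, 1.1)}
```

Each perturbed copy would be added to the training set alongside the original recording, with the transcript left unchanged.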
- Award ID(s): 2016725
- Publication Date:
- NSF-PAR ID: 10180044
- Journal Name: ISCA ODYSSEY-2020
- Page Range or eLocation-ID: 396 to 401
- Sponsoring Org: National Science Foundation
More Like this
-
Children’s automatic speech recognition (ASR) remains difficult due, in part, to the data scarcity problem, especially for kindergarten-aged children. When data are scarce, the model might overfit to the training data, and hence good starting points for training are essential. Recently, meta-learning was proposed to learn model initialization (MI) for ASR tasks of different languages. This method leads to good performance when the model is adapted to an unseen language. However, MI is vulnerable to overfitting on training tasks (learner overfitting). It is also unknown whether MI generalizes to other low-resource tasks. In this paper, we validate the effectiveness of MI in children’s ASR and attempt to alleviate the problem of learner overfitting. To achieve model-agnostic meta-learning (MAML), we regard children’s speech at each age as a different task. In terms of learner overfitting, we propose a task-level augmentation method by simulating new ages using frequency warping techniques. Detailed experiments are conducted to show the impact of task augmentation on each age for kindergarten-aged speech. As a result, our approach achieves a relative word error rate (WER) improvement of 51% over the baseline system with no augmentation or initialization.
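For readers unfamiliar with the metric quoted above, the following minimal sketch shows how word error rate and a relative WER improvement are conventionally computed. The function names and the example sentences are hypothetical, and the numeric WER values are placeholders, not results from the paper; only the 51% relative figure comes from the abstract.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the ref prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer, new_wer):
    """Fraction of the baseline WER that was removed."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: one deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")

# Placeholder values only: halving WER is a 50% relative improvement.
gain = relative_improvement(0.5, 0.25)
```

Under this definition, a 51% relative improvement means the new WER is 0.49 times the baseline WER, whatever the absolute values are.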
-
Assessing child growth in terms of speech and language is a crucial indicator of long-term learning ability and life-long progress. Since the preschool classroom provides a potent opportunity for monitoring growth in young children’s interactions, analyzing such data has come into prominence for early childhood researchers. The foremost task of any analysis of such naturalistic recordings would involve parsing and tagging the interactions between adults and young children. An automated tagging system will provide child interaction metrics and would be important for any further processing. This study investigates the language environment of 3-5-year-old children using a CRSS-based diarization strategy employing an i-vector-based baseline that captures adult-to-child or child-to-child rapid conversational turns in a naturalistic, noisy early childhood setting. We provide analysis of various loss functions and learning algorithms using Deep Neural Networks to separate child speech from adult speech. Performance is measured in terms of diarization error rate and Jaccard error rate, and shows good results for tagging adult vs. children’s speech. Distinguishing between the primary and secondary child would be useful for monitoring a given child, and analysis is provided for the same. Our diarization system provides insights into the direction for preprocessing and analyzing challenging…
-
Documenting endangered languages supports the historical preservation of diverse cultures. Automatic speech recognition (ASR), while potentially very useful for this task, has been underutilized for language documentation due to the challenges inherent in building robust models from extremely limited audio and text training resources. In this paper, we explore the utility of supplementing existing training resources using synthetic data, with a focus on Seneca, a morphologically complex endangered language of North America. We use transfer learning to train acoustic models using both the small amount of available acoustic training data and artificially distorted copies of that data. We then supplement the language model training data with verb forms generated by rule and sentences produced by an LSTM trained on the available text data. The addition of synthetic data yields reductions in word error rate, demonstrating the promise of data augmentation for this task.
-
Young children’s friendships fuel essential developmental outcomes (e.g., social-emotional competence) and are thought to provide even greater benefits to children with or at risk for disabilities. Teacher and parent reports and sociometric measures are commonly used to measure friendships, and ecobehavioral assessment has been used to capture their features on a momentary basis. In this proof-of-concept study, we use Ubisense, the Language ENvironmental Analysis (LENA) recorder, and advanced speech processing algorithms to capture features of friendship: child-peer speech and proximity within activity areas. We collected 12,332 1-second speech and location data points. Our preliminary results indicate that the focal child at risk for a disability and each playmate spent time vocalizing near one another across 4 activity areas. Additionally, compared to the Blocks activity area, the children had significantly lower odds of talking while in proximity during Manipulatives and Science. This suggests that the activity areas children occupy may affect their engagement with peers and, in turn, the friendships they develop. The proposed approach is a groundbreaking advance in understanding and supporting children’s friendships.
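To make the "lower odds" comparison above concrete, here is a minimal sketch of an odds ratio computed from a 2x2 table of 1-second data points. The abstract does not say which statistical model the authors fitted, and all counts below are invented purely for illustration; they are not data from the study. The function name `odds_ratio` is likewise an assumption.

```python
def odds_ratio(area_talk, area_silent, ref_talk, ref_silent):
    """Odds of talking in one activity area divided by the odds of
    talking in a reference area. A value below 1 means lower odds."""
    if min(area_silent, ref_talk, ref_silent) <= 0:
        raise ValueError("counts in denominators must be positive")
    area_odds = area_talk / area_silent
    ref_odds = ref_talk / ref_silent
    return area_odds / ref_odds


# Invented counts of in-proximity seconds (NOT from the study):
# a hypothetical area with 20 talking / 80 silent seconds, compared
# against a reference area with 50 talking / 50 silent seconds.
ratio = odds_ratio(20, 80, 50, 50)
```

With these made-up counts the ratio is 0.25, which would read as the odds of talking in the hypothetical area being one quarter of those in the reference area.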