Speech and language development in children is crucial to their
long-term learning ability. A child’s vocabulary size at
kindergarten entry is an early indicator of their ability to learn
to read and of potential long-term success in school. The preschool
classroom is thus a promising venue for assessing growth in young
children by measuring their interactions with teachers and
classmates. However, to date few studies have explored such
naturalistic audio communications.
Automatic Speech Recognition (ASR) technologies provide an
opportunity for ‘Early Childhood’ researchers to measure such
interactions through automatic analysis of naturalistic classroom
recordings. For this purpose, 208 hours of audio
recordings across 48 daylong sessions are collected in a childcare
learning center in the United States using Language Environment
Analysis (LENA) devices worn by the preschool children. Approximately
29 hours of adult speech and 26 hours of child speech are
segmented using manual transcriptions provided by the CRSS transcription
team. Traditional as well as End-to-End ASR models are trained
on the adult and child speech subsets. A Factorized Time Delay Neural
Network achieves the best Word-Error-Rate (WER) of 35.05% on the
adult subset of the test set. End-to-End transformer models achieve
63.5% WER on the child subset of the test data. Next, bar plots
demonstrating the frequency of WH-question words in Science vs.
Reading activity areas of the preschool are presented for sessions in
the test set. It is suggested that learning spaces could be configured
to encourage greater adult-child conversational engagement given
such speech/audio assessment strategies.
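As a reference point for the WER figures above, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not the scoring pipeline or data used in the study, and the text normalization that ASR scoring tools usually apply is reduced here to lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; not the study's scorer.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up utterance pair: one deleted word out of five reference words.
    print(word_error_rate("what is in the box", "what is the box"))
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.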
Can Smartphones be a cost-effective alternative to LENA for Early Childhood Language Intervention?
Although non-profit commercial products such as LENA can
provide valuable feedback to parents and early childhood educators
about their children’s or students’ daily communication
interactions, their cost and technology requirements put them
out of reach of many families who could benefit. Over the last
two decades, smartphones have become commonly used in most
households irrespective of their socio-economic background. In
this study, conducted during the COVID-19 pandemic, we aim
to compare audio collected on LENA recorders versus smartphones
available to families in an unsupervised data collection
protocol. Approximately 10 hours of audio evaluated in this
study was collected by three families in their homes during
parent-child science book reading activities.
We compare the two audio capture devices and find similar
performance based on their speech signal-to-noise
ratio (NIST STNR) and on word error rates calculated using
automatic speech recognition (ASR) engines. Finally, we discuss
implications of this study for expanding this technology to
more diverse populations, limitations and future directions.
- Award ID(s):
- 1918032
- PAR ID:
- 10478767
- Publisher / Repository:
- ISCA
- Date Published:
- Journal Name:
- Workshop on Speech for Social Good (S4SG)
- Page Range / eLocation ID:
- 10 to 14
- Subject(s) / Keyword(s):
- parent-child book reading smartphone speech recognition early childhood
- Format(s):
- Medium: X
- Location:
- Incheon, Korea
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Young children’s friendships fuel essential developmental outcomes (e.g., social-emotional competence) and are thought to provide even greater benefits to children with or at risk for disabilities. Teacher and parent report and sociometric measures are commonly used to measure friendships, and ecobehavioral assessment has been used to capture their features on a momentary basis. In this proof-of-concept study, we use Ubisense, the Language ENvironmental Analysis (LENA) recorder, and advanced speech processing algorithms to capture features of friendship: child-peer speech and proximity within activity areas. We collected 12,332 1-second speech and location data points. Our preliminary results indicate that the focal child at risk for a disability and each playmate spent time vocalizing near one another across 4 activity areas. Additionally, compared to the Blocks activity area, the children had significantly lower odds of talking while in proximity during Manipulatives and Science. This suggests that the activity areas children occupy may affect their engagement with peers and, in turn, the friendships they develop. The proposed approach is a groundbreaking advance toward understanding and supporting children’s friendships.
-
Speech and language development in children is crucial for ensuring optimal outcomes in their long-term development and life-long educational journey. A child’s vocabulary size at the time of kindergarten entry is an early indicator of learning to read and potential long-term success in school. The preschool classroom is thus a promising venue for monitoring growth in young children by measuring their interactions with teachers and classmates. Automatic Speech Recognition (ASR) technologies give ‘Early Childhood’ researchers the ability to automatically analyze naturalistic recordings in these settings. For this purpose, data are collected in a high-quality childcare center in the United States using Language Environment Analysis (LENA) devices worn by the preschool children. A preliminary task for ASR of daylong audio recordings would involve diarization, i.e., segmenting speech into smaller parts for identifying ‘who spoke when.’ This study investigates a Deep Learning-based diarization system for classroom interactions of 3-5-year-old children. However, the focus is on ‘speaker group’ diarization, which classifies speech segments as being from adults or children across multiple classrooms. SincNet-based diarization systems achieve an utterance-level Diarization Error Rate of 19.1%. Utterance-level speaker group confusion matrices also show promising, balanced results. These diarization systems have potential applications in developing metrics for adult-to-child or child-to-child rapid conversational turns in a naturalistic noisy early childhood setting. Such technical advancements will also help teachers better and more efficiently quantify and understand their interactions with children, make changes as needed, and monitor the impact of those changes.
-
Fearless Steps (FS) APOLLO is a 50,000+ hr audio resource established by CRSS-UTDallas capturing all communications between NASA-MCC personnel, backroom staff, and Astronauts across manned Apollo Missions. Such a massive, unlabeled audio corpus without metadata provides limited benefit for communities outside Speech-and-Language Technology (SLT). Supplementing this audio with rich metadata developed using robust automated mechanisms to transcribe and highlight naturalistic communications can facilitate open research opportunities for SLT, speech sciences, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125 h FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the limited availability of word-level annotations. Forced alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores challenges in extracting accurate keyword segments using existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams.
-
Speech and language development are early indicators of overall analytical and learning ability in children. The preschool classroom is a rich language environment for monitoring and ensuring growth in young children by measuring their vocal interactions with teachers and classmates. Early childhood researchers are naturally interested in analyzing naturalistic vs. controlled lab recordings to measure both quality and quantity of such interactions. Unfortunately, present-day speech technologies are not capable of addressing the wide dynamic scenario of early childhood classroom settings. Due to the diversity of acoustic events/conditions in such daylong audio streams, automated speaker diarization technology would need to be advanced to address this challenging domain for segmenting audio as well as information extraction. This study investigates an alternate Deep Learning-based diarization solution for segmenting classroom interactions of 3-5-year-old children with teachers. In this context, the focus is on speech-type diarization, which classifies speech segments as being either from adults or children partitioned across multiple classrooms. Our proposed ResNet model achieves a best F1-score of ∼78.0% on data from two classrooms, based on dev and test sets of each classroom. It is utilized with Automatic Speech Recognition-based resegmentation modules to perform child-adult diarization. Additionally, F1-scores are obtained for individual segments with corresponding speaker tags (e.g., adult vs. child), which provide knowledge for educators on child engagement through naturalistic communications. The study demonstrates the prospects of addressing educational assessment needs through communication audio stream analysis, while maintaining both security and privacy of all children and adults. The resulting child communication metrics have been used for broad-based feedback for teachers with the help of visualizations.