Search for: All records

Creators/Authors contains: "Hansen, John H.L."

« Prev Next »

Total Resources

19

Resource Type
Conference Paper

13

Conference Proceeding

0

Dataset

0

Journal Article

6

Workshop Report

0

Availability
Full Text / Resource Available

18

Citation Only

1

Save Results
Excel (limit 2000)
CSV (limit 5000)
XML (limit 5000)

Have feedback or suggestions for a way to improve these results?
!

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Historical Audio Search and Preservation: Finding Waldo Within the Fearless Steps Apollo 11 Naturalistic Audio Corpus

https://doi.org/10.1109/MSP.2023.3237001

Chandra Shekar, Meena M. ; Hansen, John H.L. ( May 2023 , IEEE Signal Processing Magazine)

Apollo 11 was the first manned space mission to successfully bring astronauts to the Moon and return them safely. As part of NASA’s goal in assessing team and mission success, all voice communications within mission control, astronauts, and support staff were captured using a multichannel analog system, which until recently had never been made available. More than 400 personnel served as mission specialists/support who communicated across 30 audio loops, resulting in 9,000+ h of data. It is essential to identify each speaker’s role during Apollo and analyze group communication to achieve a common goal. Manual annotation is costly, so this makes it necessary to determine robust speaker identification and tracking methods. In this study, a subset of 100hr derived from the collective 9,000hr of the Fearless Steps (FSteps) Apollo 11 audio data were investigated, corresponding to three critical mission phases: liftoff, lunar landing, and lunar walk. A speaker recognition assessment is performed on 140 speakers from a collective set of 183 NASA mission specialists who participated, based on sufficient training data obtained from 5 (out of 30) mission channels. We observe that SincNet performs the best in terms of accuracy and F score achieving 78.6% accuracy. Speaker models trained on specific phases are also compared with each other to determine if stress, g-force/atmospheric pressure, acoustic environments, etc., impact the robustness of the models. Higher performance was obtained using i-vector and x-vector systems for phases with limited data, such as liftoff and lunar walk. When provided with a sufficient amount of data (lunar landing phase), SincNet was shown to perform the best. This represents one of the first investigations on speaker recognition for massively large team-based communications involving naturalistic communication data. In addition, we use the concept of “Where’s Waldo?” to identify key speakers of interest (SOIs) and track them over the complete FSteps audio corpus. This additional task provides an opportunity for the research community to transition the FSteps collection as an educational resource while also serving as a tribute to the “heroes behind the heroes of Apollo.”
more » « less
Free, publicly-accessible full text available May 1, 2024
Domain Expansion for End-to-End Speech Recognition: Applications for Accent/Dialect Speech

https://doi.org/10.1109/TASLP.2022.3233238

Ghorbani, Shahram ; Hansen, John H.L. ( January 2023 , IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Training Automatic Speech Recognition (ASR) systems with sequentially incoming data from alternate domains is an essential milestone in order to reach human intelligibility level in speech recognition. The main challenge of sequential learning is that current adaptation techniques result in significant performance degradation for previously-seen domains.To mitigate the catastrophic forgetting problem, this study proposes effective domain expansion techniques for two scenarios: 1)where only new domain data is available, and 2) where both prior and new domain data are available. We examine the efficacy of the approaches through experiments on adapting a model trained with native English to different English accents. For the first scenario, we study several existing and proposed regularization-based approaches to mitigate performance loss of initial data.The experiments demonstrate the superior performanceo four proposed Soft KL-Divergence(SKLD)-Model Averaging (MA) approach. In this approach, SKLD first alleviates the forgetting problem during adaptation; next, MA makes the final efficient compromise between the two domains by averaging parameters of the initial and adapted models. For the second scenario, we explore several rehearsal-based approaches, which leverage initial data to maintain the original model performance.We propose Gradient Averaging (GA) as well as an approach which operates by averaging gradients computed for both initial and new domains. Experiments demonstrate that GA outperforms retraining and specifically designed continual learning approaches, such as Averaged Gradient Episodic Memory (AGEM). Moreover, GA significantly improves computational costs over the complete retraining approach.
more » « less
Full Text Available
Challenges in Metadata Creation for Massive Naturalistic Team-Based Audio Data

https://doi.org/10.21437/Interspeech.2022-11243

Belitz, Chelzy ; Hansen, John H.L. ( September 2022 , ISCA INTERSPEECH-2022)

A broad range of research fields benefit from the information extracted from naturalistic audio data. Speech research typically relies on the availability of human-generated metadata tags to comprise a set of “ground truth” labels for the development of speech processing algorithms. While the manual generation of metadata tags may be feasible on a small scale, unique problems arise when creating speech resources for massive, naturalistic audio data. This paper presents a general discussion on these challenges and highlights suggestions when creating metadata for speech resources that are intended to be useful both in speech research and in other fields. Further, it provides an overview of how the task of creating a speech resource for various communities has been and is continuing to be approached for the massive corpus of audio from the historic NASA Apollo missions, which includes tens of thousands of hours of naturalistic, team-based audio data featuring numerous speakers across multiple points in history.
more » « less
Full Text Available
FeaRLESS: Feature Refinement Loss for Ensembling Self-Supervised Learning Features in Robust End-to-end Speech Recognition

https://doi.org/10.21437/Interspeech.2022-10917

Chen, Szu-Jui ; Xie, Jiamin ; Hansen, John H.L. ( September 2022 , ISCA INTERSPEECH-2022)

Self-supervised learning representations (SSLR) have resulted in robust features for downstream tasks in many fields. Recently, several SSLRs have shown promising results on automatic speech recognition (ASR) benchmark corpora. However, previous studies have only shown performance for solitary SSLRs as an input feature for ASR models. In this study, we propose to investigate the effectiveness of diverse SSLR combinations using various fusion methods within end-to-end (E2E) ASR models. In addition, we will show there are correlations between these extracted SSLRs. As such, we further propose a feature refinement loss for decorrelation to efficiently combine the set of input features. For evaluation, we show that the proposed “FeaRLESS learning features” perform better than systems without the proposed feature refinement loss for both the WSJ and Fearless Steps Challenge (FSC) corpora.
more » « less
Full Text Available
Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

https://doi.org/10.21437/Interspeech.2022-11039

Yang, Mu ; Hirschi, Kevin ; Looney, Stephen Daniel ; Kang, Okim ; Hansen, John H.L. ( September 2022 , Interspeech)

Full Text Available
Can Smartphones be a cost-effective alternative to LENA for Early Childhood Language Intervention?

https://doi.org/10.21437/S4SG.2022-3

Dutta, Satwik ; Reyna, Jacob C. ; Buzhardt, Jay F. ; Irvin, Dwight ; Hansen, John H.L. ( September 2022 , Workshop on Speech for Social Good (S4SG))

Although non-profit commercial products such as LENA can provide valuable feedback to parents and early childhood educators about their children’s or student’s daily communication interactions, their cost and technology requirements put them out of reach of many families who could benefit. Over the last two decades, smartphones have become commonly used in most households irrespective of their socio-economic background. In this study, conducted during the COVID-19 pandemic, we aim to compare audio collected on LENA recorders versus smartphones available to families in an unsupervised data collection protocol. Approximately 10 hours of audio evaluated in this study was collected by three families in their homes during parent-child science book reading activities with their children. We report comparisons and found similar performance between the two audio capture devices based on their speech signal-tonoise ratio (NIST STNR) and word-error-rates calculated using automatic speech recognition (ASR) engines. Finally, we discuss implications of this study for expanding this technology to more diverse populations, limitations and future directions.
more » « less
Full Text Available
Assessing child communication engagement and statistical speech patterns for American English via speech recognition in naturalistic active learning spaces

https://doi.org/10.1016/j.specom.2022.01.006

Lileikyte, Rasa ; Irvin, Dwight ; Hansen, John H.L. ( May 2022 , Speech Communication)

Full Text Available
Challenges remain in Building ASR for Spontaneous Preschool Children Speech in Naturalistic Educational Environments

https://doi.org/10.21437/Interspeech.2022-555

Dutta, Satwik ; Tao, Sarah Anne ; Reyna, Jacob C. ; Hacker, Rebecca Elizabeth ; Irvin, Dwight W. ; Buzhardt, Jay F. ; Hansen, John H.L. ( September 2022 , ISCA INTERSPEECH-2022)

Monitoring child development in terms of speech/language skills has a long-term impact on their overall growth. As student diversity continues to expand in US classrooms, there is a growing need to benchmark social-communication engagement, both from a teacher-student perspective, as well as student-student content. Given various challenges with direct observation, deploying speech technology will assist in extracting meaningful information for teachers. These will help teachers to identify and respond to students in need, immediately impacting their early learning and interest. This study takes a deep dive into exploring various hybrid ASR solutions for low-resource spontaneous preschool (3-5yrs) children (with & without developmental delays) speech, being involved in various activities, and interacting with teachers and peers in naturalistic classrooms. Various out-of-domain corpora over a wide and limited age range, both scripted and spontaneous were considered. Acoustic models based on factorized TDNNs infused with Attention, and both N-gram and RNN language models were considered. Results indicate that young children have significantly different/ developing articulation skills as compared to older children. Out-of-domain transcripts of interactions between young children and adults however enhance language model performance. Overall transcription of such data, including various non-linguistic markers, poses additional challenges.
more » « less
Full Text Available
Speaker Conditioning of Acoustic Models Using Affine Transformation for Multi-Speaker Speech Recognition

https://doi.org/10.1109/ASRU51503.2021.9688231

Yousefi, Midia ; Hansen, John H.L. ( December 2021 , IEEE ASRU-2021: Automatic Speech Recognition & Understanding Workshop)

This study addresses the problem of single-channel Automatic Speech Recognition of a target speaker within an overlap speech scenario. In the proposed method, the hidden representations in the acoustic model are modulated by speaker auxiliary information to recognize only the desired speaker. Affine transformation layers are inserted into the acoustic model network to integrate speaker information with the acoustic features. The speaker conditioning process allows the acoustic model to perform computation in the context of target-speaker auxiliary information. The proposed speaker conditioning method is a general approach and can be applied to any acoustic model architecture. Here, we employ speaker conditioning on a ResNet acoustic model. Experiments on the WSJ corpus show that the proposed speaker conditioning method is an effective solution to fuse speaker auxiliary information with acoustic features for multi-speaker speech recognition, achieving +9% and +20% relative WER reduction for clean and overlap speech scenarios, respectively, compared to the original ResNet acoustic model baseline.
more » « less
Full Text Available
Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora

https://doi.org/10.1109/ASRU51503.2021.9688225

Chen, Szu-Jui ; Xia, Wei ; Hansen, John H.L. ( December 2021 , IEEE ASRU-2021: Automatic Speech Recognition & Understanding Workshop)

In this study, we propose to investigate triplet loss for the purpose of an alternative feature representation for ASR. We consider a general non-semantic speech representation, which is trained with a self-supervised criteria based on triplet loss called TRILL, for acoustic modeling to represent the acoustic characteristics of each audio. This strategy is then applied to the CHiME-4 corpus and CRSS-UTDallas Fearless Steps Corpus, with emphasis on the 100-hour challenge corpus which consists of 5 selected NASA Apollo-11 channels. An analysis of the extracted embeddings provides the foundation needed to characterize training utterances into distinct groups based on acoustic distinguishing properties. Moreover, we also demonstrate that triplet-loss based embedding performs better than i-Vector in acoustic modeling, confirming that the triplet loss is more effective than a speaker feature. With additional techniques such as pronunciation and silence probability modeling, plus multi-style training, we achieve a +5.42% and +3.18% relative WER improvement for the development and evaluation sets of the Fearless Steps Corpus. To explore generalization, we further test the same technique on the 1 channel track of CHiME-4 and observe a +11.90% relative WER improvement for real test data.
more » « less
Full Text Available

« Prev Next »