NSF PAR Search | NSF Public Access Repository

DNN-based monaural speech enhancement using alternate analysis windows for phase and magnitude modification

https://doi.org/10.21437/Interspeech.2024-2244

Liu, Xi; Hansen, John HL (September 2024, ISCA)

In recent decades, considerable research has been devoted to speech enhancement based on short-time Fourier transform (STFT) analysis. As speech processing technology evolves, the importance of phase information for speech intelligibility has become more apparent. The Hanning window has typically been used as the analysis window in the STFT. In this study, we propose the Chebyshev window for phase analysis and the Hanning window for magnitude analysis. Next, we introduce a novel cepstral domain enhancement approach designed to robustly reinforce the harmonic structure of speech. The performance of our model is evaluated using the DNS challenge test set as well as the naturalistic APOLLO Fearless Steps evaluation set. Experimental results demonstrate that the Chebyshev-based phase solution outperforms the Hanning option for phase-aware speech enhancement. Furthermore, the incorporation of quefrency emphasis proves effective in enhancing overall speech quality.
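The idea of analysing one frame with two different windows can be sketched in a few lines. The following is a minimal, stdlib-only illustration and not the authors' pipeline: the function names, the 60 dB sidelobe attenuation, the odd frame length, and the toy signal are all assumptions made for the example, and the DNN and cepstral stages described in the abstract are omitted.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform; adequate for one short frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def hann_window(m):
    """Symmetric Hann (Hanning) window of length m."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (m - 1)) for i in range(m)]


def chebyshev_poly(order, x):
    """Chebyshev polynomial T_order(x), evaluated for any real x."""
    if x > 1:
        return math.cosh(order * math.acosh(x))
    if x < -1:
        sign = 1 if order % 2 == 0 else -1
        return sign * math.cosh(order * math.acosh(-x))
    return math.cos(order * math.acos(x))


def chebyshev_window(m, atten_db):
    """Dolph-Chebyshev window of odd length m (sketch handles odd m only)."""
    if m < 3 or m % 2 == 0:
        raise ValueError("this sketch only handles odd lengths >= 3")
    order = m - 1
    beta = math.cosh(math.acosh(10 ** (atten_db / 20)) / order)
    # Sample the window's real, symmetric spectrum, then transform to time.
    spectrum = [chebyshev_poly(order, beta * math.cos(math.pi * k / m))
                for k in range(m)]
    half = [v.real for v in dft(spectrum)[: (m + 1) // 2]]  # centre to edge
    window = half[:0:-1] + half  # mirror so the centre sits in the middle
    peak = max(window)
    return [v / peak for v in window]


def dual_window_spectrum(frame, atten_db=60.0):
    """Take the magnitude from the Hann analysis and the phase from the
    Chebyshev analysis of the same frame (atten_db is an assumed value)."""
    m = len(frame)
    hann = dft([s * w for s, w in zip(frame, hann_window(m))])
    cheb = dft([s * w for s, w in zip(frame, chebyshev_window(m, atten_db))])
    return [cmath.rect(abs(h), cmath.phase(c)) for h, c in zip(hann, cheb)]


# Toy frame of odd length; a real system would slide this over the signal.
frame = [math.sin(2 * math.pi * 3 * t / 31) + 0.3 * math.cos(2 * math.pi * 7 * t / 31)
         for t in range(31)]
combined = dual_window_spectrum(frame)
```

In this sketch the two analyses are simply recombined per frequency bin; how the published system estimates or modifies each component is not stated in the abstract.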
Full Text Available
Fearless Steps Apollo: Towards Community Resource Development for Science, Technology, Education, and Historical Preservation

Hansen, J.H.L.; Joglekar, A.; Shekar, M.M.C.; Chen, S.-J.; Liu, X. (April 2024, IEEE ICASSP-24: Inter. Conf. Acoustics, Speech, and Signal Processing)
(Submitted; in review for IEEE ICASSP-2024)
The Fearless Steps Apollo (FS-APOLLO) resource is a collection of over 150,000 hours of audio, associated meta-data, and a supplemental technological toolkit intended to benefit the (i) speech processing technology, (ii) communication science, team-based psychology, and history, and (iii) education/STEM, preservation/archival communities. The FS-APOLLO initiative, which started in 2014, has since resulted in the preservation of over 75,000 hours of NASA Apollo Missions audio. Systems created for this audio collection have led to the emergence of several new Speech and Language Technologies (SLT). This paper seeks to provide an overview of the latest advancements in the FS-APOLLO effort and explore upcoming strategies in big-data deployment, outreach, and novel avenues of K-12 and STEM education facilitated through this resource.
Full Text Available
Fearless Steps Apollo: Team Communications Based Community Resource Development for Science, Technology, Education, and Historical Preservation

https://doi.org/10.1109/ICASSP48485.2024.10446811

Hansen, John HL; Joglekar, Aditya; Shekar, Meena M C; Chen, Szu-Jui; Liu, Xi (April 2024, IEEE)

The Fearless Steps Apollo (FS-APOLLO) resource is a collection of 150,000 hours of audio, associated meta-data, and supplemental speech technology infrastructure intended to benefit the (i) speech processing technology, (ii) communication science, team-based psychology, and (iii) education/STEM, history/preservation/archival communities. The FS-APOLLO initiative, which started in 2014, has since resulted in the preservation of over 75,000 hours of NASA Apollo Missions audio. Systems created for this audio collection have led to the emergence of several new Speech and Language Technologies (SLT). This paper seeks to provide an overview of the latest advancements in the FS-APOLLO effort and explore upcoming strategies in big-data deployment, outreach, and novel avenues of K-12 and STEM education facilitated through this resource.
Full Text Available
Dual-Path Minimum-Phase and All-Pass Decomposition Network for Single Channel Speech Dereverberation

https://doi.org/10.1109/ICASSP48485.2024.10446719

Liu, Xi; Chen, Szu-Jui; Hansen, John H L (April 2024, IEEE)

With the development of deep neural networks (DNN), many DNN-based speech dereverberation approaches have been proposed that achieve significant improvement over traditional methods. However, most deep learning-based dereverberation methods focus solely on suppressing reverberation in the time-frequency domain, without utilizing cepstral domain features that are potentially useful for dereverberation. In this paper, we propose a dual-path neural network structure to separately process the minimum-phase and all-pass components of single channel speech. First, we decompose the speech signal into minimum-phase and all-pass components in the cepstral domain; then a Conformer-embedded U-Net is used to remove reverberation from both components. Finally, we combine the two processed components to synthesize the enhanced output. The performance of the proposed method is tested on the REVERB-Challenge evaluation dataset in terms of commonly used objective metrics. Experimental results demonstrate that our method outperforms the other compared methods.
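The decomposition step can be illustrated with the standard homomorphic (cepstrum-folding) construction. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the even frame length, the magnitude floor, and the toy frame are hypothetical, and the Conformer-embedded U-Net stages are not shown.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) DFT (or inverse DFT); adequate for one short frame."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def min_phase_all_pass(frame, floor=1e-12):
    """Split the spectrum of a real, even-length frame into a minimum-phase
    factor and an all-pass factor by folding the real cepstrum."""
    n = len(frame)
    if n % 2:
        raise ValueError("this sketch assumes an even frame length")
    spectrum = dft(frame)
    # Real cepstrum: inverse DFT of the log magnitude spectrum.
    log_mag = [math.log(max(abs(v), floor)) for v in spectrum]
    cepstrum = [v.real for v in dft(log_mag, inverse=True)]
    # Fold: keep quefrency 0 and n/2, double the positive quefrencies,
    # and zero the negative ones, giving a causal cepstrum.
    folded = [0.0] * n
    folded[0] = cepstrum[0]
    folded[n // 2] = cepstrum[n // 2]
    for q in range(1, n // 2):
        folded[q] = 2.0 * cepstrum[q]
    min_phase = [cmath.exp(v) for v in dft(folded)]
    # Whatever remains after dividing out the minimum-phase part has
    # unit magnitude, i.e. it is the all-pass component.
    all_pass = [s / m for s, m in zip(spectrum, min_phase)]
    return spectrum, min_phase, all_pass


# Toy frame; values are arbitrary and chosen only for illustration.
frame = [1.0, 0.5, -0.3, 0.2, 0.1, -0.05, 0.02, 0.4]
spectrum, min_phase, all_pass = min_phase_all_pass(frame)
```

The minimum-phase factor carries the full magnitude spectrum, while the all-pass factor carries only the residual phase, so multiplying the two recovers the original spectrum. With short DFTs the folded cepstrum is subject to time aliasing, so this is an approximation of the ideal decomposition.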
Full Text Available
Apollo’s Unheard Voices: Graph Attention Networks for Speaker Diarization and Clustering for Fearless Steps Apollo Collection

https://doi.org/10.1109/ICASSP48485.2024.10446231

Shekar, Meena M C; Hansen, John H L (April 2024, IEEE)

Speaker diarization has traditionally been explored using datasets that are either clean, feature a limited number of speakers, or have a large volume of data but lack the complexities of real-world scenarios. This study takes a unique approach by focusing on the Fearless Steps APOLLO audio resource, a challenging dataset that contains over 70,000 hours of audio (A-11: 10k hrs), the majority of which remains unlabeled. This corpus presents considerable challenges such as diverse acoustic conditions, high levels of background noise, overlapping speech, data imbalance, and a variable number of speakers with varying utterance durations. To address these challenges, we propose a robust speaker diarization framework built on a dynamic Graph Attention Network optimized using data augmentation. Our proposed framework attains a Diarization Error Rate (DER) of 19.6% when evaluated using ground truth speech segments. Notably, our work is the first to recognize, track, and perform conversational analysis on the entire Apollo-11 mission for speakers who were unidentified until now. This work stands as a significant contribution to both historical archiving and the development of robust diarization systems, particularly relevant for challenging real-world scenarios.
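For readers unfamiliar with the metric, DER is conventionally the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The sketch below uses that conventional definition; the function name and the durations are hypothetical and are not taken from the paper.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Conventional DER: all error durations over total reference speech.

    All arguments are durations in the same unit (for example, seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations in seconds, chosen only to show the arithmetic.
der = diarization_error_rate(missed=2.0, false_alarm=3.0,
                             confusion=5.0, total_speech=100.0)
```

When a system is scored on ground-truth speech segments, the missed and false-alarm terms are typically small, so the reported DER mostly reflects speaker confusion.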
Full Text Available
Speaker Tracking using Graph Attention Networks with Varying Duration Utterances across Multi-Channel Naturalistic Data: Fearless Steps Apollo-11 Audio Corpus

https://doi.org/10.21437/Interspeech.2023-1258

Shekar, Meena M.; Hansen, John H. (August 2023, ISCA INTERSPEECH-2023)

Speaker tracking in spontaneous naturalistic data continues to be a major research challenge, especially for short turn-taking communications. The NASA Apollo-11 space mission brought astronauts to the Moon and back, and its team-based voice communications were captured. Building robust speaker classification models for this corpus poses significant challenges due to variability of speaker turns, imbalanced speaker classes, and time-varying background noise/distortions. This study proposes a novel approach for speaker classification and tracking, utilizing a graph attention network framework that builds upon pretrained speaker embeddings. The model’s robustness is evaluated on a varying number of speakers (10-140), achieving classification accuracy of 90.78% for 10 speakers and 79.86% for 140 speakers. Furthermore, a secondary investigation focused on tracking speakers-of-interest (SoI) during mission-critical phases serves as a lasting tribute to the 'Heroes Behind the Heroes'.
Full Text Available
Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting

https://doi.org/10.1109/ICASSP49357.2023.10095436

López-Espejo, Iván; Shekar, Ram C.; Tan, Zheng-Hua; Jensen, Jesper; Hansen, John H. (June 2023, IEEE ICASSP-2023: Inter. Conf. Acoustics, Speech, and Signal Processing)

In the context of keyword spotting (KWS), the replacement of handcrafted speech features by learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS whenever the number of filterbank channels is severely decreased. Reducing the number of channels might cause some drop in KWS performance, but it also substantially reduces energy consumption, which is key when deploying common always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics to provide a higher degree of robustness to noise, especially when dropout is integrated. Thus, switching from typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously achieving a 6.3× energy consumption reduction.
Full Text Available
Single-channel speech separation using soft-minimum permutation invariant training

https://doi.org/10.1016/j.specom.2023.05.005

Yousefi, Midia; Hansen, John H.L. (June 2023, Speech Communication)

The goal of speech separation is to extract multiple speech sources from a single microphone recording. Recently, with the advancement of deep learning and availability of large datasets, speech separation has been formulated as a supervised learning problem. These approaches aim to learn discriminative patterns of speech, speakers, and background noise using a supervised learning algorithm, typically a deep neural network. A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal, referred to as label permutation ambiguity. Permutation ambiguity refers to the problem of determining the output-label assignment between the separated sources and the available single-speaker speech labels. Finding the best output-label assignment is required for calculating the separation error, which is later used for updating the parameters of the model. Recently, Permutation Invariant Training (PIT) has been shown to be a promising solution for handling the label ambiguity problem. However, the overconfident choice of the output-label assignment by PIT results in a sub-optimal trained model. In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment. Our proposed method, entitled trainable Softminimum PIT, is then employed on the same Long Short-Term Memory (LSTM) architecture used in the PIT speech separation method. The results of our experiments show that the proposed method outperforms conventional PIT speech separation significantly (p-value < 0.01) by +1 dB in Signal to Distortion Ratio (SDR) and +1.5 dB in Signal to Interference Ratio (SIR).
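The difference between a hard and a soft choice of output-label assignment can be sketched as follows. This is an illustration only: the mean-squared-error criterion, the log-sum-exp form of the soft minimum, the fixed smoothing value `gamma`, and all names are assumptions, and the paper's exact trainable formulation is not reproduced here.

```python
import itertools
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def permutation_losses(estimates, references):
    """Average loss for every possible output-label assignment."""
    n = len(references)
    return {
        perm: sum(mse(estimates[i], references[perm[i]]) for i in range(n)) / n
        for perm in itertools.permutations(range(n))
    }


def pit_loss(losses):
    """Conventional PIT: commit to the single best assignment (hard minimum)."""
    return min(losses.values())


def soft_min_loss(losses, gamma):
    """One common smooth approximation of the minimum (assumed form):
    -(1/gamma) * log(sum(exp(-gamma * L))). Larger gamma approaches the
    hard minimum; smaller gamma spreads weight over all assignments."""
    values = list(losses.values())
    lowest = min(values)  # subtracted first for numerical stability
    total = sum(math.exp(-gamma * (v - lowest)) for v in values)
    return lowest - math.log(total) / gamma


# Two toy "sources" whose estimates come out in swapped order.
references = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
estimates = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
losses = permutation_losses(estimates, references)
hard = pit_loss(losses)
soft = soft_min_loss(losses, gamma=1.0)
```

Because the soft minimum is differentiable with respect to every assignment's loss, gradients are not driven by one possibly wrong assignment early in training, which is the kind of overconfidence the abstract attributes to conventional PIT.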
Full Text Available
Fearless Steps APOLLO: Challenges in keyword spotting and topic detection for naturalistic audio streams

Joglekar, A.; Lopez-Espejo, I.; Hansen, J.H.L. (May 2023, Program of the meeting Acoustical Society of America)

Fearless Steps (FS) APOLLO is a 50,000+ hr audio resource established by CRSS-UTDallas capturing all communications between NASA-MCC personnel, backroom staff, and Astronauts across manned Apollo Missions. Such a massive, unlabeled audio resource without metadata provides limited benefit for communities outside Speech-and-Language Technology (SLT). Supplementing this audio with rich metadata developed using robust automated mechanisms to transcribe and highlight naturalistic communications can facilitate open research opportunities for SLT, speech sciences, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125 h FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the availability of word-level annotations. Forced alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores challenges in extracting accurate keyword segments using existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams.
Full Text Available
Historical Audio Search and Preservation: Finding Waldo Within the Fearless Steps Apollo 11 Naturalistic Audio Corpus

https://doi.org/10.1109/MSP.2023.3237001

Chandra Shekar, Meena M.; Hansen, John H.L. (May 2023, IEEE Signal Processing Magazine)

Apollo 11 was the first manned space mission to successfully bring astronauts to the Moon and return them safely. As part of NASA’s goal in assessing team and mission success, all voice communications within mission control, astronauts, and support staff were captured using a multichannel analog system, which until recently had never been made available. More than 400 personnel served as mission specialists/support who communicated across 30 audio loops, resulting in 9,000+ h of data. It is essential to identify each speaker’s role during Apollo and analyze group communication to achieve a common goal. Manual annotation is costly, which makes robust speaker identification and tracking methods necessary. In this study, a 100 h subset derived from the collective 9,000 h of the Fearless Steps (FSteps) Apollo 11 audio data was investigated, corresponding to three critical mission phases: liftoff, lunar landing, and lunar walk. A speaker recognition assessment is performed on 140 speakers from a collective set of 183 NASA mission specialists who participated, based on sufficient training data obtained from 5 (out of 30) mission channels. We observe that SincNet performs the best in terms of accuracy and F-score, achieving 78.6% accuracy. Speaker models trained on specific phases are also compared with each other to determine if stress, g-force/atmospheric pressure, acoustic environments, etc., impact the robustness of the models. Higher performance was obtained using i-vector and x-vector systems for phases with limited data, such as liftoff and lunar walk. When provided with a sufficient amount of data (lunar landing phase), SincNet was shown to perform the best. This represents one of the first investigations on speaker recognition for massively large team-based communications involving naturalistic communication data.
In addition, we use the concept of “Where’s Waldo?” to identify key speakers of interest (SOIs) and track them over the complete FSteps audio corpus. This additional task provides an opportunity for the research community to transition the FSteps collection as an educational resource while also serving as a tribute to the “heroes behind the heroes of Apollo.”
Full Text Available
