Title: Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting
In the context of keyword spotting (KWS), replacing handcrafted speech features with learnable features has not yielded superior KWS performance. In this study, we demonstrate that filterbank learning outperforms handcrafted speech features for KWS when the number of filterbank channels is severely reduced. Reducing the number of channels may cause some drop in KWS performance, but it also substantially reduces energy consumption, which is key when deploying always-on KWS on low-resource devices. Experimental results on a noisy version of the Google Speech Commands Dataset show that filterbank learning adapts to noise characteristics to provide a higher degree of robustness to noise, especially when dropout is integrated. Thus, switching from the typically used 40-channel log-Mel features to 8-channel learned features leads to a relative KWS accuracy loss of only 3.5% while simultaneously achieving a 6.3× reduction in energy consumption.
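To make the 40-channel versus 8-channel comparison concrete, the following is a minimal sketch of the handcrafted baseline the abstract refers to: log-Mel features computed from one audio frame with a fixed triangular filterbank. It is not the authors' implementation. The function names, the 16 kHz sample rate, the 512-point frame, and the 1 kHz test tone are all illustrative assumptions. In filterbank learning, the weights in the `bank` matrix would be trainable parameters optimized jointly with the KWS model rather than fixed by the mel formula.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_channels, n_fft=512, sample_rate=16000):
    """Fixed triangular filters spaced evenly on the mel scale.

    Returns n_channels rows, each with n_fft // 2 + 1 non-negative weights.
    A learned filterbank would replace these fixed weights with trainable ones.
    """
    n_bins = n_fft // 2 + 1
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top_mel * i / (n_channels + 1))
             for i in range(n_channels + 2)]
    bank = []
    for c in range(n_channels):
        lo, mid, hi = edges[c], edges[c + 1], edges[c + 2]
        bank.append([
            max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
            for f in bin_hz
        ])
    return bank


def power_spectrum(frame):
    """Hann-windowed power spectrum via a naive DFT (kept dependency-free)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed))) ** 2
        for k in range(n // 2 + 1)
    ]


def log_features(power, bank):
    """Log filterbank energies; the small offset avoids log(0)."""
    return [math.log(sum(w * p for w, p in zip(row, power)) + 1e-6)
            for row in bank]


# Illustrative input: one 512-sample frame of a 1 kHz tone at 16 kHz.
frame = [math.sin(2.0 * math.pi * 1000.0 * i / 16000) for i in range(512)]
power = power_spectrum(frame)
feats40 = log_features(power, mel_filterbank(40))
feats8 = log_features(power, mel_filterbank(8))
print(len(feats40), len(feats8))  # 40 8
```

Each frame yields one value per channel, so going from 40 to 8 channels shrinks the per-frame input to the downstream classifier by a factor of five. The energy figures reported in the abstract come from the authors' own measurements and cannot be derived from this sketch.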
Award ID(s):
2016725
PAR ID:
10484456
Publisher / Repository:
IEEE
Journal Name:
IEEE ICASSP-2023: International Conference on Acoustics, Speech, and Signal Processing
Edition / Version:
Paper #1986
ISBN:
978-1-7281-6327-7
Page Range / eLocation ID:
1 to 5
Subject(s) / Keyword(s):
Keyword spotting; filterbank learning; small footprint; noise robustness; end-to-end
Format(s):
Medium: X Size: 1MB
Location:
Rhodes Island, Greece
Sponsoring Org:
National Science Foundation
More Like this
  1. Designing low-cost filterbanks is important due to severe resource limitations imposed by hearing aid size. Here, we develop a novel FIR filterbank employing stochastic computing (SC). SC-based filters use (pseudo)-random bitstreams to efficiently perform the core filtering operation. We demonstrate that SC is well-suited to low-cost filterbank design and compare our SC filterbank to a conventional sequential binary (SB) design. We show that the SC design achieves the same accuracy and latency as the SB one, with an exceptionally large 70% reduction in chip area. The power consumption of our proposed SC filterbank is 38-96% that of the SB design. 
  2. Fearless Steps (FS) APOLLO is a 50,000+ hour audio resource established by CRSS-UTDallas, capturing all communications between NASA-MCC personnel, backroom staff, and astronauts across the manned Apollo missions. Such a massive audio resource, unlabeled and without metadata, provides limited benefit for communities outside Speech-and-Language Technology (SLT). Supplementing this audio with rich metadata, developed using robust automated mechanisms to transcribe and highlight naturalistic communications, can facilitate open research opportunities for the SLT, speech sciences, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125-hour FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the limited availability of word-level annotations. Forced alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores challenges in extracting accurate keyword segments using existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams.
  3. Objective. Neurological disorders affecting speech production adversely impact quality of life for over 7 million individuals in the US. Traditional speech interfaces like eye-tracking devices and P300 spellers are slow and unnatural for these patients. An alternative solution, speech brain-computer interfaces (BCIs), directly decodes speech characteristics, offering a more natural communication mechanism. This research explores the feasibility of decoding speech features using non-invasive EEG. Approach. Nine neurologically intact participants were equipped with a 63-channel EEG system with additional sensors to eliminate eye artifacts. Participants read aloud sentences selected for phonetic similarity to the English language. Deep learning models, including Convolutional Neural Networks and Recurrent Neural Networks with and without attention modules, were optimized with a focus on minimizing trainable parameters and utilizing small input window sizes for real-time application. These models were employed for discrete and continuous speech decoding tasks. Main results. Statistically significant participant-independent decoding performance was achieved for discrete classes and continuous characteristics of the produced audio signal. A frequency sub-band analysis highlighted the significance of certain frequency bands (delta, theta, gamma) for decoding performance, and a perturbation analysis was used to identify crucial channels. Assessed channel selection methods did not significantly improve performance, suggesting a distributed representation of speech information encoded in the EEG signals. Leave-One-Out training demonstrated the feasibility of utilizing common speech neural correlates, reducing data collection requirements from individual participants. Significance. These findings contribute significantly to the development of EEG-enabled speech synthesis by demonstrating the feasibility of decoding both discrete and continuous speech features from EEG signals, even in the presence of EMG artifacts. By addressing the challenges of EMG interference and optimizing deep learning models for speech decoding, this study lays a strong foundation for EEG-based speech BCIs.
  4. Speech activity detection (SAD) serves as a crucial front-end system to several downstream Speech and Language Technology (SLT) tasks such as speaker diarization, speaker identification, and speech recognition. Recent years have seen deep learning (DL)-based SAD systems designed to improve robustness against static background noise and interfering speakers. However, SAD performance can be severely limited for conversations recorded in naturalistic environments due to dynamic acoustic scenarios and previously unseen non-speech artifacts. In this letter, we propose an end-to-end deep learning framework designed to be robust to time-varying noise profiles observed in naturalistic audio. We develop a novel SAD solution for the UTDallas Fearless Steps Apollo corpus based on NASA’s Apollo missions. The proposed system leverages spectro-temporal correlations with a threshold optimization mechanism to adjust to acoustic variabilities across multiple channels and missions. This system is trained and evaluated on the Fearless Steps Challenge (FSC) corpus (a subset of the Apollo corpus). Experimental results indicate a high degree of adaptability to out-of-domain data, achieving a relative Detection Cost Function (DCF) performance improvement of over 50% compared to the previous FSC baselines and state-of-the-art (SOTA) SAD systems. The proposed model also outperforms the most recent DL-based SOTA systems from FSC Phase-4. Ablation analysis is conducted to confirm the efficacy of the proposed spectro-temporal features. 
  5. We introduce a deep learning model for speech denoising, a long-standing challenge in audio analysis arising in numerous applications. Our approach is based on a key observation about human speech: there is often a short pause between each sentence or word. In a recorded speech signal, those pauses introduce a series of time periods during which only noise is present. We leverage these incidental silent intervals to learn a model for automatic speech denoising given only mono-channel audio. Detected silent intervals over time expose not just pure noise but its time-varying features, allowing the model to learn noise dynamics and suppress it from the speech signal. Experiments on multiple datasets confirm the pivotal role of silent interval detection for speech denoising, and our method outperforms several state-of-the-art denoising methods, including those that accept only audio input (like ours) and those that denoise based on audiovisual input (and hence require more information). We also show that our method enjoys excellent generalization properties, such as denoising spoken languages not seen during training. 