Title: Audiovisual Speech Activity Detection with Advanced Long Short-Term Memory
Speech activity detection (SAD) is a key pre-processing step for speech-based systems. The performance of conventional audio-only SAD (A-SAD) systems is impaired by acoustic noise when they are used in practical applications. An alternative approach to address this problem is to include visual information, creating audiovisual speech activity detection (AV-SAD) solutions. In our previous work, we proposed to build an AV-SAD system using a bimodal recurrent neural network (BRNN). This framework was able to capture the task-related characteristics in the audio and visual inputs, and to model the temporal information within and across modalities. The approach relied on long short-term memory (LSTM). Although LSTMs can model long temporal dependencies with their memory cells, the effective memory of the units is limited to a few frames, since the recurrent connection only considers the previous frame. For SAD systems, it is important to model longer temporal dependencies to capture the semi-periodic nature of speech conveyed in acoustic and orofacial features. This study proposes to implement a BRNN-based AV-SAD system with advanced LSTMs (A-LSTMs), which overcome this limitation by including multiple connections to frames in the past. The results show that the proposed framework can significantly outperform the BRNN system trained with the original LSTM layers.
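As a rough illustration of the idea (not code from the paper), the PyTorch sketch below combines the hidden states from a short look-back window with learned attention weights and feeds that summary back into a standard LSTM cell, so the recurrence sees several past frames instead of only the previous one. The window length k, the layer sizes, and the attention scheme are assumptions made for this example; the A-LSTM in the paper may combine past states differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFrameRecurrentCell(nn.Module):
    """Illustrative A-LSTM-style layer: the recurrent input is a learned
    convex combination of the last `k` hidden states, not just h_{t-1}."""

    def __init__(self, input_size, hidden_size, k=5):
        super().__init__()
        self.k = k
        self.cell = nn.LSTMCell(input_size, hidden_size)
        self.attn = nn.Linear(hidden_size, 1)  # one score per past hidden state

    def forward(self, x):
        # x: (batch, time, input_size)
        batch, steps, _ = x.shape
        h = x.new_zeros(batch, self.cell.hidden_size)
        c = x.new_zeros(batch, self.cell.hidden_size)
        history = [h]                                          # past hidden states
        outputs = []
        for t in range(steps):
            past = torch.stack(history[-self.k:], dim=1)       # (batch, <=k, hidden)
            weights = F.softmax(self.attn(past).squeeze(-1), dim=-1)
            h_mix = (weights.unsqueeze(-1) * past).sum(dim=1)  # summary of the window
            h, c = self.cell(x[:, t], (h_mix, c))
            history.append(h)
            outputs.append(h)
        return torch.stack(outputs, dim=1)                     # (batch, time, hidden)
```

In an AV-SAD setup of this kind, a layer like this would stand in for each plain LSTM layer in the audio and visual branches of the BRNN.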
Award ID(s): 1453781, 1718944
PAR ID: 10104181
Author(s) / Creator(s):
Date Published:
Journal Name: Interspeech 2018
Page Range / eLocation ID: 1244 to 1248
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Speech activity detection (SAD) serves as a crucial front-end for several downstream Speech and Language Technology (SLT) tasks such as speaker diarization, speaker identification, and speech recognition. Recent years have seen deep learning (DL)-based SAD systems designed to improve robustness against static background noise and interfering speakers. However, SAD performance can be severely limited for conversations recorded in naturalistic environments due to dynamic acoustic scenarios and previously unseen non-speech artifacts. In this letter, we propose an end-to-end deep learning framework designed to be robust to time-varying noise profiles observed in naturalistic audio. We develop a novel SAD solution for the UTDallas Fearless Steps Apollo corpus based on NASA’s Apollo missions. The proposed system leverages spectro-temporal correlations with a threshold optimization mechanism to adjust to acoustic variabilities across multiple channels and missions. This system is trained and evaluated on the Fearless Steps Challenge (FSC) corpus (a subset of the Apollo corpus). Experimental results indicate a high degree of adaptability to out-of-domain data, achieving a relative Detection Cost Function (DCF) improvement of over 50% compared to the previous FSC baselines and state-of-the-art (SOTA) SAD systems. The proposed model also outperforms the most recent DL-based SOTA systems from FSC Phase-4. Ablation analysis is conducted to confirm the efficacy of the proposed spectro-temporal features.
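A minimal sketch of the threshold-optimization step described above (not the authors' implementation): frame-level speech posteriors from the network are swept over a grid of thresholds, and the value minimizing the detection cost function (DCF) on held-out data is kept. The 0.75/0.25 miss/false-alarm weighting follows the usual Fearless Steps convention, but it and the synthetic data below are assumptions for illustration.

```python
import numpy as np

def detection_cost(posteriors, labels, threshold, w_miss=0.75, w_fa=0.25):
    """DCF = w_miss * P(miss) + w_fa * P(false alarm) at a given threshold.
    `posteriors` are frame-level speech probabilities, `labels` are 0/1."""
    decisions = posteriors >= threshold
    speech, nonspeech = labels == 1, labels == 0
    p_miss = np.mean(~decisions[speech]) if speech.any() else 0.0
    p_fa = np.mean(decisions[nonspeech]) if nonspeech.any() else 0.0
    return w_miss * p_miss + w_fa * p_fa

def tune_threshold(posteriors, labels, grid=np.linspace(0.05, 0.95, 91)):
    """Pick the threshold minimizing DCF, e.g. per channel or per mission."""
    costs = [detection_cost(posteriors, labels, t) for t in grid]
    best = int(np.argmin(costs))
    return grid[best], costs[best]

# Toy usage with synthetic posteriors standing in for network outputs.
rng = np.random.default_rng(0)
labels = (rng.random(10_000) < 0.6).astype(int)
posteriors = np.clip(labels * 0.7 + rng.normal(0.15, 0.2, labels.shape), 0.0, 1.0)
print(tune_threshold(posteriors, labels))
```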
  2. We model coordination and coregulation patterns in 33 triads engaged in collaboratively solving a challenging computer programming task for approximately 20 minutes. Our goal is to prospectively model speech rate (words/sec) – an important signal of turn taking and active participation – of one teammate (A, B, or C) from time-lagged nonverbal signals (speech rate and acoustic-prosodic features) of the other two (i.e., A + B → C; A + C → B; B + C → A) and task-related context features. We trained feed-forward neural networks (FFNNs) and long short-term memory recurrent neural networks (LSTMs) using group-level nested cross-validation. LSTMs outperformed FFNNs and a chance baseline and could predict speech rate up to 6 s into the future. A multimodal combination of speech rate, acoustic-prosodic, and task context features outperformed unimodal and bimodal signals. The extent to which the models could predict an individual’s speech rate was positively related to that individual’s scores on a subsequent posttest, suggesting a link between coordination/coregulation and collaborative learning outcomes. We discuss applications of the models for real-time systems that monitor the collaborative process and intervene to promote positive collaborative outcomes.
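A hypothetical sketch of the forecasting setup: time-lagged features from two teammates (plus task context) are stacked into a short sequence, and an LSTM regresses the third teammate's upcoming speech rate. The feature dimension, lag window, and frame rate below are placeholders rather than values from the study.

```python
import torch
import torch.nn as nn

class SpeechRateForecaster(nn.Module):
    """Predict one teammate's speech rate from the other two teammates'
    time-lagged signals plus task-context features (illustrative sizes)."""

    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)   # scalar speech rate (words/sec)

    def forward(self, lagged_feats):
        # lagged_feats: (batch, lag_frames, feat_dim) built from A + B (+ context)
        _, (h_n, _) = self.lstm(lagged_feats)
        return self.head(h_n[-1]).squeeze(-1)

# Hypothetical call: 8 sequences, 6-second window at 2 frames/sec, 40-dim features.
model = SpeechRateForecaster()
window = torch.randn(8, 12, 40)
predicted_rate_of_c = model(window)     # shape: (8,)
print(predicted_rate_of_c.shape)
```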
  3. The widespread use of distributed energy resources (DERs) raises significant challenges for power system design, planning, and operation, leading to wide adoption of tools for hosting capacity analysis (HCA). Traditional HCA methods conduct extensive power flow analysis; because of this computational burden, these time-consuming methods fail to provide online hosting capacity (HC) estimates in large distribution systems. To solve the problem, we first propose a deep learning-based problem formulation for HCA, which trains offline and determines HC in real time. The learning model, a long short-term memory (LSTM) network, uses historical time-series data to capture periodic patterns in distribution systems. However, directly applying LSTMs suffers from low accuracy because spatial information is ignored, even though location information such as feeder topology is critical in nodal HCA. Therefore, we modify the forget gate into dual forget gates to capture the spatial correlation within the grid, turning the LSTM into a Spatial-Temporal LSTM (ST-LSTM). Moreover, as voltage violations are the most critical constraints in HCA, we design a voltage sensitivity gate to further increase accuracy. Results for LSTMs and ST-LSTMs on the IEEE 34-bus and 123-bus feeders, as well as utility feeders, validate our designs.
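To make the dual-forget-gate idea concrete, the sketch below (a reconstruction from the abstract, not the authors' code) keeps the usual temporal forget gate and adds a second forget gate applied to a spatial cell state summarizing neighboring buses; the voltage sensitivity gate is omitted, and all dimensions, as well as how the spatial state is built, are assumptions.

```python
import torch
import torch.nn as nn

class DualForgetLSTMCell(nn.Module):
    """LSTM-style cell with two forget gates: one for the temporal cell
    state and one for a spatial cell state from neighboring nodes."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 5 * hidden_size)

    def forward(self, x, h_prev, c_time, c_space):
        z = self.gates(torch.cat([x, h_prev], dim=-1))
        i, f_t, f_s, o, g = z.chunk(5, dim=-1)          # input, 2 forget, output, candidate
        i, f_t, f_s, o = map(torch.sigmoid, (i, f_t, f_s, o))
        g = torch.tanh(g)
        # Temporal memory gated by f_t; spatial memory (e.g. aggregated from
        # neighboring buses on the feeder) gated by f_s.
        c_new = f_t * c_time + f_s * c_space + i * g
        h_new = o * torch.tanh(c_new)
        return h_new, c_new

# Toy call with assumed sizes: 16 nodes, 8 input features, 32 hidden units.
cell = DualForgetLSTMCell(8, 32)
x = torch.randn(16, 8)
h = c_t = c_s = torch.zeros(16, 32)
h, c = cell(x, h, c_t, c_s)
print(h.shape)
```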