This content will become publicly available on August 22, 2026

Title: An Age-Agnostic System for Robust Speaker Verification
In speaker verification (SV), the acoustic mismatch between children’s and adults’ speech leads to suboptimal performance when adult-trained SV systems are applied to children’s speaker verification (C-SV). While domain adaptation techniques can enhance performance on C-SV tasks, they often do so at the expense of significant degradation in performance on adults’ SV (A-SV) tasks. In this study, we propose an Age Agnostic Speaker Verification (AASV) system that achieves robust performance across both C-SV and A-SV tasks. Our approach employs a domain classifier to disentangle age-related attributes from speech and subsequently expands the embedding space using the extracted domain information, forming a unified speaker representation that is robust and highly discriminative across age groups. Experiments on the OGI and VoxCeleb datasets demonstrate the effectiveness of our approach in bridging SV performance disparities, laying the foundation for inclusive and age-adaptive SV systems.
Award ID(s):
2006979
PAR ID:
10646711
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ISCA
Date Published:
Page Range / eLocation ID:
41 to 45
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Speaker Verification (SV) systems trained on adults’ speech often underperform on children’s SV due to the acoustic mismatch, and the limited availability of children’s speech data makes fine-tuning less effective. In this paper, we propose an innovative framework, a Gated Linear Unit adapter with Iterative Fine-Tuning (G-IFT), to enhance knowledge transfer efficiency between the high-resource adults’ speech domain and the low-resource children’s speech domain. In this framework, a Gated Linear Unit adapter is first inserted between the pre-trained speaker embedding model and the classifier. Then the classifier, adapter, and pre-trained speaker embedding model are optimized sequentially in an iterative way. This framework is agnostic to the underlying architecture of the SV system. Our experiments on ECAPA-TDNN, ResNet, and X-vector architectures using the OGI and MyST datasets demonstrate that the G-IFT framework yields consistent reductions in Equal Error Rates compared to baseline methods.
  2. Speech activity detection (SAD) serves as a crucial front-end system to several downstream Speech and Language Technology (SLT) tasks such as speaker diarization, speaker identification, and speech recognition. Recent years have seen deep learning (DL)-based SAD systems designed to improve robustness against static background noise and interfering speakers. However, SAD performance can be severely limited for conversations recorded in naturalistic environments due to dynamic acoustic scenarios and previously unseen non-speech artifacts. In this letter, we propose an end-to-end deep learning framework designed to be robust to time-varying noise profiles observed in naturalistic audio. We develop a novel SAD solution for the UTDallas Fearless Steps Apollo corpus based on NASA’s Apollo missions. The proposed system leverages spectro-temporal correlations with a threshold optimization mechanism to adjust to acoustic variabilities across multiple channels and missions. This system is trained and evaluated on the Fearless Steps Challenge (FSC) corpus (a subset of the Apollo corpus). Experimental results indicate a high degree of adaptability to out-of-domain data, achieving a relative Detection Cost Function (DCF) performance improvement of over 50% compared to the previous FSC baselines and state-of-the-art (SOTA) SAD systems. The proposed model also outperforms the most recent DL-based SOTA systems from FSC Phase-4. Ablation analysis is conducted to confirm the efficacy of the proposed spectro-temporal features. 
  3. Fearless Steps (FS) APOLLO is a 50,000+ hr audio resource established by CRSS-UTDallas capturing all communications between NASA-MCC personnel, backroom staff, and astronauts across manned Apollo missions. Such a massive audio resource, unlabeled and without metadata, provides limited benefit for communities outside Speech-and-Language Technology (SLT). Supplementing this audio with rich metadata developed using robust automated mechanisms to transcribe and highlight naturalistic communications can facilitate open research opportunities for SLT, speech sciences, education, and historical archival communities. In this study, we focus on customizing keyword spotting (KWS) and topic detection systems as an initial step towards conversational understanding. Extensive research in automatic speech recognition (ASR), speech activity detection, and speaker diarization using the manually transcribed 125 h FS Challenge corpus has demonstrated the need for robust domain-specific model development. A major challenge in training KWS systems and topic detection models is the availability of word-level annotations. Forced alignment schemes evaluated using state-of-the-art ASR show significant degradation in segmentation performance. This study explores challenges in extracting accurate keyword segments using existing sentence-level transcriptions and proposes domain-specific KWS-based solutions to detect conversational topics in audio streams.
  4. Continuous speaker separation aims to separate overlapping speakers in real-world environments like meetings, but it often falls short in isolating speech segments of a single speaker. This leads to split signals that adversely affect downstream applications such as automatic speech recognition and speaker diarization. Existing solutions like speaker counting have limitations. This paper presents a novel multi-channel approach for continuous speaker separation based on multi-input multi-output (MIMO) complex spectral mapping. This MIMO approach enables robust speaker localization by preserving inter-channel phase relations. Speaker localization as a byproduct of the MIMO separation model is then used to identify single-talker frames and reduce speaker splitting. We demonstrate that this approach achieves superior frame-level sound localization. Systematic experiments on the LibriCSS dataset further show that the proposed approach outperforms other methods, advancing state-of-the-art speaker separation performance. 
  5. Cavicchio, Federica (Ed.)
    Accounts of speech perception disagree on how listeners demonstrate perceptual constancy despite considerable variation in the speech signal due to speakers’ coarticulation. According to the spectral contrast account, listeners’ compensation for coarticulation (CfC) results from listeners perceiving the target-segment frequencies differently depending on the contrastive effects exerted by the preceding sound’s frequencies. In this study, we reexamine a notable finding that listeners apparently demonstrate perceptual adjustments to coarticulation even when the identity of the speaker (i.e., the “source”) changes midway between speech segments. We evaluated these apparent across-talker CfC effects on the rationale that such adjustments to coarticulation would likely be maladaptive for perceiving speech in multi-talker settings. In addition, we evaluated whether such cross-talker adaptations, if detected, were modulated by prior experience. We did so by manipulating the exposure phase of three groups of listeners by (a) merely exposing them to our stimuli, (b) explicitly alerting them to talker change, or (c) implicitly alerting them to this change. All groups then completed identical test blocks in which we assessed their CfC patterns in within- and across-talker conditions. Our results uniformly demonstrated that, while all three groups showed robust CfC shifts in the within-talker conditions, no such shifts were detected in the across-talker condition. Our results call into question a speaker-neutral explanation for CfC. Broadly, this demonstrates the need to carefully examine the perceptual demands placed on listeners in constrained experimental tasks and to evaluate whether the accounts that derive from such settings scale up to the demands of real-world listening.