Title: Text-Independent Speaker Verification Using 3D Convolutional Neural Networks
In this paper, a novel method using a 3D Convolutional Neural Network (3D-CNN) architecture is proposed for speaker verification in the text-independent setting. One of the main challenges is the creation of the speaker models. Most previously reported approaches create speaker models by averaging the features extracted from the speaker's utterances, which is known as the d-vector system. In our paper, we propose adaptive feature learning that uses 3D-CNNs for direct speaker model creation: in both the development and enrollment phases, an identical number of spoken utterances per speaker is fed to the network to represent the speaker's utterances and create the speaker model. This simultaneously captures speaker-related information and builds a system that is more robust to within-speaker variation. We demonstrate that the proposed method significantly outperforms the traditional d-vector verification system. Moreover, by utilizing 3D-CNNs, the proposed system can serve as a one-shot speaker modeling alternative to the traditional d-vector system.
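The d-vector baseline mentioned in the abstract, in which a speaker model is the average of per-utterance features, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the toy three-dimensional embeddings, the cosine scoring, and the 0.8 threshold are all assumptions made for the example.

```python
import math

def average_embeddings(embeddings):
    """Average per-utterance embeddings into one speaker model (d-vector style)."""
    if not embeddings:
        raise ValueError("need at least one utterance embedding")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(speaker_model, test_embedding, threshold=0.8):
    """Accept the identity claim if the score reaches the (hypothetical) threshold."""
    return cosine_similarity(speaker_model, test_embedding) >= threshold

# Toy, made-up enrollment embeddings for one speaker (assumption for illustration).
enrollment = [[1.0, 0.0, 0.2], [0.8, 0.2, 0.0], [0.9, 0.1, 0.1]]
speaker_model = average_embeddings(enrollment)
```

In this baseline the averaging happens after feature extraction, whereas the abstract's proposal feeds a fixed number of utterances per speaker to the 3D-CNN so that the network itself produces the speaker model.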
Award ID(s):
1650474
PAR ID:
10091245
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE International Conference on Multimedia and Expo (ICME)
Page Range / eLocation ID:
1 to 6
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The manner in which acoustic features contribute to perceiving speaker identity remains unclear. In an attempt to better understand speaker perception, we investigated human and machine speaker discrimination with utterances shorter than 2 seconds. Sixty-five listeners performed a same vs. different task. Machine performance was estimated with i-vector/PLDA-based automatic speaker verification systems, one using mel-frequency cepstral coefficients (MFCCs) and the other using voice quality features (VQual2) inspired by a psychoacoustic model of voice quality. Machine performance was measured in terms of the detection and log-likelihood-ratio cost functions. Humans showed higher confidence for correct target decisions compared to correct non-target decisions, suggesting that they rely on different features and/or decision making strategies when identifying a single speaker compared to when distinguishing between speakers. For non-target trials, responses were highly correlated between humans and the VQual2-based system, especially when speakers were perceptually marked. Fusing human responses with an MFCC-based system improved performance over human-only or MFCC-only results, while fusing with the VQual2-based system did not. The study is a step towards understanding human speaker discrimination strategies and suggests that automatic systems might be able to supplement human decisions especially when speakers are marked. 
  2. This paper presents a novel zero-shot learning approach towards personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems towards a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. The gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation towards unseen test-time speakers. 
  3. Speaker diarization determines "who spoke and when?" in an audio stream. In this study, we propose a model-based approach for robust speaker clustering using i-vectors. The i-vectors extracted from different segments of the same speaker are correlated. We model this correlation with a Markov Random Field (MRF) network. Leveraging advancements in MRF modeling, we use a Toeplitz Inverse Covariance (TIC) matrix to represent the MRF correlation network for each speaker. This approach captures the sequential structure of i-vectors (or equivalent speaker turns) belonging to the same speaker in an audio stream. A variant of the standard Expectation Maximization (EM) algorithm is adopted for deriving a closed-form solution using dynamic programming (DP) and the alternating direction method of multipliers (ADMM). Our diarization system has four steps: (1) ground-truth segmentation; (2) i-vector extraction; (3) post-processing (mean subtraction, principal component analysis, and length-normalization); and (4) proposed speaker clustering. We employ cosine K-means and movMF speaker clustering as baseline approaches. Our evaluation data is derived from: (i) the CRSS-PLTL corpus, and (ii) a two-meeting subset of the AMI corpus. The relative reduction in diarization error rate (DER) for the CRSS-PLTL corpus is 43.22% using the proposed advancements as compared to the baseline. For AMI meetings IS1000a and IS1003b, the relative DER reduction is 29.37% and 9.21%, respectively.
  4. In this paper, a data-driven method is proposed for fast cascading outage screening in power systems. The proposed method combines a deep convolutional neural network (deep CNN) and a depth-first search (DFS) algorithm. First, a deep CNN is constructed as a security assessment tool to evaluate system security status based on observable information. With its automatic feature extraction ability and high generalization, a well-trained deep CNN can produce estimated AC optimal power flow (ACOPF) results for various uncertain operation scenarios, i.e., load fluctuations and system topology changes, in a nearly computation-free manner. Second, a scenario tree is built to represent the potential operation scenarios and the associated cascading outages. The DFS algorithm is developed as a fast screening tool to calculate the expected security index value for each cascading outage path along the entire tree, which can serve as a reference for system operators taking predictive measures against system collapse. Simulation results from applying the proposed deep CNN and DFS algorithm to standard test cases verify their accuracy, and the computation is thousands of times faster than the traditional model-based approach, which implies great potential for online applications.
  5. Golpira, Hemin (Ed.)
    The paper proposes an approach for fast small-signal stability assessment on a short data window using deep learning algorithms. It shows that the proposed assessment approach, based on deep convolutional neural networks (CNNs), is faster than traditional methods (e.g., Prony's method). The evaluated CNNs are a fully convolutional network (FCN), a CNN with sub-sampling steps performed through max pooling (Time LeNet), a time CNN, a fully convolutional network with an attention mechanism (Encoder), and a CNN with a shortcut residual connection (ResNet). The proposed approach is validated on different synthetic measurement data sets generated from the IEEE 9-bus system, which is used as a reference, and further applied to a 769-bus system representing a region in the U.S. Eastern Interconnection. We show that precision and recall are more informative metrics than accuracy for the reliability of the stability assessment process using the proposed methodology. In addition, the method's efficiency is compared to that of the classical Prony method.
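
    The remark in item 5 that precision and recall can be more informative than accuracy is easiest to see on imbalanced data. The sketch below is illustrative only: the labels (1 for "unstable", 0 for "stable"), the 95/5 class split, and the always-"stable" classifier are assumptions, not data or definitions from that paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary labeling."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall); undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical imbalanced set: 95 stable cases (0) and 5 unstable cases (1).
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts "stable".
y_pred = [0] * 100

accuracy, precision, recall = metrics(y_true, y_pred)
```

    Here the trivial classifier reaches an accuracy of 0.95 while its precision and recall on the unstable class are both 0.0, so accuracy alone would hide the fact that it never detects an unstable case.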