Title: Adaptation and Frontend Features to Improve Naturalness in Found-Data Synthesis
We compare two approaches for training statistical parametric voices that make use of acoustic and prosodic features at the utterance level, with the aim of improving the naturalness of the resulting voices: subset adaptation, and adding new acoustic and prosodic features at the frontend. We find that labeling high, middle, or low values for a given feature at the frontend, and then choosing which setting to use at synthesis time, can produce voices rated as significantly more natural than a baseline voice that uses only the standard contextual frontend features, for both HMM-based and neural network-based synthesis.
Award ID(s): 1717680
PAR ID: 10097219
Author(s) / Creator(s):
Date Published: 2018
Journal Name: Speech Prosody 2018
Volume: 1
Page Range / eLocation ID: 794 to 798
Format(s): Medium: X
Sponsoring Org: National Science Foundation
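
As a concrete illustration of the frontend labeling scheme described in the abstract above, the sketch below bins a hypothetical utterance-level feature (mean F0) into terciles and assigns each utterance a low, middle, or high label. The feature values and the tercile split are assumptions for illustration; the paper's exact thresholds are not given here.

    import numpy as np

    def tercile_labels(values):
        """Bin utterance-level feature values into low/mid/high terciles."""
        lo, hi = np.percentile(values, [100 / 3, 200 / 3])
        return ["low" if v <= lo else "high" if v > hi else "mid" for v in values]

    # Hypothetical per-utterance mean-F0 values (Hz).
    mean_f0 = [118.0, 204.5, 151.2, 96.7, 187.9, 139.4]
    for f0, label in zip(mean_f0, tercile_labels(mean_f0)):
        # At synthesis time, one of the three settings is chosen and appended
        # to the standard contextual frontend features.
        print(f"mean_f0={f0:6.1f}  label={label}")

A small set of discrete settings is convenient here because contextual frontend features in HMM-based synthesis are clustered with decision trees, which operate naturally on categorical labels.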
More Like this
  1. ISCA (Ed.)
    In this paper, we explore automatic prediction of the dialect density of African American English (AAE), where dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We investigate several acoustic and language modeling features, including the commonly used X-vector representation and the ComParE feature set, in addition to prosodic information and information extracted from ASR transcripts of the audio files. To address issues of limited labeled data, we use a weakly supervised model to project prosodic and X-vector features into low-dimensional, task-relevant representations. An XGBoost model is then used to predict the speaker's dialect density from these features and to show which are most significant during inference. We evaluate the utility of these features, both alone and in combination, for the given task. This work, which does not rely on hand-labeled transcripts, is performed on audio segments from the CORAAL database. We show a significant correlation between our predicted and ground-truth dialect density measures for AAE speech in this database, and we propose this work as a tool for explaining and mitigating bias in speech technology.
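    The dialect density definition quoted in this abstract is straightforward to state in code. The sketch below is a minimal illustration; the set of flagged word indices is a hypothetical stand-in for the annotation or classification step, which the paper instead predicts from acoustic and language features.

        def dialect_density(words, marked_indices):
            """Percentage of words in an utterance with non-standard dialect features.

            `marked_indices` is a hypothetical set of word positions flagged as
            carrying AAE characteristics (by an annotator or a classifier).
            """
            return 100.0 * len(marked_indices) / len(words) if words else 0.0

        utterance = "he been working on that all day".split()
        print(dialect_density(utterance, {1}))  # "been" flagged: 1 of 7 words, ~14.3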
  2. We model coordination and coregulation patterns in 33 triads engaged in collaboratively solving a challenging computer programming task for approximately 20 minutes. Our goal is to prospectively model the speech rate (words/sec), an important signal of turn-taking and active participation, of one teammate (A, B, or C) from time-lagged nonverbal signals (speech rate and acoustic-prosodic features) of the other two (i.e., A + B → C; A + C → B; B + C → A) and from task-related context features. We trained feed-forward neural networks (FFNNs) and long short-term memory recurrent neural networks (LSTMs) using group-level nested cross-validation. LSTMs outperformed FFNNs and a chance baseline, and could predict speech rate up to 6 s into the future. A multimodal combination of speech rate, acoustic-prosodic, and task context features outperformed unimodal and bimodal signals. The extent to which the models could predict an individual's speech rate was positively related to that individual's scores on a subsequent posttest, suggesting a link between coordination/coregulation and collaborative learning outcomes. We discuss applications of the models for real-time systems that monitor the collaborative process and intervene to promote positive collaborative outcomes.
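    A minimal sketch of this prediction setup, written in PyTorch rather than as the authors' exact architecture: an LSTM consumes a window of time-lagged signals from two teammates and regresses the third teammate's future speech rate. The feature count, window length, and hidden size below are all assumptions.

        import torch
        import torch.nn as nn

        N_FEATURES = 8   # e.g., 2 teammates x (speech rate + 3 acoustic-prosodic features)
        WINDOW = 10      # number of time-lagged input steps

        class SpeechRateLSTM(nn.Module):
            def __init__(self, n_features=N_FEATURES, hidden=32):
                super().__init__()
                self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)  # predicted words/sec

            def forward(self, x):                 # x: (batch, WINDOW, n_features)
                out, _ = self.lstm(x)
                return self.head(out[:, -1])      # regress from the final time step

        model = SpeechRateLSTM()
        windows = torch.randn(4, WINDOW, N_FEATURES)  # four synthetic input windows
        print(model(windows).shape)                   # torch.Size([4, 1])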
  3. Acoustic cues are characteristic patterns in the speech signal that provide lexical, prosodic, or additional information, such as speaker identity. In particular, acoustic cues related to linguistic distinctive features can be extracted and marked from the speech signal, and these cues can be used to infer the intended underlying phoneme sequence in an utterance. This study describes a framework for labeling acoustic cues in speech, including a suite of canonical cue prediction algorithms that facilitates manual labeling and provides a standard for analyzing variations in surface realizations. A brief examination of subsets of annotated speech data shows that labeling acoustic cues opens the possibility of detailed analyses of cue modification patterns in speech.
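    One way to picture the output of such a labeling framework is one record per detected cue, which can then be compared against the canonical cue predictions. The field names below are assumptions for illustration, not the paper's actual label format.

        from dataclasses import dataclass

        @dataclass
        class CueLabel:
            time: float   # position in the signal, in seconds
            cue: str      # the acoustic cue, e.g., a landmark tied to a distinctive feature
            value: str    # surface realization relative to the canonical prediction

        # Hypothetical annotations; comparing them against canonical cue
        # predictions exposes modification patterns in the surface realization.
        labels = [
            CueLabel(0.142, "stop-burst", "present"),
            CueLabel(0.388, "nasal-murmur", "weakened"),
        ]
        print(labels[0])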
  4. Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically unpredictable sentences, a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion, and found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful way to evaluate or pre-select voices in future work.
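    Mel-cepstral distortion, the objective measure mentioned in this abstract, has a widely used standard form. The sketch below computes a frame-averaged MCD in dB, assuming the reference and synthesized mel-cepstra are already time-aligned (alignment, e.g., by dynamic time warping, is outside this sketch) and that the 0th (energy) coefficient has been excluded.

        import numpy as np

        def mel_cepstral_distortion(ref, syn):
            """Frame-averaged mel-cepstral distortion (dB) between two
            (n_frames, n_coeffs) arrays of mel-cepstral coefficients."""
            diff = ref - syn
            frame_mcd = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
            return float(np.mean(frame_mcd))

        # Synthetic example: two random 100-frame, 24-coefficient sequences.
        rng = np.random.default_rng(0)
        ref, syn = rng.normal(size=(100, 24)), rng.normal(size=(100, 24))
        print(f"MCD = {mel_cepstral_distortion(ref, syn):.2f} dB")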