Title: Adaptation and Frontend Features to Improve Naturalness in Found-Data Synthesis
We compare two approaches for training statistical parametric voices that make use of acoustic and prosodic features at the utterance level, with the aim of improving the naturalness of the resulting voices: subset adaptation, and adding new acoustic and prosodic features at the frontend. We have found that the approach of labeling high, middle, or low values for a given feature at the frontend, and then choosing which setting to use at synthesis time, can produce voices rated as significantly more natural than a baseline voice that uses only the standard contextual frontend features, for both HMM-based and neural network-based synthesis.
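The high/middle/low labeling scheme described above can be sketched in a few lines. Below is a minimal illustration, assuming an utterance-level feature such as mean F0 and a simple tertile split over the training corpus; the feature choice, thresholds, and names are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of the high/mid/low frontend-labeling idea from the abstract.
# The feature (mean F0) and the tertile split are illustrative assumptions.
import numpy as np

def tertile_labels(values):
    """Map each utterance-level feature value to 'low', 'mid', or 'high'."""
    lo, hi = np.quantile(values, [1/3, 2/3])  # tertile boundaries over the corpus
    return ["low" if v <= lo else "high" if v > hi else "mid" for v in values]

# Hypothetical per-utterance feature values (e.g., mean F0 in Hz).
mean_f0 = np.array([110.0, 180.5, 145.2, 201.3, 132.8, 167.0])
labels = tertile_labels(mean_f0)

# Each label would be appended to the standard contextual frontend features
# for its utterance; at synthesis time the desired setting is chosen directly.
for f0, lab in zip(mean_f0, labels):
    print(f"mean_f0={f0:6.1f} Hz -> frontend label: {lab}")
```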
Award ID(s): 1717680
PAR ID: 10058671
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of Speech Prosody 2018
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Neuromorphic vision sensors (NVS), also known as silicon retina, capture aspects of the biological functionality of the mammalian retina by transducing incident photocurrent into an asynchronous stream of spikes that denote positive and negative changes in intensity. Current state-of-the-art devices are effectively leveraged in a variety of settings, but still suffer from distinct disadvantages as they are transitioned into high-performance environments, such as space and autonomy. This paper provides an outline and demonstration of a data synthesis tool that gleans characteristics from the retina and allows the user to not only convert traditional video into neuromorphic data, but also characterize design tradeoffs and inform future endeavors. Our retinomorphic model, RetinoSim, incorporates aspects of current NVS to allow for accurate data conversion while providing biologically inspired features to improve upon this baseline. RetinoSim was implemented in MATLAB with a Graphical User Interface frontend to allow for expeditious video conversion and architecture exploration. We demonstrate that the tool can be used for real-time conversion of sparse event streams, exploration of frontend configurations, and duplication of existing event datasets.
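As a rough illustration of the kind of frame-to-event conversion such a tool performs, the sketch below emits positive and negative events wherever per-pixel log intensity changes by more than a threshold, in the style of a DVS-like sensor. RetinoSim itself is a MATLAB tool with a richer retina-inspired pipeline; the threshold value, the helper name frames_to_events, and the synthetic input are assumptions made for illustration.

```python
# Minimal DVS-style frame-to-event conversion: emit (+/-) events where
# log intensity changes exceed a threshold. Threshold and inputs are assumed.
import numpy as np

def frames_to_events(frames, threshold=0.2, eps=1e-6):
    """Yield (t, y, x, polarity) events from a stack of grayscale frames."""
    ref = np.log(frames[0] + eps)          # per-pixel log-intensity reference
    for t, frame in enumerate(frames[1:], start=1):
        cur = np.log(frame + eps)
        diff = cur - ref
        for polarity, mask in ((+1, diff >= threshold), (-1, diff <= -threshold)):
            ys, xs = np.nonzero(mask)
            for y, x in zip(ys, xs):
                yield (t, y, x, polarity)
            ref[mask] = cur[mask]          # reset reference where events fired

# Synthetic example: a bright square moving across a dark background.
frames = np.zeros((5, 32, 32), dtype=np.float32)
for t in range(5):
    frames[t, 10:20, 5 + 4 * t:15 + 4 * t] = 1.0
events = list(frames_to_events(frames))
print(f"{len(events)} events from {len(frames)} frames")
```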
  2. Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can in turn be used to generate spoofed audio in his or her voice. Generating an accurate model of a target's voice requires the availability of a corpus of the target's speech. This paper makes the observation that the increasing popularity of voice interfaces that use cloud-backed speech recognition (e.g., Siri, Google Assistant, Amazon Alexa) increases the public's vulnerability to voice synthesis attacks. That is, our growing dependence on voice interfaces fosters the collection of our voices. As our main contribution, we show that voice recognition and voice accumulation (that is, the accumulation of users' voices) are separable. This paper introduces techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing. In essence, such methods employ audio processing techniques to remove distinctive voice characteristics, leaving only the information that is necessary for the cloud-based services to perform speech recognition. Our preliminary experiments show that our defenses prevent state-of-the-art voice synthesis techniques from constructing convincing forgeries of a user's speech, while still permitting accurate voice recognition.
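As a rough sketch of local sanitization before cloud processing, the example below perturbs speaker-identifying pitch characteristics with a simple pitch shift while leaving the signal intelligible enough for recognition. This is only one illustrative transformation, not necessarily the paper's actual technique; the input file name query.wav and the shift amount are assumptions.

```python
# Illustrative local voice sanitization before uploading audio to a cloud
# recognizer. A pitch shift is one simple way to perturb speaker-identifying
# characteristics; the paper's defenses may use different transformations.
import librosa
import soundfile as sf

y, sr = librosa.load("query.wav", sr=16000)          # hypothetical input file

# Shift the pitch by a few semitones: intelligibility (and thus recognition)
# is largely preserved, while speaker-specific pitch characteristics change.
y_sanitized = librosa.effects.pitch_shift(y, sr=sr, n_steps=3.0)

sf.write("query_sanitized.wav", y_sanitized, sr)     # transmit this instead
```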
  3. Lengthening and creaky voice are associated with prosodic finality in English. Listeners can use lengthening to identify both utterance-internal and final prosodic phrase boundaries and can use creak to locate utterance endings. Less is known about listeners' use of creak to locate internal prosodic boundaries and the relative importance assigned to duration and creak when both are present. Participants in two experiments segmented structurally ambiguous sentences in which duration and creak were manipulated to signal prosodic boundaries. When duration- and creak-based cues provided redundant information, their effects were additive. When these cues conflicted, the effect of creak was subtractive.
  4. In this paper, we explore automatic prediction of dialect density for the African American English (AAE) dialect, where dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We investigate several acoustic and language-modeling features, including the commonly used X-vector representation and the ComParE feature set, in addition to information extracted from ASR transcripts of the audio files and prosodic information. To address issues of limited labeled data, we use a weakly supervised model to project prosodic and X-vector features into low-dimensional, task-relevant representations. An XGBoost model is then used to predict the speaker's dialect density from these features and to show which of them are most significant during inference. We evaluate the utility of these features both alone and in combination for the given task. This work, which does not rely on hand-labeled transcripts, is performed on audio segments from the CORAAL database. We show a significant correlation between our predicted and ground-truth dialect density measures for AAE speech in this database and propose this work as a tool for explaining and mitigating bias in speech technology.
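A minimal sketch of the final prediction stage described above, assuming per-segment feature vectors (standing in for the projected prosodic and X-vector representations) and continuous dialect-density labels; the random data, dimensionality, and hyperparameters are placeholders, not the paper's configuration.

```python
# Sketch of an XGBoost regressor mapping per-segment feature vectors to a
# dialect-density score in [0, 1]. Random features stand in for the paper's
# actual low-dimensional representations.
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))                 # stand-in projected features
y = rng.uniform(0.0, 1.0, size=500)            # stand-in dialect-density labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
# Feature importances indicate which inputs drive the prediction, mirroring
# the paper's analysis of which features matter most during inference.
print("top-5 important feature indices:",
      np.argsort(model.feature_importances_)[::-1][:5])
```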