Title: Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data
This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances from a noisy, multi-speaker corpus for training voices, to exclude speech containing noise and to include speech close in nature to more traditionally-collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in Low Resource Languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.
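The abstract names concrete selection criteria (standard deviation of f0, fast speaking rate, hypo-articulation) but does not give an implementation. Purely as an illustrative sketch, and assuming librosa-based feature extraction, per-utterance plain-text transcripts, and made-up scoring weights (none of which come from the paper), utterance-level filtering along these lines might look like:

```python
# Illustrative sketch only: rank utterances by simple acoustic criteria
# (f0 variability and speaking rate) and keep the top fraction for TTS
# training. Feature choices, weights, and thresholds are assumptions,
# not values from the paper.
import numpy as np
import librosa

def utterance_features(wav_path, transcript):
    """Compute f0 standard deviation and a crude speaking-rate proxy."""
    y, sr = librosa.load(wav_path, sr=16000)
    duration = librosa.get_duration(y=y, sr=sr)
    # Fundamental frequency track; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    f0_std = float(np.nanstd(f0))
    # Speaking rate approximated as words per second from the transcript.
    rate = len(transcript.split()) / max(duration, 1e-6)
    return {"f0_std": f0_std, "speaking_rate": rate}

def select_utterances(corpus, top_fraction=0.3):
    """corpus: iterable of (wav_path, transcript); returns kept wav paths."""
    scored = []
    for wav_path, transcript in corpus:
        feats = utterance_features(wav_path, transcript)
        # Hypothetical combined score; the paper's actual criteria and
        # weighting are not reproduced here.
        score = feats["f0_std"] + 10.0 * feats["speaking_rate"]
        scored.append((score, wav_path))
    scored.sort(key=lambda item: item[0], reverse=True)
    keep = int(len(scored) * top_fraction)
    return [path for _, path in scored[:keep]]
```

Hypo-articulation has no simple one-line proxy; in practice it might be approximated with spectral tilt or vowel-space measures, which this sketch omits.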
Award ID(s):
1717680
PAR ID:
10097224
Author(s) / Creator(s):
Date Published:
Journal Name:
Interspeech 2017
Volume:
1
Page Range / eLocation ID:
3971-3975
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically-unpredictable sentences. This constitutes a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion. We found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful method to evaluate or pre-select voices in future work. A rough sketch of the mel-cepstral distortion computation appears after this list.
  2. This study tests speech-in-noise perception and social ratings of speech produced by different text-to-speech (TTS) synthesis methods. We used identical speaker training datasets for a set of 4 voices (using AWS Polly TTS), generated using neural and concatenative TTS. In Experiment 1, listeners identified target words in semantically predictable and unpredictable sentences in concatenative and neural TTS at two noise levels (-3 dB, -6 dB SNR). Correct word identification was lower for neural TTS than for concatenative TTS, at the lower SNR, and for semantically unpredictable sentences. In Experiment 2, listeners rated the voices on 4 social attributes. Neural TTS was rated as more human-like, natural, likeable, and familiar than concatenative TTS. Furthermore, how natural listeners rated the neural TTS voice was positively related to their speech-in-noise accuracy. Together, these findings show that the TTS method influences both intelligibility and social judgments of speech, and that these patterns are linked. Overall, this work contributes to our understanding of the nexus of speech technology and human speech perception. A generic recipe for mixing speech with noise at a target SNR is sketched after this list.
  3. Voice synthesis uses a voice model to synthesize arbitrary phrases. Advances in voice synthesis have made it possible to create an accurate voice model of a targeted individual, which can then in turn be used to generate spoofed audio in his or her voice. Generating an accurate voice model of a target’s voice requires the availability of a corpus of the target’s speech. This paper makes the observation that the increasing popularity of voice interfaces that use cloud-backed speech recognition (e.g., Siri, Google Assistant, Amazon Alexa) increases the public’s vulnerability to voice synthesis attacks. That is, our growing dependence on voice interfaces fosters the collection of our voices. As our main contribution, we show that voice recognition and voice accumulation (that is, the accumulation of users’ voices) are separable. This paper introduces techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing. In essence, such methods employ audio processing techniques to remove distinctive voice characteristics, leaving only the information that is necessary for the cloud-based services to perform speech recognition. Our preliminary experiments show that our defenses prevent state-of-the-art voice synthesis techniques from constructing convincing forgeries of a user’s speech, while still permitting accurate voice recognition. A toy illustration of local voice perturbation appears after this list.
  4. The performance of automatic speech recognition (ASR) systems severely degrades when multi-talker speech overlap occurs. In meeting environments, speech separation is typically performed to improve the robustness of ASR systems. Recently, location-based training (LBT) was proposed as a new training criterion for multi-channel talker-independent speaker separation. Assuming fixed array geometry, LBT outperforms widely-used permutation-invariant training in fully overlapped utterances and matched reverberant conditions. This paper extends LBT to conversational multi-channel speaker separation. We introduce multi-resolution LBT to estimate the complex spectrograms from low to high time and frequency resolutions. With multi-resolution LBT, convolutional kernels are assigned consistently based on speaker locations in physical space. Evaluation results show that multi-resolution LBT consistently outperforms other competitive methods on the recorded LibriCSS corpus. 
  5. An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system based on an alignment module that outputs pseudo-text and another synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS. 
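Related entry 1 above reports that mel-cepstral distortion (MCD) correlates strongly with human intelligibility judgments. The snippet below is only a rough illustration of the common frame-averaged MCD formula: it substitutes librosa MFCCs for true mel-cepstra and assumes the reference and synthesized signals are already time-aligned, whereas a real evaluation would typically align them with dynamic time warping.

```python
# Rough illustration of frame-averaged mel-cepstral distortion (MCD).
# Simplifications: MFCCs stand in for true mel-cepstra, and the two
# signals are assumed to be already time-aligned (no DTW).
import numpy as np
import librosa

def mel_cepstral_distortion(ref_wav, syn_wav, n_mfcc=13, sr=16000):
    ref, _ = librosa.load(ref_wav, sr=sr)
    syn, _ = librosa.load(syn_wav, sr=sr)
    ref_c = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)
    syn_c = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)
    # Truncate to the shorter signal as a crude stand-in for alignment.
    n = min(ref_c.shape[1], syn_c.shape[1])
    # Drop the 0th coefficient (overall energy), as is conventional for MCD.
    diff = ref_c[1:, :n] - syn_c[1:, :n]
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=0))
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))
```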
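Related entry 2 presents stimuli at fixed signal-to-noise ratios (-3 and -6 dB SNR). The function below is a generic recipe for mixing speech with noise at a target SNR, not the stimulus-preparation pipeline actually used in that study.

```python
# Generic mixing of a speech signal with noise at a target SNR in dB.
# This is a standard recipe, not the cited study's stimulus pipeline.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that the speech-to-noise power ratio equals snr_db."""
    # Loop or trim the noise to match the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_noise_power / noise_power)
```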
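Related entry 3 describes locally sanitizing voice input by removing distinctive voice characteristics before audio reaches a cloud recognizer. The abstract does not detail those defenses; the toy example below merely shifts pitch with librosa to suggest what one local perturbation step could look like, and should not be read as the authors' method.

```python
# Toy example of a local voice perturbation step (NOT the paper's defense):
# shift the pitch by a few semitones to mask the speaker's natural f0 range
# while leaving the spoken content largely intact for recognition.
import librosa
import soundfile as sf

def perturb_voice(in_wav, out_wav, n_steps=3.0):
    y, sr = librosa.load(in_wav, sr=None)  # keep the original sample rate
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_wav, y_shifted, sr)
```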