A Comparison of Speaker-based and Utterance-based Data Selection for Text-to-Speech Synthesis

Kai-Zhan Lee, Erica Cooper

DOI: 10.21437/Interspeech.2018-1313

Citation Details

Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically unpredictable sentences. This constitutes a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion. We found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful method to evaluate or pre-select voices in future work.
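For readers unfamiliar with the two measures named in the abstract, the following is a minimal sketch of how word error rate (WER) and mel-cepstral distortion (MCD) are commonly computed. It is not the authors' evaluation code. The function names are hypothetical, the WER function assumes plain whitespace tokenization, and the MCD function assumes the two frame sequences are already time-aligned (for example by dynamic time warping, which is not shown) and that coefficient 0 (energy) is excluded, as is conventional.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over frames that are assumed to be time-aligned.

    Each frame is a sequence of mel-cepstral coefficients; index 0 is skipped.
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("frame sequences must be non-empty and equal in length")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        squared_diff = sum((a - b) ** 2 for a, b in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)
```

In a listening test, the hypothesis passed to `word_error_rate` would be a listener's transcription of the synthesized sentence and the reference would be the text that was synthesized; lower values indicate a more intelligible voice for both measures.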

Award ID(s): 1717680

PAR ID: 10097223

Author(s) / Creator(s): Kai-Zhan Lee, Erica Cooper

Date Published: 2018-01-01

Journal Name: Interspeech 2018

Volume: 12873-2877

Page Range / eLocation ID: 2873-2877

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.21437/Interspeech.2018-1313