skip to main content

Title: Subset Selection, Adaptation, Gemination and Prosody Prediction for Amharic Text-to-Speech Synthesis
While large TTS corpora exist for commercial sys- tems created for high-resource languages such as Man- darin, English, and Spanish, for many languages such as Amharic, which are spoken by millions of people, this is not the case. We are working with “found” data collected for other purposes (e.g. training ASR systems) or avail- able on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages which do not currently have expensive, commercial systems. This study describes TTS systems built for Amharic from “found” data and includes systems built from di erent acoustic-prosodic subsets of the data, systems built from combined high and lower quality data using adaptation, and systems which use prediction of Amharic gemination to improve naturalness as perceived by evaluators.
Authors:
; ; ; ;
Award ID(s):
1717680
Publication Date:
NSF-PAR ID:
10174744
Journal Name:
10th ISCA Speech Synthesis Workshop
Page Range or eLocation-ID:
205 to 210
Sponsoring Org:
National Science Foundation
More Like this
  1. Extensive TTS corpora exist for commercial systems created for high-resource languages such as Mandarin, English, and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from ""found"" data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages (LRLs) which do not currently have expensive, commercial systems. This study investigates whether traditional TTS speakers do exhibit significantly less variation and bettermore »speaking characteristics than speakers in ""found"" genres. By examining characteristics of f0, energy, speaking rate, articulation, NHR, jitter, and shimmer in ""found” genres and comparing these to traditional TTS corpora, we have found that TTS recordings are indeed characterized by low mean pitch, standard deviation of energy, speaking rate, and level of articulation, and low mean and standard deviations of shimmer and NHR; in a number of respects these are quite similar to some ""found” genres. By identifying similarities and differences, we are able to identify objective methods for selecting ""found"" data to build TTS systems for LRLs.« less
  2. Extensive TTS corpora exist for commercial systems cre- ated for high-resource languages such as Mandarin, English, and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from “found” data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, au- diobooks) to produce TTS systems for low-resource languages (LRLs) which do not currently have expensive, commercial sys- tems. This study investigates whether traditional TTS speakers do exhibit significantly lessmore »variation and better speaking char- acteristics than speakers in found genres. By examining char- acteristics of f0, energy, speaking rate, articulation, NHR, jit- ter, and shimmer in found genres and comparing these to tra- ditional TTS corpora, We have found that TTS recordings are indeed characterized by low mean pitch, standard deviation of energy, speaking rate, and level of articulation, and low mean and standard deviations of shimmer and NHR; in a number of respects these are quite similar to some found genres. By iden- tifying similarities and differences, we are able to identify ob- jective methods for selecting found data to build TTS systems for LRLs.« less
  3. Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically-unpredictable sentences. This constitutes a significant improvement over our baseline voice trained onmore »the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion. We found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful method to evaluate or pre-select voices in future work.« less
  4. This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances from a noisy, multi-speaker corpus for training voices, to exclude speech containing noise and to include speech close in nature to more traditionally-collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in Low Resource Languages (LRLs) rapidly and easily, without the expense ofmore »recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.« less
  5. Research software, which includes both source code and executables used as part of the research process, presents a significant challenge for efforts aimed at ensuring reproducibility. In order to inform such efforts, we conducted a survey to better understand the characteristics of research software as well as how it is created, used, and shared by researchers. Based on the responses of 215 participants, representing a range of research disciplines, we found that researchers create, use, and share software in a wide variety of forms for a wide variety of purposes, including data collection, data analysis, data visualization, data cleaning andmore »organization, and automation. More participants indicated that they use open source software than commercial software. While a relatively small number of programming languages (e.g., Python, R, JavaScript, C++, MATLAB) are used by a large number, there is a long tail of languages used by relatively few. Between-group comparisons revealed that significantly more participants from computer science write source code and create executables than participants from other disciplines. Differences between researchers from computer science and other disciplines related to the knowledge of best practices of software creation and sharing were not statistically significant. While many participants indicated that they draw a distinction between the sharing and preservation of software, related practices and perceptions were often not aligned with those of the broader scholarly communications community.

    « less