Characteristics of Text-to-Speech and Other Corpora
Extensive TTS corpora exist for commercial systems created for high-resource languages such as Mandarin, English, and Japanese. Speakers recorded for these corpora are typically instructed to maintain constant f0, energy, and speaking rate and are recorded in ideal acoustic environments, producing clean, consistent audio. We have been developing TTS systems from “found” data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages (LRLs) which do not currently have expensive, commercial systems. This study investigates whether traditional TTS speakers do exhibit significantly less variation and better speaking characteristics than speakers in found genres. By examining characteristics of f0, energy, speaking rate, articulation, NHR, jitter, and shimmer in found genres and comparing these to traditional TTS corpora, we have found that TTS recordings are indeed characterized by low mean pitch, standard deviation of energy, speaking rate, and level of articulation, and low mean and standard deviations of shimmer and NHR; in a number of respects these are quite similar to some found genres. By identifying similarities and differences, we are able to identify objective methods for selecting found data to build TTS systems for LRLs.
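As a rough illustration of the kind of per-speaker measures named in the abstract, the sketch below computes the mean and standard deviation of f0 plus the commonly used "local" definitions of jitter and shimmer. It is not the authors' pipeline: the function names, the convention that unvoiced frames are marked with 0, and the assumption that glottal periods and peak amplitudes have already been extracted by some other tool are all hypothetical choices made for this example.

```python
from statistics import mean, stdev


def f0_summary(f0_hz):
    """Mean and sample standard deviation of f0 over voiced frames.

    Assumption for this sketch: unvoiced frames are encoded as 0.
    """
    voiced = [f for f in f0_hz if f > 0]
    return mean(voiced), stdev(voiced)


def local_jitter(periods):
    """Mean absolute difference between consecutive glottal periods,
    divided by the mean period (the usual 'local' jitter ratio)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return mean(diffs) / mean(periods)


def local_shimmer(amplitudes):
    """Mean absolute difference between consecutive peak amplitudes,
    divided by the mean amplitude (the usual 'local' shimmer ratio)."""
    diffs = [abs(a - b) for a, b in zip(amplitudes, amplitudes[1:])]
    return mean(diffs) / mean(amplitudes)
```

Comparing such summaries across corpora (for example, a studio TTS corpus against audiobook or broadcast speech) is one way the similarities and differences described above could be quantified.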
- Award ID(s): 1717680
- PAR ID: 10058672
- Date Published:
- Journal Name: Proceedings of Speech Prosody 2018
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances from a noisy, multi-speaker corpus for training voices, to exclude speech containing noise and to include speech close in nature to more traditionally-collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in Low Resource Languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.
- Subset Selection, Adaptation, Gemination and Prosody Prediction for Amharic Text-to-Speech Synthesis
  While large TTS corpora exist for commercial systems created for high-resource languages such as Mandarin, English, and Spanish, for many languages such as Amharic, which are spoken by millions of people, this is not the case. We are working with “found” data collected for other purposes (e.g. training ASR systems) or available on the web (e.g. news broadcasts, audiobooks) to produce TTS systems for low-resource languages which do not currently have expensive, commercial systems. This study describes TTS systems built for Amharic from “found” data and includes systems built from different acoustic-prosodic subsets of the data, systems built from combined high and lower quality data using adaptation, and systems which use prediction of Amharic gemination to improve naturalness as perceived by evaluators.
- Are written corpora useful for phonological research? Word frequency lists for low-resource languages have become ubiquitous in recent years [@Crubadan]. For many languages there is direct correspondence between their written forms and their alphabets, but it is not clear whether written corpora can adequately represent language use. We use 15 low-resource languages and compare several information-theoretic properties across three corpus types. We show that despite differences in origin and genre, estimates in one corpus are highly correlated with estimates in other corpora.
- Period-doubled voice consists of two alternating periods with multiple frequencies and is often perceived as rough with an indeterminate pitch. Past pitch-matching studies in period-doubled voice found that the perceived pitch was lower as the degree of amplitude and frequency modulation between the two alternating periods increased. The perceptual outcome also differed across f0s and modulation types: a lower f0 prompted earlier identification of a lower pitch, and the matched pitch dropped more quickly in frequency- than amplitude-modulated tokens (Sun & Xu, 2002; Bergan & Titze, 2001). However, it is unclear how listeners perceive period doubling when identifying linguistic tones. In an artificial language learning paradigm, this study used resynthesized stimuli with alternating amplitudes and/or frequencies of varying degrees, based on a production study of period-doubled voice (Huang, 2022). Listeners were native speakers of English and Mandarin. We confirm the positive relationship between the modulation degree and the proportion of low tones heard, and find that frequency modulation biased listeners to choose more low-tone options than amplitude modulation. However, a higher f0 (300 Hz) leads to a low-tone percept in more amplitude-modulated tokens than a lower f0 (200 Hz). Both English and Mandarin listeners behaved similarly, suggesting that pitch perception during period doubling is not language-specific. Furthermore, period doubling is predicted to signal low tones in languages, even when the f0 is high.