Acoustic word embeddings are fixed-dimensional representations of variable-length speech segments. In settings where unlabelled speech is the only available resource, such embeddings can be used in "zero-resource" speech search, indexing and discovery systems. Here we propose to train a single supervised embedding model on labelled data from multiple well-resourced languages and then apply it to unseen zero-resource languages. For this transfer learning approach, we consider two multilingual recurrent neural network models: a discriminative classifier trained on the joint vocabularies of all training languages, and a correspondence autoencoder trained to reconstruct word pairs. We test these using a word discrimination task on six target zero-resource languages. When trained on seven well-resourced languages, both models perform similarly and outperform unsupervised models trained on the zero-resource languages. With just a single training language, the second model works better, but performance depends more on the particular training–testing language pair.
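As a rough illustration of the first (classifier) variant, the sketch below uses PyTorch to encode a variable-length feature sequence with a GRU and classify it over a joint multilingual vocabulary. The layer sizes, feature dimension, and vocabulary size are placeholders, not the paper's settings, and the correspondence-autoencoder variant is not shown.

```python
import torch
import torch.nn as nn

class MultilingualAWE(nn.Module):
    """Recurrent acoustic word embedding model: a GRU encoder maps a
    variable-length feature sequence to a fixed-dimensional embedding,
    and a linear layer classifies it over the joint multilingual vocabulary."""
    def __init__(self, feat_dim=13, embed_dim=128, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, embed_dim, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(embed_dim, vocab_size)

    def embed(self, feats):
        # feats: (batch, time, feat_dim); the final hidden state is the embedding
        _, hidden = self.encoder(feats)
        return hidden[-1]

    def forward(self, feats):
        return self.classifier(self.embed(feats))

# Toy usage: embed a small batch of (padded) acoustic segments.
model = MultilingualAWE()
segments = torch.randn(2, 80, 13)   # 80 frames of 13-dim MFCC-like features
embeddings = model.embed(segments)  # (2, 128) fixed-dimensional vectors
logits = model(segments)            # word classification logits
```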
DeepFry: Identifying Vocal Fry Using Deep Neural Networks
Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch. It occurs in diverse languages and is prevalent in American English, where it is used not only to mark phrase finality, but also sociolinguistic factors and affect. Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems, particularly for languages where creak is frequently used. This paper proposes a deep learning model to detect creaky voice in fluent speech. The model is composed of an encoder and a classifier trained together. The encoder takes the raw waveform and learns a representation using a convolutional neural network. The classifier is implemented as a multi-headed fully-connected network trained to detect creaky voice, voicing, and pitch, where the last two are used to refine creak prediction. The model is trained and tested on speech of American English speakers, annotated for creak by trained phoneticians. We evaluated the performance of our system using two encoders: one is tailored for the task, and the other is based on a state-of-the-art unsupervised representation. Results suggest our best-performing system has improved recall and F1 scores compared to previous methods on unseen data.
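The encoder/classifier structure described above can be sketched roughly as follows (PyTorch assumed; kernel sizes, strides, and hidden dimensions are illustrative placeholders rather than the paper's configuration).

```python
import torch
import torch.nn as nn

class CreakDetector(nn.Module):
    """Sketch of the described architecture: a convolutional encoder over the
    raw waveform followed by a multi-headed fully-connected classifier for
    creak, voicing, and pitch, where voicing and pitch act as auxiliary tasks."""
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160),  # ~25 ms windows, 10 ms hop at 16 kHz
            nn.ReLU(),
            nn.Conv1d(64, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.creak_head = nn.Linear(hidden, 1)
        self.voicing_head = nn.Linear(hidden, 1)
        self.pitch_head = nn.Linear(hidden, 1)

    def forward(self, wav):
        # wav: (batch, samples) raw waveform
        h = self.encoder(wav.unsqueeze(1)).transpose(1, 2)  # (batch, frames, hidden)
        return {
            "creak": self.creak_head(h).squeeze(-1),
            "voicing": self.voicing_head(h).squeeze(-1),
            "pitch": self.pitch_head(h).squeeze(-1),
        }

model = CreakDetector()
outputs = model(torch.randn(2, 16000))  # one second of 16 kHz audio per item
```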
- Award ID(s): 1944773
- PAR ID: 10391456
- Date Published:
- Journal Name: Interspeech 2022
- Page Range / eLocation ID: 3578 to 3582
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Phrase-level prosodic prominence in American English is understood, in the AM tradition, to be marked by pitch accents. While such prominences are characterized via tonal labels in ToBI (e.g. H*), their cues are not exclusively in the pitch domain: timing, loudness and voice quality are known to contribute to prominence perception. All of these cues occur with a wide degree of variability in naturally produced speech, and this variation may be informative. In this study, we advance towards a system of explicit labelling of individual cues to prosodic structure, here focusing on phrase-level prominence. We examine correlations between the presence of a set of 6 cues to prominence (relating to segment duration, loudness, and non-modal phonation, in addition to f0) and pitch accent labels in a corpus of ToBI-labelled American English speech. Results suggest that tokens with more cues are more likely to receive a pitch accent label.
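A toy illustration of this kind of analysis (NumPy, with randomly generated cue and accent annotations standing in for the real corpus labels): count how many of the six cues are present on each token and tabulate the proportion of pitch-accented tokens at each cue count. With real annotations, the finding above corresponds to this proportion rising with the count.

```python
import numpy as np

# Hypothetical per-token data: binary presence of six prominence cues
# (duration, loudness, non-modal phonation, f0-related) and a binary
# pitch-accent label per token. Random values here are placeholders.
rng = np.random.default_rng(0)
cues = rng.integers(0, 2, size=(500, 6))   # 500 tokens x 6 cues
accented = rng.integers(0, 2, size=500)    # 1 = token carries a pitch accent label

cue_counts = cues.sum(axis=1)              # number of cues present per token

# Proportion of accented tokens at each cue count.
for k in range(7):
    mask = cue_counts == k
    if mask.any():
        print(k, accented[mask].mean())
```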
Period-doubled voice consists of two alternating periods with multiple frequencies and is often perceived as rough with an indeterminate pitch. Past pitch-matching studies in period-doubled voice found that the perceived pitch was lower as the degree of amplitude and frequency modulation between the two alternating periods increased. The perceptual outcome also differed across f0s and modulation types: a lower f0 prompted earlier identification of a lower pitch, and the matched pitch dropped more quickly in frequency- than amplitude-modulated tokens (Sun & Xu, 2002; Bergan & Titze, 2001). However, it is unclear how listeners perceive period doubling when identifying linguistic tones. In an artificial language learning paradigm, this study used resynthesized stimuli with alternating amplitudes and/or frequencies of varying degrees, based on a production study of period-doubled voice (Huang, 2022). Listeners were native speakers of English and Mandarin. We confirm the positive relationship between the modulation degree and the proportion of low tones heard, and find that frequency modulation biased listeners to choose more low-tone options than amplitude modulation. However, a higher f0 (300 Hz) leads to a low-tone percept in more amplitude-modulated tokens than a lower f0 (200 Hz). Both English and Mandarin listeners behaved similarly, suggesting that pitch perception during period doubling is not language-specific. Furthermore, period doubling is predicted to signal low tones in languages, even when the f0 is high.
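A rough sketch of how alternating-period stimuli of this kind can be parameterized (NumPy): every other glottal pulse is scaled in amplitude and/or shifted in period by a chosen modulation degree. This is an illustrative pulse-train model, not the resynthesis procedure used in the study.

```python
import numpy as np

def period_doubled_pulses(f0=200.0, amp_mod=0.3, freq_mod=0.1, n_periods=40):
    """Generate glottal-pulse times and amplitudes with alternating periods:
    every other pulse is reduced in amplitude by `amp_mod` and/or lengthened
    in period by `freq_mod`, producing a long-short, strong-weak alternation."""
    base_period = 1.0 / f0
    times, amps = [], []
    t = 0.0
    for i in range(n_periods):
        times.append(t)
        amps.append(1.0 - amp_mod if i % 2 else 1.0)
        period = base_period * (1.0 + freq_mod if i % 2 else 1.0 - freq_mod)
        t += period
    return np.array(times), np.array(amps)

# At a 200 Hz base f0 with moderate modulation, alternating long/short periods
# emerge, which listeners may match to a pitch roughly an octave lower.
times, amps = period_doubled_pulses(f0=200.0, amp_mod=0.3, freq_mod=0.1)
```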
We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.
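The transfer recipe can be sketched as follows (PyTorch assumed; module names, sizes, and the placeholder decoder are illustrative): pre-train an encoder-decoder on ASR, then copy the encoder parameters into the ST model before fine-tuning on the small ST corpus.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder; only the encoder (acoustic model) matters here."""
    def __init__(self, feat_dim=80, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)  # placeholder decoder
        self.out = nn.Linear(hidden, vocab)

# Pre-train on high-resource ASR (training loop not shown), then build the ST
# model with its own target-text vocabulary and copy over the encoder weights.
asr_model = Seq2Seq(vocab=1000)   # e.g. English ASR
st_model = Seq2Seq(vocab=8000)    # e.g. Spanish-English ST

# Transfer only the encoder; the abstract finds it accounts for most of the gain.
st_model.encoder.load_state_dict(asr_model.encoder.state_dict())
```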
Building on previous work in subset selection of training data for text-to-speech (TTS), this work compares speaker-level and utterance-level selection of TTS training data, using acoustic features to guide selection. We find that speaker-based selection is more effective than utterance-based selection, regardless of whether selection is guided by a single feature or a combination of features. We use US English telephone data collected for automatic speech recognition to simulate the conditions of TTS training on low-resource languages. Our best voice achieves a human-evaluated WER of 29.0% on semantically-unpredictable sentences. This constitutes a significant improvement over our baseline voice trained on the same amount of randomly selected utterances, which performed at 42.4% WER. In addition to subjective voice evaluations with Amazon Mechanical Turk, we also explored objective voice evaluation using mel-cepstral distortion. We found that this measure correlates strongly with human evaluations of intelligibility, indicating that it may be a useful method to evaluate or pre-select voices in future work.
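For reference, mel-cepstral distortion between time-aligned reference and synthesized cepstra is commonly computed as below (NumPy; frame alignment, e.g. by dynamic time warping, is assumed to have been done already, and the exact variant used in the paper may differ).

```python
import numpy as np

def mel_cepstral_distortion(ref, syn):
    """Mean frame-wise mel-cepstral distortion in dB between time-aligned
    reference and synthesized cepstra of shape (frames, n_coeffs),
    excluding the 0th (energy) coefficient, as is conventional."""
    diff = ref[:, 1:] - syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()

# Toy usage with random, already-aligned cepstral frames.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 25))
syn = ref + rng.normal(scale=0.1, size=(200, 25))
print(mel_cepstral_distortion(ref, syn))
```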