NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Cross-lingual articulatory feature information transfer for speech recognition using recurrent progressive neural networks

https://doi.org/10.21437/Interspeech.2022-11202

Morshed, Mahir; Hasegawa-Johnson, Mark (September 2022, Proceedings of Interspeech 2022)

A system for the lateral transfer of information from end-to-end neural networks recognizing articulatory feature classes to similarly structured networks recognizing phone tokens is here proposed. The system connects recurrent layers of feature detectors pre-trained on a base language to recurrent layers of a phone recognizer for a different target language, this inspired primarily by the progressive neural network scheme. Initial experiments used detectors trained on Bengali speech for four articulatory feature classes—consonant place, consonant manner, vowel height, and vowel backness—attached to phone recognizers for four other Asian languages (Javanese, Nepali, Sinhalese, and Sundanese). While these do not currently suggest consistent performance improvements across different low-resource settings for target languages, irrespective of their genealogic or phonological relatedness to Bengali, they do suggest the need for further trials with different language sets, altered data sources and data configurations, and slightly altered network setups.
more » « less
Full Text Available
Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

https://doi.org/10.21437/Interspeech.2022-816

Ni, Junrui; Wang, Liming; Gao, Heting; Qian, Kaizhi; Zhang, Yang; Chang, Shiyu; Hasegawa-Johnson, Mark (September 2022, Proc. Interspeech 2022)

An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system based on an alignment module that outputs pseudo-text and another synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS.
more » « less
Full Text Available
Discovering phonetic inventories with crosslingual automatic speech recognition

https://doi.org/10.1016/j.csl.2022.101358

Żelasko, Piotr; Feng, Siyuan; Moro Velázquez, Laureano; Abavisani, Ali; Bhati, Saurabhchand; Scharenborg, Odette; Hasegawa-Johnson, Mark; Dehak, Najim (July 2022, Computer Speech & Language)

Full Text Available
Unsupervised Speech Segmentation and Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding

https://doi.org/10.1109/TASLP.2022.3180684

Bhati, Saurabhchand; Villalba, Jesus; Zelasko, Piotr; Moro-Velazquez, Laureano; Dehak, Najim (January 2022, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
Self-supervised Semantic-driven Phoneme Discovery for Zero-resource Speech Recognition

https://doi.org/10.18653/v1/2022.acl-long.553

Wang, Liming; Feng, Siyuan; Hasegawa-Johnson, Mark; Yoo, Chang (January 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))
Muresan, Smaranda; Nakov, Preslav; Villavicencio, Aline (Ed.)
Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definition of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of phoneme inventory with raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms.
more » « less
Full Text Available
How Phonotactics Affect Multilingual and Zero-Shot ASR Performance

https://doi.org/10.1109/ICASSP39728.2021.9414478

Feng, Siyuan; Zelasko, Piotr; Moro-Velazquez, Laureano; Abavisani, Ali; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim (June 2021, ICASSP)
null (Ed.)
The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable.
more » « less
Full Text Available
Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

https://doi.org/10.1109/ICASSP39728.2021.9414418

Wang, Liming; Wang, Xinsheng; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim (June 2021, ICASSP)
null (Ed.)
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for a MWD system to learn meaningful word-level representation. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and5% alignment F1 scores respectively.
more » « less
Full Text Available
That Sounds Familiar: An Analysis of Phonetic Representations Transfer Across Languages

https://doi.org/10.21437/Interspeech.2020-2513

Żelasko, Piotr; Moro-Velázquez, Laureano; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim (October 2020, Interspeech)
null (Ed.)
Only a handful of the world’s languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a mul-tilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations.In this work, we focus on gaining a deeper understanding ofhow general these representations might be, and how individual phones are getting improved in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize the International Phonetic Alphabet (IPA) token sequences. We ob-serve significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of the target language training data tremendously reduces ASR error rates.Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages - an encouraging result for the low-resource speech community
more » « less
Full Text Available
A DNN-HMM-DNN Hybrid Model for Discovering Word-Like Units from Spoken Captions and Image Regions

https://doi.org/10.21437/Interspeech.2020-1148

Wang, Liming; Hasegawa-Johnson, Mark (October 2020, Interspeech)
null (Ed.)
Discovering word-like units without textual transcriptions is an important step in low-resource speech technology. In this work,we demonstrate a model inspired by statistical machine translation and hidden Markov model/deep neural network (HMM-DNN) hybrid systems. Our learning algorithm is capable of discovering the visual and acoustic correlates of distinct words in an unknown language by simultaneously learning the map-ping from image regions to concepts (the first DNN), the map-ping from acoustic feature vectors to phones (the second DNN),and the optimum alignment between the two (the HMM). In the simulated low-resource setting using MSCOCO and Speech-COCO datasets, our model achieves 62.4 % alignment accuracy and outperforms the audio-only segmental embedded GMM approach on standard word discovery evaluation metrics.
more » « less
Full Text Available
Deep F-Measure Maximization for End-to-End Speech Understanding

https://doi.org/10.21437/Interspeech.2020-1949

Sarı, Leda; Hasegawa-Johnson, Mark (October 2020, Interspeech)
null (Ed.)
Spoken language understanding (SLU) datasets, like many other machine learning datasets, usually suffer from the label imbalance problem. Label imbalance usually causes the learned model to replicate similar biases at the output which raises the issue of unfairness to the minority classes in the dataset. In this work, we approach the fairness problem by maximizing the F-measure instead of accuracy in neural network model training.We propose a differentiable approximation to the F-measure and train the network with this objective using standard back-propagation. We perform experiments on two standard fairness datasets, Adult, and Communities and Crime, and also on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset. In all four of these tasks, F-measure maximization results in improved micro-F1 scores, with absolute improvements of up to8% absolute, as compared to models trained with the cross-entropy loss function. In the two multi-class SLU tasks, the proposed approach significantly improves class coverage, i.e.,the number of classes with positive recall.
more » « less
Full Text Available

« Prev Next »

Search for: All records