On the contributions of visual and textual supervision in low-resource semantic speech retrieval
Recent work has shown that speech paired with images can be used to learn semantically meaningful speech representations even without any textual supervision. In real-world low-resource settings, however, we often have access to some transcribed speech. We study whether and how visual grounding is useful in the presence of varying amounts of textual supervision. In particular, we consider the task of semantic speech retrieval in a low-resource setting. We use a previously studied data set and task, where models are trained on images with spoken captions and evaluated on human judgments of semantic relevance. We propose a multitask learning approach to leverage both visual and textual modalities, with visual supervision in the form of keyword probabilities from an external tagger. We find that visual grounding is helpful even in the presence of textual supervision, and we analyze this effect over a range of sizes of transcribed data sets. With ∼5 hours of transcribed speech, we obtain 23% higher average precision when also using visual supervision.
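The abstract describes a multitask objective that combines visual supervision (soft keyword probabilities produced by an external image tagger on the paired image) with textual supervision (bag-of-words targets from whatever transcriptions are available). The sketch below illustrates one plausible way such a combined loss could be set up for a keyword-prediction speech encoder; the architecture, pooling, loss weighting (`alpha`), and all names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of joint visual + textual supervision for a speech
# encoder; module and variable names are illustrative only.
import torch
import torch.nn as nn

class SpeechKeywordModel(nn.Module):
    """Maps an acoustic feature sequence to per-keyword logits."""
    def __init__(self, n_mel=40, hidden=256, vocab_size=1000):
        super().__init__()
        self.encoder = nn.GRU(n_mel, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)  # one logit per keyword

    def forward(self, feats):            # feats: (batch, time, n_mel)
        out, _ = self.encoder(feats)
        pooled = out.mean(dim=1)         # simple temporal pooling
        return self.head(pooled)         # keyword logits

bce = nn.BCEWithLogitsLoss()

def multitask_loss(logits, visual_tags, text_bow, alpha=0.5):
    """Weighted sum of the two supervision signals.

    visual_tags: soft keyword probabilities from an external image tagger
                 applied to the paired image (available for every utterance).
    text_bow:    binary bag-of-words targets from transcriptions
                 (available only for the transcribed subset, else None).
    """
    loss_vis = bce(logits, visual_tags)
    loss_txt = bce(logits, text_bow) if text_bow is not None else 0.0
    return alpha * loss_vis + (1 - alpha) * loss_txt
```

At retrieval time, the sigmoid of the logit for a query keyword can be used to rank utterances, and average precision would then be computed against the human semantic-relevance judgments mentioned in the abstract.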
- Award ID(s): 1816627
- Publication Date:
- NSF-PAR ID: 10108193
- Journal Name: Interspeech 2019
- ISSN: 2308-457X
- Sponsoring Org: National Science Foundation
More Like this
- Discovering word-like units without textual transcriptions is an important step in low-resource speech technology. In this work, we demonstrate a model inspired by statistical machine translation and hidden Markov model/deep neural network (HMM-DNN) hybrid systems. Our learning algorithm is capable of discovering the visual and acoustic correlates of distinct words in an unknown language by simultaneously learning the mapping from image regions to concepts (the first DNN), the mapping from acoustic feature vectors to phones (the second DNN), and the optimum alignment between the two (the HMM). In a simulated low-resource setting using the MSCOCO and Speech-COCO datasets, our model achieves 62.4% …
- This work deals with the challenge of learning and reasoning over language and vision data for related downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel cross-modality relevance module that is used in an end-to-end framework to learn the relevance representation between components of various input modalities under the supervision of a target task, which is more generalizable to unobserved data compared to merely reshaping the original representation space. In addition to modeling the relevance between the textual entities and visual entities, we model the higher-order relevance between entity …
- We propose a learning system in which language is grounded in visual percepts without specific pre-defined categories of terms. We present a unified generative method to acquire a shared semantic/visual embedding that enables the learning of language about a wide range of real-world objects. We evaluate the efficacy of this learning by predicting the semantics of objects and comparing the performance with neural and non-neural inputs. We show that this generative approach exhibits promising results in language grounding without pre-specifying visual categories, even in low-resource settings. Our experiments demonstrate that this approach is generalizable to multilingual, highly varied datasets.
- We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks …
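The last entry above describes a transfer recipe: pre-train on high-resource ASR, then reuse the trained encoder for low-resource ST. The sketch below shows one minimal way such a recipe could look in PyTorch; the model classes, hyperparameters, and checkpoint path are illustrative assumptions, not that paper's actual implementation.

```python
# Hypothetical sketch of ASR pre-training followed by ST fine-tuning;
# classes, sizes, and the checkpoint path are illustrative placeholders.
import torch
import torch.nn as nn

class Encoder(nn.Module):                      # acoustic encoder (shared across tasks)
    def __init__(self, n_mel=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mel, hidden, num_layers=3, batch_first=True)
    def forward(self, feats):                  # feats: (batch, time, n_mel)
        out, _ = self.rnn(feats)
        return out

class Seq2Seq(nn.Module):                      # encoder + task-specific decoder
    def __init__(self, vocab_size, hidden=256):
        super().__init__()
        self.encoder = Encoder(hidden=hidden)
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)  # toy decoder, no attention shown
        self.proj = nn.Linear(hidden, vocab_size)
    def forward(self, feats):
        enc = self.encoder(feats)
        dec, _ = self.decoder(enc)
        return self.proj(dec)

# 1) Pre-train on the high-resource ASR task, then keep only the encoder weights.
asr_model = Seq2Seq(vocab_size=5000)
# ... train asr_model on ASR data here ...
torch.save(asr_model.encoder.state_dict(), "asr_encoder.pt")

# 2) Initialise the ST model's encoder from the ASR checkpoint and fine-tune
#    all parameters on the small ST training set.
st_model = Seq2Seq(vocab_size=8000)            # target-language vocabulary
st_model.encoder.load_state_dict(torch.load("asr_encoder.pt"))
optimizer = torch.optim.Adam(st_model.parameters(), lr=1e-4)
```

The design choice mirrored here is that only the encoder weights are transferred, while the decoder and output projection are re-initialised for the new target vocabulary, consistent with the ablation finding that the pre-trained encoder accounts for most of the gain.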