NSF PAR Search | NSF Public Access Repository


Multimodal Language Learning for Object Retrieval in Low Data Regimes in the Face of Missing Modalities

Darvish, Kasra; Raff, Edward; Ferraro, Francis; Matuszek, Cynthia (October 2023, Transactions on machine learning research)

Our study is motivated by robotics: when dealing with robots or other physical systems, we often need to balance the competing concerns of relying on complex, multimodal data from a variety of sensors against a general lack of large representative datasets. Despite the complexity of modern robotic platforms and the need for multimodal interaction, there has been little research on integrating more than two modalities in a low data regime with the real-world constraint that sensors fail due to obstructions or adverse conditions. In this work, we consider a case in which natural language is used as a retrieval query against objects, represented across multiple modalities, in a physical environment. We introduce extended multimodal alignment (EMMA), a method that learns to select the appropriate object while jointly refining modality-specific embeddings through a geometric (distance-based) loss. In contrast to prior work, our approach can incorporate an arbitrary number of views (modalities) of a particular piece of data. We demonstrate the efficacy of our model on a grounded language object retrieval scenario. We show that our model outperforms state-of-the-art baselines when little training data is available. Our code is available at https://github.com/kasraprime/EMMA
Full Text Available
A Spoken Language Dataset of Descriptions for Speech-Based Grounded Language Learning

Kebe, Gaoussou Y.; Higgins, Padraig; Jenkins, Patrick; Darvish, Kasra; Sachdeva, Rishabh; Barron, Ryan; Winder, John; Engel, Don; Raff, Edward; Ferraro, Francis; et al (December 2021, Advances in neural information processing systems)

Grounded language acquisition is a major area of research combining aspects of natural language processing, computer vision, and signal processing, compounded by domain issues requiring sample efficiency and other deployment constraints. In this work, we present a multimodal dataset of RGB+depth objects with spoken as well as textual descriptions. We analyze the differences between the two types of descriptive language and our experiments demonstrate that the different modalities affect learning. This will enable researchers studying the intersection of robotics, NLP, and HCI to better investigate how the multiple modalities of image, depth, text, speech, and transcription interact, as well as how differences in the vernacular of these modalities impact results.
Full Text Available
Learning Object Attributes with Category-Free Grounded Language from Deep Featurization

Richards, Luke E.; Darvish, Kasra; Matuszek, Cynthia (September 2020, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems)
While grounded language learning, or learning the meaning of language with respect to the physical world in which a robot operates, is a major area in human-robot interaction studies, most research occurs in closed worlds or domain-constrained settings. We present a system in which language is grounded in visual percepts without using categorical constraints by combining CNN-based visual featurization with natural language labels. We demonstrate results comparable to those achieved using handcrafted features for specific traits, a step towards moving language grounding into the space of fully open world recognition.
Full Text Available
Practical Cross-Modal Manifold Alignment for Robotic Grounded Language Learning

https://doi.org/10.1109/CVPRW53098.2021.00177

Nguyen, Andre T.; Richards, Luke E.; Kebe, Gaoussou Youssouf; Raff, Edward; Darvish, Kasra; Ferraro, Francis; Matuszek, Cynthia (June 2021, IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2021)

We propose a cross-modality manifold alignment procedure that leverages triplet loss to jointly learn consistent, multi-modal embeddings of language-based concepts of real-world items. Our approach learns these embeddings by sampling triples of anchor, positive, and negative data points from RGB-depth images and their natural language descriptions. We show that our approach can benefit from, but does not require, post-processing steps such as Procrustes analysis, in contrast to some of our baselines which require it for reasonable performance. We demonstrate the effectiveness of our approach on two datasets commonly used to develop robotic-based grounded language learning systems, where our approach outperforms four baselines, including a state-of-the-art approach, across five evaluation metrics.
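The triplet loss named in this abstract can be illustrated with a minimal sketch. This is the generic textbook form of the loss, not the authors' implementation: the Euclidean distance, the margin value, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: a language description as the anchor and two
# image embeddings, one matching the description and one not.
language_anchor = [0.0, 0.0]
matching_image = [3.0, 4.0]   # distance 5 from the anchor
other_image = [6.0, 8.0]      # distance 10 from the anchor

well_separated = triplet_loss(language_anchor, matching_image, other_image)
violated = triplet_loss(language_anchor, other_image, matching_image)
```

In this toy case the first triple already satisfies the margin and contributes no loss, while the swapped triple is penalized; training on such triples pulls matching cross-modal pairs together and pushes mismatched pairs apart.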
Full Text Available
Towards Making Virtual Human-Robot Interaction a Reality

Higgins, Padraig; Kebe, Gaoussou Youssouf; Berlier, Adam; Darvish, Kasra; Engel, Don; Ferraro, Francis; Matuszek, Cynthia (March 2021, Proc. of the 3rd International Workshop on Virtual, Augmented, and Mixed-Reality for Human-Robot Interactions (VAM-HRI))

For robots deployed in human-centric spaces, natural language promises an intuitive, natural interface. However, obtaining appropriate training data for grounded language in a variety of settings is a significant barrier. In this work, we describe using human-robot interactions in virtual reality to train a robot, combining fully simulated sensing and actuation with human interaction. We present the architecture of our simulator and our grounded language learning approach, then describe our intended initial experiments.
Full Text Available
