Title: Learning Object Attributes with Category-Free Grounded Language from Deep Featurization
While grounded language learning, or learning the meaning of language with respect to the physical world in which a robot operates, is a major area in human-robot interaction studies, most research occurs in closed worlds or domain-constrained settings. We present a system in which language is grounded in visual percepts without using categorical constraints by combining CNN-based visual featurization with natural language labels. We demonstrate results comparable to those achieved using handcrafted features for specific traits, a step towards moving language grounding into the space of fully open world recognition.
Award ID(s):
1657469 1637937
NSF-PAR ID:
10208631
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A major goal of grounded language learning research is to enable robots to connect language predicates to a robot’s physical interactive perception of the world. Coupling object exploratory behaviors such as grasping, lifting, and looking with multiple sensory modalities (e.g., audio, haptics, and vision) enables a robot to ground non-visual words like “heavy” as well as visual words like “red”. A major limitation of existing approaches to multi-modal language grounding is that a robot has to exhaustively explore training objects with a variety of actions when learning such a new language predicate. This paper proposes a method for guiding a robot’s behavioral exploration policy when learning a novel predicate based on known grounded predicates and the novel predicate’s linguistic relationship to them. We demonstrate our approach on two datasets in which a robot explored large sets of objects and was tasked with learning to recognize whether novel words applied to those objects.
  2. We propose a learning system in which language is grounded in visual percepts without specific pre-defined categories of terms. We present a unified generative method to acquire a shared semantic/visual embedding that enables the learning of language about a wide range of real-world objects. We evaluate the efficacy of this learning by predicting the semantics of objects and comparing the performance with neural and non-neural inputs. We show that this generative approach exhibits promising results in language grounding without pre-specifying visual categories under low resource settings. Our experiments demonstrate that this approach is generalizable to multilingual, highly varied datasets. 
  3. In natural language processing, most models try to learn semantic representations merely from texts. The learned representations encode the “distributional semantics” but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action and the brain encodes “grounded semantics” for cognition. Inspired by this notion and recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the model’s language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space manifests principal dimensions explainable with human intuition and neurobiological knowledge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, texts, or their combinations.
  4. For robots deployed in human-centric spaces, natural language promises an intuitive, natural interface. However, obtaining appropriate training data for grounded language in a variety of settings is a significant barrier. In this work, we describe using human-robot interactions in virtual reality to train a robot, combining fully simulated sensing and actuation with human interaction. We present the architecture of our simulator and our grounded language learning approach, then describe our intended initial experiments. 
  5. There has been substantial work in recent years on grounded language acquisition, in which a model is learned that relates linguistic constructs to the perceivable world. While powerful, this approach is frequently hindered by ambiguities and omissions found in natural language. One such omission is the lack of negative descriptions of objects. We describe an unsupervised system that learns visual classifiers associated with words, using semantic similarity to automatically choose negative examples from a corpus of perceptual and linguistic data. We evaluate the effectiveness of each stage as well as the system's performance on the overall learning task. 
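
As an illustration of the idea in item 5 (using semantic similarity to pick negative examples for a word when descriptions never say what an object is not), here is a minimal sketch. It is not the method of any of the listed papers: bag-of-words cosine similarity stands in for whatever semantic similarity measure those authors use, and the function names, object ids, and toy descriptions are all hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts for one description (a stand-in representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def choose_negatives(word, descriptions, k=1):
    """Return ids of the k objects whose descriptions are least similar,
    on average, to the descriptions of objects labeled with `word`."""
    vectors = {oid: bag_of_words(text) for oid, text in descriptions.items()}
    positives = [oid for oid, vec in vectors.items() if word in vec]
    if not positives:
        return []
    candidates = [oid for oid in vectors if oid not in positives]

    def mean_similarity(oid):
        return sum(cosine(vectors[oid], vectors[p]) for p in positives) / len(positives)

    # Least similar first: these are the safest guesses for "not <word>".
    return sorted(candidates, key=mean_similarity)[:k]


# Hypothetical toy corpus: object id -> free-text description.
descriptions = {
    "obj1": "a red apple",
    "obj2": "a red tomato",
    "obj3": "a yellow banana",
    "obj4": "blue cube",
}

negatives = choose_negatives("red", descriptions, k=1)
```

The selected negatives could then be paired with the positive objects to train a per-word visual classifier; how the listed systems actually do this is not specified in the abstracts above.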