Title: Neural Variational Learning for Grounded Language Acquisition
We propose a learning system in which language is grounded in visual percepts without specific pre-defined categories of terms. We present a unified generative method to acquire a shared semantic/visual embedding that enables the learning of language about a wide range of real-world objects. We evaluate the efficacy of this learning by predicting the semantics of objects and comparing the performance with neural and non-neural inputs. We show that this generative approach exhibits promising results in language grounding without pre-specifying visual categories under low resource settings. Our experiments demonstrate that this approach is generalizable to multilingual, highly varied datasets.
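The core of a variational approach like the one described is a probabilistic latent embedding trained with the reparameterization trick and a KL regularizer. The sketch below is illustrative only, not the authors' model: the linear encoders `W_mu` and `W_lv`, the toy feature vector, and the latent dimensionality are all assumptions standing in for learned networks over visual percepts.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar, rng):
    # Sample z = mu + sigma * eps with eps ~ N(0, I); in a full autodiff
    # framework this keeps the sampling path differentiable.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# Toy "visual percept" feature vector, projected into a shared latent
# space by hypothetical linear encoders (stand-ins for trained networks).
visual = rng.standard_normal(8)
W_mu = rng.standard_normal((4, 8))
W_lv = rng.standard_normal((4, 8))

mu, logvar = W_mu @ visual, W_lv @ visual
z = reparameterize(mu, logvar, rng)   # a sample in the shared embedding

print(z.shape)  # (4,)
```

In training, a reconstruction term over both language and visual observations would be added to the KL term to form the evidence lower bound; language and vision are "shared" by decoding both modalities from the same `z`.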
Award ID(s): 2024878, 1940931, 1657469, 1813223
NSF-PAR ID: 10296639
Journal Name: IEEE ROMAN
ISSN: 1944-9437
Sponsoring Org: National Science Foundation
More Like This
  1. Learning and recognition can be improved by sorting novel items into categories and subcategories. Such hierarchical categorization is easy when it can be performed according to learned rules (e.g., "if car, then automatic or stick shift" or "if boat, then motor or sail"). Here, we present results showing that human participants acquire categorization rules for new visual hierarchies rapidly, and that, as they do, corresponding hierarchical representations of the categorized stimuli emerge in patterns of neural activation in the dorsal striatum and in posterior frontal and parietal cortex. Participants learned to categorize novel visual objects into a hierarchy with superordinate and subordinate levels based on the objects' shape features, without having been told the categorization rules for doing so. On each trial, participants were asked to report the category and subcategory of the object, after which they received feedback about the correctness of their categorization responses. Participants trained over the course of a one-hour-long session while their brain activation was measured using functional magnetic resonance imaging. Over the course of training, significant hierarchy learning took place as participants discovered the nested categorization rules, as evidenced by the occurrence of a learning trial, after which performance suddenly increased. This learning was associated with increased representational strength of the newly acquired hierarchical rules in a corticostriatal network including the posterior frontal and parietal cortex and the dorsal striatum. We also found evidence suggesting that reinforcement learning in the dorsal striatum contributed to hierarchical rule learning.
  2. How does STEM knowledge learned in school change students' brains? Using fMRI, we presented photographs of real-world structures to engineering students with classroom-based knowledge and hands-on lab experience, examining how their brain activity differentiated them from their "novice" peers not pursuing engineering degrees. A data-driven MVPA and machine-learning approach revealed that neural response patterns of engineering students were convergent with each other and distinct from novices' when considering physical forces acting on the structures. Furthermore, informational network analysis demonstrated that the distinct neural response patterns of engineering students reflected relevant concept knowledge: learned categories of mechanical structures. Information about mechanical categories was predominantly represented in bilateral anterior ventral occipitotemporal regions. Importantly, mechanical categories were not explicitly referenced in the experiment, nor does visual similarity between stimuli account for mechanical category distinctions. The results demonstrate how learning abstract STEM concepts in the classroom influences neural representations of objects in the world.
  3. Current text classification methods typically require a large number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans can perform classification without seeing any labeled examples, based only on a small set of words describing the categories to be classified. In this paper, we explore the potential of using only the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets, including topic and sentiment classification, without using any labeled documents, learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name.
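The three-step recipe in the abstract above (expand label names into related words, classify by category-indicative words, then self-train) can be caricatured with a word-overlap classifier. This is a toy sketch, not the paper's method: the `seed_words` lexicons stand in for words a pre-trained language model would associate with each label name, and the self-training step here just expands lexicons from confident predictions.

```python
# Hypothetical seed lexicons standing in for the semantically related
# words a pre-trained language model would produce for each label name.
seed_words = {
    "sports": {"game", "team", "score", "coach"},
    "politics": {"election", "senate", "vote", "policy"},
}

def predict(doc, lexicons):
    # Score each class by overlap with its category-indicative words.
    tokens = set(doc.lower().split())
    scores = {label: len(tokens & words) for label, words in lexicons.items()}
    return max(scores, key=scores.get)

def self_train(unlabeled_docs, lexicons):
    # Crude stand-in for self-training: expand each class lexicon with
    # the words of documents the current model assigns to that class.
    expanded = {label: set(words) for label, words in lexicons.items()}
    for doc in unlabeled_docs:
        expanded[predict(doc, lexicons)] |= set(doc.lower().split())
    return expanded

docs = ["the team won the game", "the senate passed a vote"]
lex = self_train(docs, seed_words)
print(predict("a close game for the team", lex))  # sports
```

The real method replaces the set intersections with a masked language model's contextual predictions, but the supervision signal is the same: nothing beyond the label names themselves.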
  4. We present a scalable approach for Detecting Objects by transferring Common-sense Knowledge (DOCK) from source to target categories. In our setting, the training data for the source categories have bounding box annotations, while those for the target categories only have image-level annotations. Current state-of-the-art approaches focus on image-level visual or semantic similarity to adapt a detector trained on the source categories to the new target categories. In contrast, our key idea is to (i) use similarity not at the image-level, but rather at the region-level, and (ii) leverage richer common-sense cues (based on attributes, spatial relationships, etc.) to guide the algorithm towards learning the correct detections. We acquire such common-sense cues automatically from readily-available knowledge bases without any extra human effort. On the challenging MS COCO dataset, we find that common-sense knowledge can substantially improve detection performance over existing transfer-learning baselines.
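The shift from image-level to region-level similarity described above amounts to scoring each region proposal against source-category knowledge rather than comparing whole images. The sketch below is a hedged illustration, not DOCK itself: the region embedding, source-category embeddings, and the `commonsense_prior` values are invented placeholders for features and knowledge-base cues the real system would learn or retrieve.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings: source (box-annotated) categories and one
# candidate region proposal cropped from a target-category image.
source_categories = {"dog": [1.0, 0.1, 0.0], "car": [0.0, 1.0, 0.2]}
region_embedding = [0.9, 0.2, 0.1]

# Hypothetical common-sense prior, e.g. knowledge-base evidence for how
# relevant each source category's detector is to the target class.
commonsense_prior = {"dog": 0.8, "car": 0.1}

def region_score(region, sources, prior):
    # Region-level visual similarity, modulated by a common-sense cue,
    # instead of a single image-level similarity.
    return {c: cosine(region, emb) * prior[c] for c, emb in sources.items()}

scores = region_score(region_embedding, source_categories, commonsense_prior)
best = max(scores, key=scores.get)
print(best)  # dog
```

The design point is that the prior multiplies a per-region score, so common-sense evidence can promote or suppress individual detections rather than whole images.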
  5. In this paper we propose a new framework, MoViLan (Modular Vision and Language), for execution of visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark data sets revealed the gap in developing comprehensive techniques for long-horizon, compositional tasks (involving manipulation and navigation) with diverse object categories, realistic instructions, and visual scenarios with non-reversible state changes. We propose a modular approach to deal with the combined navigation and object interaction problem without the need for strictly aligned vision and language training data (e.g., in the form of expert-demonstrated trajectories). Such an approach is a significant departure from the traditional end-to-end techniques in this space and allows for a more tractable training process with separate vision and language data sets. Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments and a language understanding model generalized for household instruction following. We demonstrate a significant increase in success rates for long-horizon, compositional tasks over recent works on the recently released benchmark data set ALFRED.