Title: Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning
In natural language processing, most models try to learn semantic representations merely from texts. The learned representations encode the “distributional semantics” but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes “grounded semantics” for cognition. Inspired by this notion and recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the model’s language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space manifests principal dimensions explainable with human intuition and neurobiological knowledge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, texts, or their combinations.
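The cross-modal contrastive alignment mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the symmetric InfoNCE-style form below, the function names, and the `temperature` value are all assumptions chosen for illustration, not the authors' implementation. Matched image-text pairs share an index, and every other pairing in the batch acts as a negative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax_at(scores, idx):
    """Log-probability of entry idx under a softmax over scores (numerically stable)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[idx] - log_sum

def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form; temperature is hypothetical).

    image_embs[i] and text_embs[i] are treated as a matched pair.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: cosine similarity of image i and text j, divided by temperature
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Image-to-text direction: each image should pick out its own caption.
    img_to_txt = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image direction: each caption should pick out its own image.
    txt_to_img = -sum(log_softmax_at([sim[i][j] for i in range(n)], j)
                      for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)

# Toy 2-D embeddings standing in for the outputs of the two streams.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is small when each image embedding points the same way as its paired text embedding and large when the pairs are swapped, which is the pressure that pulls the two streams into a joint space.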
Journal Name: Advances in neural information processing systems
Sponsoring Org: National Science Foundation
More Like this
  1. A major goal of grounded language learning research is to enable robots to connect language predicates to a robot’s physical interactive perception of the world. Coupling object exploratory behaviors such as grasping, lifting, and looking with multiple sensory modalities (e.g., audio, haptics, and vision) enables a robot to ground non-visual words like “heavy” as well as visual words like “red”. A major limitation of existing approaches to multi-modal language grounding is that a robot has to exhaustively explore training objects with a variety of actions when learning a new such language predicate. This paper proposes a method for guiding a robot’s behavioral exploration policy when learning a novel predicate based on known grounded predicates and the novel predicate’s linguistic relationship to them. We demonstrate our approach on two datasets in which a robot explored large sets of objects and was tasked with learning to recognize whether novel words applied to those objects.
  2. We propose a learning system in which language is grounded in visual percepts without specific pre-defined categories of terms. We present a unified generative method to acquire a shared semantic/visual embedding that enables the learning of language about a wide range of real-world objects. We evaluate the efficacy of this learning by predicting the semantics of objects and comparing the performance with neural and non-neural inputs. We show that this generative approach exhibits promising results in language grounding without pre-specifying visual categories under low resource settings. Our experiments demonstrate that this approach is generalizable to multilingual, highly varied datasets.
  3. Symmetry is ubiquitous in nature, in logic and mathematics, and in perception, language, and thought. Although humans are exquisitely sensitive to visual symmetry (e.g., of a butterfly), symmetry in natural language goes beyond visuospatial properties: many words point to abstract concepts with symmetrical content (e.g., equal, marry). For example, if Mark marries Bill, then Bill marries Mark. In both cases (vision and language), symmetry may be formally characterized as invariance under transformation. Is this a coincidence, or is there some deeper psychological resemblance? Here we asked whether representations of symmetry correspond across language and vision. To do so, we developed a novel cross-modal matching paradigm. On each trial, participants observed a visual stimulus (either symmetrical or non-symmetrical) and had to choose between a symmetrical and non-symmetrical English predicate unrelated to the stimulus (e.g., “negotiate” vs. “propose”). In a first study with visual events (symmetrical collision or asymmetrical launch), participants reliably chose the predicate matching the event’s symmetry. A second study showed that this “language-vision correspondence” generalized to objects, and was weakened when the stimuli’s binary nature was made less apparent (i.e., for one object, rather than two inward-facing objects). A final study showed the same effect when nonsigners guessed English translations of signs from American Sign Language, which expresses many symmetrical concepts spatially. Taken together, our findings support the existence of an abstract representation of symmetry which humans access via both perceptual and linguistic means. More broadly, this work sheds light on the rich, structured nature of the language-cognition interface.
  4. Anwer, Nabil (Ed.)
    Design documentation is presumed to contain massive amounts of valuable information and expert knowledge that is useful for learning from the past successes and failures. However, the current practice of documenting design in most industries does not result in big data that can support a true digital transformation of enterprise. Very little information on concepts and decisions in early product design has been digitally captured, and the access and retrieval of them via taxonomy-based knowledge management systems are very challenging because most rule-based classification and search systems cannot concurrently process heterogeneous data (text, figures, tables, references). When experts retire or leave a design unit, industry often cannot benefit from past knowledge for future product design, and is left to reinvent the wheel repeatedly. In this work, we present AI-based Natural Language Processing (NLP) models which are trained for contextually representing technical documents containing texts, figures and tables, to do a semantic search for the retrieval of relevant data across large corpora of documents. By connecting textual and non-textual data through the use of an associative database, the semantic search question-answering system we developed can provide more comprehensive answers in the context of users’ questions. For the demonstration and assessment of this model, the semantic search question-answering system is applied to the Intergovernmental Panel on Climate Change (IPCC) Special Report 2019, which is more than 600 pages long and difficult to read and understand, even by most experts. Users can input custom queries relating to climate change concerns and receive evidence from the report that is contextually meaningful. We expect this method can transform current repositories of design documentation of heterogeneous data forms into structured knowledge-bases which can return relevant information efficiently as well as can evolve to embody manageable big data for the true digital transformation of design.
  5. We propose a cross-modality manifold alignment procedure that leverages triplet loss to jointly learn consistent, multi-modal embeddings of language-based concepts of real-world items. Our approach learns these embeddings by sampling triples of anchor, positive, and negative data points from RGB-depth images and their natural language descriptions. We show that our approach can benefit from, but does not require, post-processing steps such as Procrustes analysis, in contrast to some of our baselines which require it for reasonable performance. We demonstrate the effectiveness of our approach on two datasets commonly used to develop robotic-based grounded language learning systems, where our approach outperforms four baselines, including a state-of-the-art approach, across five evaluation metrics.
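The triplet loss named in item 5 can be sketched in a few lines. This is the standard hinge form over Euclidean distances, given here as an assumption: that abstract does not state the distance function or margin, and the names `triplet_loss` and `margin` are illustrative only. The anchor could be an image embedding, the positive its own description, and the negative a description of a different item.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge triplet loss (assumed form; margin is a hypothetical value).

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Positive coincides with the anchor and the negative is far away: no penalty.
satisfied = triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# Roles reversed: the positive is far and the negative coincides with the anchor.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

Unlike a batch-wide contrastive loss, each triplet compares the anchor against a single sampled negative, so how triples are sampled matters in practice.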