<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021 December</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10356055</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in neural information processing systems</title>
<idno>1049-5258</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Minkyu Choi Yizhen Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In natural language processing, most models try to learn semantic representations merely from texts. The learned representations encode the “distributional semantics” but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action and the brain encodes “grounded semantics” for cognition. Inspired by this notion and recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a Bert-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the model’s language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space manifests principal dimensions explainable with human intuition and neurobiological knowledge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, texts, or their combinations.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Humans take much longer to name the color of a printed color word when the color and the word mismatch (e.g., "red" shown in green) than when they match (e.g., "red" shown in red) <ref type="bibr">[1]</ref>. This effect is one example of the rich psychological evidence suggesting that humans learn language by grounding meanings to knowledge about the world <ref type="bibr">[2,</ref><ref type="bibr">3]</ref>. In contrast, most models in natural language processing (NLP) <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref> encode "distributional semantics" <ref type="bibr">[8]</ref> learned from texts only. Put yourself in the place of a machine in a thought experiment, the "Chinese Room Argument" <ref type="bibr">[9]</ref>. Imagine that you have to learn Chinese from scratch as your first language. All that you have is a Chinese-to-Chinese dictionary. You might be able to relate a word to other words based on textual distributions. It is, however, impossible to learn word meanings without any additional explanation in reference to the physical world <ref type="bibr">[10]</ref>.</p><p>A language model may learn concepts from texts paired with sensory data, such as images. Joint vision-language learning has been explored for image captioning <ref type="bibr">[11]</ref>, visual question answering <ref type="bibr">[12]</ref>, and pre-training vision models with weak supervision <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>. In line with these studies, we train a language model and a vision model jointly to match images and texts. We further analyze the semantic space obtained with the visually grounded language model. In this space, semantic embeddings are found to be organized and clustered by visual attributes, predictive of human-defined norms of semantic features, and useful for compositional language understanding and cross-modal image search. 
We expect this visually grounded language model to also be useful for understanding the computational basis of grounded cognition <ref type="bibr">[15,</ref><ref type="bibr">16]</ref>.</p><p>Figure <ref type="figure">1</ref>: Visual grounding of natural language (see Section 3.1). The visual and language streams take an image and its caption as input, respectively. The inner-product between the visual feature maps and the contextual word embeddings forms the 3D match-map that highlights the matching between visual and language content. The similarity score calculated from the match-map (see Eq. 1) is used to evaluate the cross-modal contrastive loss.</p><p>2 Background and Related Work</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Distributional vs. grounding hypothesis</head><p>In the distributional hypothesis <ref type="bibr">[17]</ref>, words that occur in similar contexts carry similar meanings. This hypothesis has motivated influential machine learning models to learn word embeddings from large text corpora <ref type="bibr">[4,</ref><ref type="bibr">18]</ref>. However, the learned word embeddings are not straightforward to interpret <ref type="bibr">[8,</ref><ref type="bibr">19]</ref>. Alternatively, the symbol grounding hypothesis suggests that a word is connected to its meaning by relating to its referent in the physical world <ref type="bibr">[10,</ref><ref type="bibr">20]</ref>. In line with this hypothesis, earlier studies demonstrate that visual features or contexts can enhance language learning <ref type="bibr">[21]</ref><ref type="bibr">[22]</ref><ref type="bibr">[23]</ref><ref type="bibr">[24]</ref><ref type="bibr">[25]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Vision-language learning</head><p>Grounding language in vision has been of increasing interest in computational linguistics and machine learning. A common strategy is to fuse words with related visual information in terms of perceptual norms <ref type="bibr">[21]</ref>, bag-of-visual-words <ref type="bibr">[23,</ref><ref type="bibr">26]</ref>, or learnable visual features <ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref>. The models used for vision-language fusion evolve alongside those for NLP, such as Latent Dirichlet Allocation (LDA) <ref type="bibr">[23]</ref>, log-bilinear models <ref type="bibr">[31]</ref>, Skip-gram models <ref type="bibr">[25,</ref><ref type="bibr">27]</ref>, and recurrent neural networks <ref type="bibr">[28]</ref>. Generally, visual grounding may refine the distribution and interpretability of language representations <ref type="bibr">[23,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref> and facilitate cross-modal tasks <ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref>. More recent work has begun to use transformers <ref type="bibr">[6,</ref><ref type="bibr">34]</ref> for vision-language learning, showing strong performance in cross-modal tasks <ref type="bibr">[35]</ref><ref type="bibr">[36]</ref><ref type="bibr">[37]</ref><ref type="bibr">[38]</ref>.</p><p>Contrastive learning <ref type="bibr">[39,</ref><ref type="bibr">40]</ref> is increasingly applied to not only unimodal data <ref type="bibr">[41]</ref> but also multimodal data <ref type="bibr">[13,</ref><ref type="bibr">42]</ref>. 
It can learn better representations than alternative prediction or classification objectives <ref type="bibr">[43]</ref>. However, cross-modal contrastive learning is still under-explored for higher-level tasks, e.g., visual question answering <ref type="bibr">[12]</ref>, visual reasoning <ref type="bibr">[44]</ref>, and scene graph generation <ref type="bibr">[45]</ref>. Such tasks involve abstract reasoning about the relations between entities (e.g., visual objects). Prior work approaches relational inference with multi-layer perceptrons <ref type="bibr">[46,</ref><ref type="bibr">47]</ref> or graph neural networks <ref type="bibr">[48]</ref><ref type="bibr">[49]</ref><ref type="bibr">[50]</ref>. Arguably, a more compelling idea <ref type="bibr">[51]</ref> is to model entities as vectors in a continuous space and to model their relations as arithmetic operators (linear <ref type="bibr">[52]</ref><ref type="bibr">[53]</ref><ref type="bibr">[54]</ref> or bilinear <ref type="bibr">[55,</ref><ref type="bibr">56]</ref>) applied to the vector representations of those entities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Relation to prior work</head><p>In this work, we first build a two-stream model to jointly learn visual and language representations from image-caption pairs, similar to recent work <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>. We then finetune the learned model by adding a cross-modal attention layer <ref type="bibr">[35,</ref><ref type="bibr">36]</ref> and bilinear operators <ref type="bibr">[55]</ref> to represent the relations between visual objects. Both stages use a cross-modal contrastive loss. Related to our work, Harwath et al. match visual objects to spoken words using a triplet loss <ref type="bibr">[42]</ref>. Earlier this year, Jia et al. <ref type="bibr">[13]</ref> and Radford and Kim et al. <ref type="bibr">[14]</ref> used contrastive learning to pretrain a vision model on a massive image-text dataset and demonstrated largely improved zero-shot transfer learning performance on visual and cross-modal tasks. Different from their perspectives, we focus on assessing the language encoders and word representations. Specifically, we perform a systematic evaluation of the semantic space grounded in vision vs. the ungrounded semantic space learned from texts only. This evaluation is possible because, after training, the language and visual streams in our model are fully separable as stand-alone systems, unlike some vision-language models that require both visual and textual input to be usable <ref type="bibr">[35,</ref><ref type="bibr">36]</ref>. Our goal is to assess how visual grounding affects the distribution of textual representations by analyzing the distribution of word embeddings in the grounded semantic space, in line with related works <ref type="bibr">[23,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">57]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Approach</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Visual grounding of natural language</head><p>To learn visually grounded language representations, we develop a model (Fig. <ref type="figure">1</ref>) that combines a stand-alone visual stream and a stand-alone language stream. The visual stream is based on VGG16 <ref type="bibr">[58]</ref> with an additional linear transformation as an embedder to match the feature dimension of the language stream and an additional multi-head self-attention layer <ref type="bibr">[59]</ref> to enforce global information aggregation and learn long-range dependency. The language stream is based on Bert <ref type="bibr">[6]</ref>. Using separate linear transformation heads <ref type="bibr">[41]</ref>, the outputs from the visual stream and the language stream are projected to a common representational space. In this common space, the inner-product between the visual representation V at every location and the language representation L of every word gives rise to a 3D match-map, where each element indicates how a word in the text matches each location in the image (see illustration in Fig. <ref type="figure">1</ref>). The sum of the maximal match is the similarity score S(V, L) between an image and a text. See Eq. 1, where i, j indicate the location in the 2D image feature map V and k indicates the k-th word in L.</p><p>Extending the unimodal normalized temperature-scaled cross-entropy (NT-Xent) loss <ref type="bibr">[13,</ref><ref type="bibr">41,</ref><ref type="bibr">60]</ref>, we define the cross-modal contrastive loss using the anchor sample from one modality and the positive sample and negative samples from the other modality <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>. As such, we define and sum two loss functions: one with the anchor sample from images and the positive/negative samples from texts, and the other with the anchor sample from texts and the positive/negative samples from images.</p><p>For Loss_l in Eq. 2, the anchor sample V_i is an input image and the positive sample L_i is the corresponding image caption, whereas the negative samples L_j are unmatched textual descriptions included in the same batch (B is the batch size). Similarly, Loss_v in Eq. 2 is defined to contrast the positive and negative image samples against an anchor textual sample.</p></div>
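<div xmlns="http://www.tei-c.org/ns/1.0"><p>The similarity score and the two-way contrastive loss described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that Eq. 1 sums, over words, the maximal inner product across image locations, and that Eq. 2 is a temperature-scaled softmax cross-entropy averaged over the batch. The function names and the temperature value tau = 0.1 are hypothetical.</p>
<p><![CDATA[
```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity(V, L):
    """S(V, L): for each word, take its best-matching image location, then sum over words.

    V is an H x W grid of d-dimensional visual vectors; L is a list of K word vectors.
    (Assumed reading of Eq. 1: sum over k of the max over i, j.)
    """
    locations = [v for row in V for v in row]
    return sum(max(dot(v, word) for v in locations) for word in L)


def log_softmax_at(scores, target, tau):
    """Log-probability of scores[target] under a temperature-scaled softmax."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return scaled[target] - log_z


def cross_modal_nt_xent(images, texts, tau=0.1):
    """Sum of the two contrastive losses over a batch of paired (image, text) features."""
    B = len(images)
    S = [[similarity(images[i], texts[j]) for j in range(B)] for i in range(B)]
    # Image anchor against text candidates (rows of S).
    loss_l = -sum(log_softmax_at(S[i], i, tau) for i in range(B)) / B
    # Text anchor against image candidates (columns of S).
    loss_v = -sum(
        log_softmax_at([S[i][j] for i in range(B)], j, tau) for j in range(B)
    ) / B
    return loss_l + loss_v


# Toy batch: two 1 x 2 feature maps and two one-word captions (made-up numbers).
img0 = [[[1.0, 0.0], [0.0, 0.0]]]
img1 = [[[0.0, 1.0], [0.0, 0.0]]]
txt0 = [[1.0, 0.0]]
txt1 = [[0.0, 1.0]]
matched_loss = cross_modal_nt_xent([img0, img1], [txt0, txt1])
swapped_loss = cross_modal_nt_xent([img0, img1], [txt1, txt0])
```
]]></p>
<p>In this toy batch the correctly paired captions give a much smaller loss than the swapped captions, which is the behavior the contrastive objective is meant to reward.</p></div>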
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Visual grounding of object relations</head><p>We further finetune the model for visual relation prediction, as illustrated in Fig. <ref type="figure">2</ref>. In this stage, we remove the linear transformation heads in Fig. <ref type="figure">1</ref> and add a multi-head cross-modal attention module <ref type="bibr">[35,</ref><ref type="bibr">36]</ref>. The attention module uses a query based on the embedding of an object word from the language stream (Query_L) and uses keys (Key_V) and values (Value_V) from every location in the feature map output from the visual stream. The attention score is calculated as the inner-product of Query_L and Key_V followed by softmax. The attention-weighted sum of Value_V is concatenated across 8 attention heads to generate a visually grounded object representation.</p><p>For visual relation prediction, we also use contrastive learning with two loss functions by taking either relation embeddings or subject/object representations as positive/negative samples.</p><p>In Loss_rel, K_rel is the set that contains all available relations. The anchor sample is a pair of subject and object in an image. The positive sample is the embedding of the ground truth relation. The negative samples are the embeddings of all other relations. In Loss_obj, the anchor sample is a given relation. The positive sample is a subject-object pair that holds this relation. The negative samples are other subject-object pairs that hold a different relation. For both loss functions, the positive and negative samples are drawn from the same batch B.</p><p>In addition, we add a classification head (two fully connected layers with ReLU in between) and apply it to the grounded object representation. We use object classification as an auxiliary objective (with a cross-entropy loss) to constrain the grounded object representation to be separable across objects for classification.</p></div>
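<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the attention step concrete, the following is a rough single-head sketch in plain Python, under stated assumptions rather than a description of the actual model. It follows the text in applying softmax directly to the query-key inner products (no scaling factor is assumed). The bilinear score is one hypothetical way to apply a bilinear operator to a subject-object pair, with one matrix W per relation; the equations that define the authors' operator are not reproduced in this text, and all function names are illustrative.</p>
<p><![CDATA[
```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Single-head attention: softmax over query-key inner products,
    then a weighted sum of the value vectors (one key/value per image location)."""
    scores = [dot(query, k) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def bilinear_score(subject, W, obj):
    """Bilinear form subject^T W obj (hypothetical parameterization of a relation)."""
    return sum(
        subject[i] * W[i][j] * obj[j]
        for i in range(len(subject))
        for j in range(len(obj))
    )


# Toy example: a word query attends over two image locations (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
grounded = attend([10.0, 0.0], keys, values)  # dominated by the first location
uniform = attend([0.0, 0.0], keys, values)    # equal weights on both locations
```
]]></p>
<p>A multi-head version would run several such heads with separate projections and concatenate their outputs, as the text describes for 8 heads.</p></div>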
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Training and Testing</head><p>We train the model in three stages to progressively refine it with increasingly demanding tasks.</p><p>In the first stage, we pretrain the visual and language streams separately as image and text encoders. The language stream is the pretrained Bert<ref type="foot">foot_1</ref> used as the baseline model for subsequent experiments. The visual stream is pretrained for object classification with ImageNet <ref type="bibr">[61]</ref>. Relative to the baseline CNN, the inclusion of self-attention improves the top-1 classification accuracy from 71.6% to 74.3% on the ImageNet validation dataset. The attention module also renders the classification more robust when the input image is partially occluded (see details in Appendix A.1).</p><p>In the second stage, we refine the pretrained language and visual streams by matching texts to images with contrastive learning on the MS COCO dataset <ref type="bibr">[11]</ref>, as illustrated in Fig. <ref type="figure">1</ref>. While freezing other layers, we refine the self-attention layer in the visual stream and the top k layers in Bert (by default k = 8). As five captions are available for each image, we randomly sample one caption per image in each iteration. Earlier grounding (larger k) tends to support better image-text retrieval performance (see details in Appendix A.2).</p><p>In the third stage, we further finetune the model for visual relation prediction as illustrated in Fig. <ref type="figure">2</ref>.</p><p>We refine the visual self-attention layer and the top l layers in Bert (by default l = 2) based on the Visual Genome dataset <ref type="bibr">[45]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Principal components of grounded semantic representations</head><p>To evaluate the visually grounded semantic space, we use the language stream as a stand-alone model to extract the output representations of commonly used English words in the SemCat dataset (9,197 words; 100 word categories) <ref type="bibr">[62]</ref>. Details about how word representations are extracted from the language stream are explained in Appendix B. We apply principal component analysis to the representations of all the words studied here and examine the top components as the principal dimensions of the grounded semantic space. Interestingly, the first principal dimension is readily interpretable as an abstract-to-concrete axis (Fig. <ref type="figure">3</ref>). For example, words with the highest values on this axis are ostrich, seagull, albatross, blender, pelican, broccoli, parakeet, lettuce, sailboat, vegetables, whereas words with the lowest values are displeasure, liking, to, outgoing, present, experienced, profitable, faithful, meaningful, multitude.</p><p>The representations of words along this axis are significantly correlated with human ratings of their concreteness (ranging from 1 to 5) from a prior study <ref type="bibr">[63]</ref> (Fig. <ref type="figure">3</ref>). The Pearson correlation coefficient reaches 0.8749 across word categories and 0.6615 across individual words; after grounding with object relations, r = 0.8001 for categories and r = 0.6948 for words. In contrast, the principal axis of the ungrounded semantic space learned from the baseline Bert model is not straightforward to interpret and shows a weak correlation with human ratings of concreteness (Table 1). Other principal components are also intuitively interpretable. For example, PC 2 captures the human vs. non-human axis, PC 3 captures the scene vs. object axis, PC 4 captures the natural vs. artificial axis, PC 5 captures the indoor vs. outdoor axis, and PC 6 highlights words related to food. See results about other principal components in Appendix B.1.</p></div>
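<div xmlns="http://www.tei-c.org/ns/1.0"><p>The analysis above (project word embeddings onto the first principal component, then correlate the projections with concreteness ratings) can be sketched as follows. This is an illustrative plain-Python sketch with made-up numbers, not the evaluation code used in the paper. The power-iteration routine, its iteration count, and the function names are assumptions. The sign of a principal axis is arbitrary, so only the magnitude of the correlation is meaningful here.</p>
<p><![CDATA[
```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def first_pc_scores(X, iters=200):
    """Project mean-centered rows of X onto the top principal axis.

    The axis is found by power iteration on X^T X; this assumes the starting
    vector is not orthogonal to the top axis, which holds for the toy data below.
    """
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mean[j] for j in range(d)] for row in X]
    axis = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [dot_row(r, axis) for r in Xc]
        new = [sum(proj[i] * Xc[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in new))
        axis = [v / norm for v in new]
    return [dot_row(r, axis) for r in Xc]


def dot_row(row, axis):
    """Inner product of one data row with the current axis estimate."""
    return sum(a * b for a, b in zip(row, axis))


# Toy "embeddings" (4 words, 2 dimensions) and toy concreteness ratings on a 1-5 scale.
embeddings = [[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 0.05]]
ratings = [1.0, 2.0, 3.0, 4.0]
scores = first_pc_scores(embeddings)
r = pearson(scores, ratings)
```
]]></p>
<p>With real embeddings, a library routine for principal component analysis would normally replace the power iteration; the correlation step is unchanged.</p></div>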
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Relation to human-defined norms of semantic features</head><p>We further ask whether the visually grounded word embeddings are amenable to binary semantic features defined by humans <ref type="bibr">[25,</ref><ref type="bibr">64]</ref>. We use the concept property norm dataset from the Centre for Speech, Language and the Brain (CSLB) <ref type="bibr">[65]</ref>. The dataset includes binary semantic features (e.g., has_wheels) labeled for 638 concepts collected from 123 human participants. We keep 390 features that each contain at least 5 samples. We hypothesize that the grounded word embeddings can be read out with a linear and sparse projection to readily support binary classification attainable by humans. To test this hypothesis, we train a logistic regression model with L1 regularization to predict each binary semantic feature from the grounded word embeddings and repeat this for ungrounded semantics for comparison. See Appendix B.2 for details about this dataset and our evaluation method.</p><p>Results suggest that the grounded word embeddings are significantly more predictive of visually relevant binary features than ungrounded counterparts obtained by Bert (Wilcoxon Signed Rank Test; p &lt; 0.0001) (Fig. <ref type="figure">4</ref>). This difference is less pronounced but still significant for the other feature types, namely other perceptual (e.g., has_flavors), functional (e.g., does_cut), encyclopaedic (e.g., is_dangerous), and taxonomic (e.g., is_clothing) features, especially after visual grounding of object relations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Clustering of word representations</head><p>After visual grounding, the semantic representations tend to group themselves based on perceptual similarity. We use the SemCat dataset (9,197 English words from N = 100 categories) <ref type="bibr">[62]</ref> and calculate the Silhouette coefficient (between -1 and 1) to measure the degree to which these words are clustered by categories. The distance between word embeddings is measured as the cosine distance (see details in Appendix B.3). The Silhouette coefficients across 100 categories are significantly higher for the visually grounded semantics than for the ungrounded ones (Wilcoxon Signed Rank Test; p &lt; 0.0001) (Fig. <ref type="figure">5 left</ref>). The greatest gains in clustering are noticeable for categories that include concrete concepts (e.g., car, housing, mammal) with defining visual attributes (Fig. <ref type="figure">5</ref> right). For some abstract categories related to human emotion (e.g., happy), the grounded representations are also better clustered than the ungrounded ones.</p></div>
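<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked illustration of the clustering measure above, the following plain-Python sketch computes a per-category Silhouette coefficient with cosine distance. It follows the standard definition of the coefficient and is not taken from the paper's Appendix B.3; the function names, the toy vectors, and the convention of scoring a singleton category as 0 are assumptions.</p>
<p><![CDATA[
```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def silhouette_by_category(embeddings, labels):
    """Mean Silhouette coefficient of the words in each category."""
    categories = sorted(set(labels))
    per_category = {c: [] for c in categories}
    for i, (vec, own) in enumerate(zip(embeddings, labels)):
        # Mean distance from word i to the other members of every category.
        mean_dist = {}
        for c in categories:
            dists = [
                cosine_distance(vec, embeddings[j])
                for j in range(len(labels))
                if labels[j] == c and j != i
            ]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        a = mean_dist.get(own)  # cohesion: distance to the word's own category
        others = [v for c, v in mean_dist.items() if c != own]
        if a is None or not others or max(a, min(others)) == 0.0:
            s = 0.0  # singleton or degenerate case (assumed convention)
        else:
            b = min(others)  # separation: nearest other category
            s = (b - a) / max(a, b)
        per_category[own].append(s)
    return {c: sum(v) / len(v) for c, v in per_category.items()}


# Toy embeddings: two tight groups pointing in different directions.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clustered = silhouette_by_category(vectors, ["a", "a", "b", "b"])
mixed = silhouette_by_category(vectors, ["a", "b", "a", "b"])
```
]]></p>
<p>Labels that agree with the geometry give coefficients near 1, whereas labels that cut across the two groups give negative coefficients, which mirrors how the measure separates well-clustered from poorly clustered categories.</p></div>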
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Visually informed compositional reasoning</head><p>A drawback of distributional semantics is the inability to make visually informed compositional reasoning. We know that "zebra is a horse with black and white stripes", because we have seen what a zebra looks like, whereas an ungrounded language model is never or rarely exposed to such information <ref type="bibr">[10]</ref>. We test whether the visually grounded semantics can perform compositional reasoning based on visual knowledge, without being explicitly trained to do so. We choose some words (Table <ref type="table">2</ref>), for which the meaning can be intuitively inferred from the combination of other words.</p><p>Figure 6: … The right part shows the corresponding results after visual grounding of natural language. Orange lines indicate words with increased ranking after visual grounding and blue lines indicate words with decreased ranking. We highlight in red the target word "zebra" for this specific example, which shows a significant increase in cosine similarity (from 0.12 to 0.60) and ranking (from 2914 to 12 out of 6238 unique words). Besides, the top words similar to "striped horse" are all horse-like animals after visual grounding, but this is not the case for the ungrounded Bert model.</p><p>Table <ref type="table">2</ref>: Examples of visually informed conceptual composition. Each row shows the cosine similarity and its ranking in the vocabulary (unique words in the SemCat dataset) between the query phrase and the target word. Except (hot weather, summer), all others are concepts supported by composition of visual knowledge in the query phrase. For each case, the highest similarity and rank are in bold.</p><p>For example, we use the phrase "striped horse" as a compositional query to search for the matched words ranked in terms of cosine similarity. In Fig. <ref type="figure">6</ref>, the left part shows the cosine similarity and ranking between each of the listed words and the query phrase "striped horse" before visual grounding. The right part shows the corresponding results after visual grounding of natural language. With the grounded semantic representation, the phrase striped horse is highly similar to the word zebra (cosine similarity: 0.60), which is ranked 12th in the vocabulary. After further grounding the language model with visual object relations, the target word zebra has an even higher cosine similarity of 0.63 and is ranked 8th in the vocabulary (Table <ref type="table">2</ref>). Other top-ranked words all refer to horse-like animals (i.e., horse, mule, mare, stallion, donkey, camel, antelope). This is in sharp contrast to the ungrounded semantic space, in which it is impossible to relate striped horse to zebra based on the similarity of their representations (cosine similarity: 0.12; rank: 2,914). The ungrounded Bert model highlights the top-3 similar words as tomcat, seahorse, squirrel, which are animals sharing fewer visual features with horse-like animals. See other examples in Table <ref type="table">2</ref> and Appendix B.4.</p><p>In our model, the cross-attention module forms a joint representational space to combine both visual and textual input. We explore whether this joint space can be used to support cross-modal tasks, e.g., image search based on image, text, or their combinations <ref type="bibr">[13]</ref>. For this task, we add two additional heads (F_V and F_L). Each head includes two linear layers with ReLU in between followed by average pooling (see details in Appendix B.5). It is applied to either visual or textual representations in the joint space and results in a single vector representation for an image or a text (Eq. 6, d = 768). 
While freezing the model described in Section 3.2, we train the two additional heads with a contrastive loss to match the average-pooled representations of paired images and texts in terms of their cosine similarity using the MS COCO dataset. To use the model for image search, we apply a weighted sum to the normalized representations of a query image and a query text (with weights 1 - &#945; and &#945;, where 0 &#8804; &#945; &#8804; 1). We use this multimodal query (Eq. 7) to search a held-out database<ref type="foot">foot_2</ref> for the matched images ranked in terms of cosine similarity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Multimodal image search</head><p>As &#945; controls the weighting between the textual and visual queries, we test how the image search returns different results as &#945; increases from 0 (image only) to 1 (text only). For example, when we combine a word (horse) and an image (a striped pattern) into a query, the search finds images similar to the zebra's skin pattern when &#945; is close to 0, or finds images of typical horses when &#945; is close to 1, but does not necessarily find a zebra in either case until &#945; is close to 0.5 (Fig. <ref type="figure">7</ref>, left). This observation generalizes to other examples. See similarly graded changes in Fig. <ref type="figure">7</ref>, right.</p></div>
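<div xmlns="http://www.tei-c.org/ns/1.0"><p>The multimodal query described above, a weighted sum of the normalized image and text representations with weights 1 - alpha and alpha followed by ranking by cosine similarity, can be sketched as follows. This is a toy plain-Python illustration with hand-made 2-dimensional vectors standing in for the outputs of the two heads. The function names and the database entries are hypothetical, and the exact form of Eq. 7 is assumed from the description in the text.</p>
<p><![CDATA[
```python
import math


def normalize(v):
    """Scale a non-zero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def multimodal_query(image_vec, text_vec, alpha):
    """Weighted sum of normalized image and text vectors (weights 1 - alpha and alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    img, txt = normalize(image_vec), normalize(text_vec)
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(img, txt)]


def search(query, database):
    """Return database keys sorted by descending cosine similarity to the query."""
    q = normalize(query)
    sims = {
        name: sum(a * b for a, b in zip(q, normalize(vec)))
        for name, vec in database.items()
    }
    return sorted(sims, key=sims.get, reverse=True)


# Toy joint space: axis 0 stands for "striped", axis 1 for "horse-like" (made-up values).
stripe_image = [1.0, 0.0]
horse_text = [0.0, 1.0]
database = {
    "striped_pattern": [1.0, 0.05],
    "horse": [0.05, 1.0],
    "zebra": [1.0, 1.0],
}
image_only = search(multimodal_query(stripe_image, horse_text, 0.0), database)
balanced = search(multimodal_query(stripe_image, horse_text, 0.5), database)
text_only = search(multimodal_query(stripe_image, horse_text, 1.0), database)
```
]]></p>
<p>In this toy setting the top result moves from the striped pattern (alpha = 0) to the zebra (alpha = 0.5) to the horse (alpha = 1), which is the kind of graded change the section describes.</p></div>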
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Discussion</head><p>In summary, we apply visual grounding to not only words but also relations between words through cross-modal contrastive learning. The results suggest that grounding language learning in vision renders semantic representations more interpretable by human intuition. The grounded semantic space has its principal dimension encode the concrete-to-abstract variation consistent with human ratings and neurobiological knowledge. The grounded semantic representations are better clustered by finer categories and capable of compositional reasoning (e.g., zebra = striped horse). In addition, our work also shows compelling evidence that both text and image-informed semantics are represented in a common, continuous, and grounded semantic space. Although this notion has been hypothesized in neuroscience and linguistics, it has been rarely implemented and demonstrated with computational models. Uniquely, we demonstrate that a continuously varying combination of a text and an image into a multimodal query can be used to search images, showing results that make intuitive sense.</p><p>Several limitations of our work are noteworthy. The datasets used to train our model are orders of magnitude smaller than those used in recent studies <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>. Scaling up the model training with increasingly larger datasets is expected to greatly improve the model's performance for cross-modal tasks, while generally preserving the interpretability of the grounded language representations as described herein. Some of our experiments and results are preliminary and primarily for illustrative purposes and await more comprehensive and quantitative evaluation in future studies, especially with more downstream vision-language tasks. 
Whereas our evaluation focuses on the language model, grounding language to vision may also have refined the visual stream, which awaits further evaluation against visual tasks, as demonstrated in <ref type="bibr">[13,</ref><ref type="bibr">14]</ref>.</p><p>The visually grounded language model may be usable as a computational model for studying grounded cognition, a theory in cognitive science <ref type="bibr">[15,</ref><ref type="bibr">16]</ref>. Ungrounded linguistic models are explanatory about semantic processing in the brain's language network <ref type="bibr">[66]</ref><ref type="bibr">[67]</ref><ref type="bibr">[68]</ref><ref type="bibr">[69]</ref><ref type="bibr">[70]</ref>. Combining the grounded language model with human behavioral and neural data may elucidate how the language network interacts with distributed sensory and motor areas for semantic processing <ref type="bibr">[71]</ref>.</p><p>It is natural to extend this study by incorporating other sensory input <ref type="bibr">[72,</ref><ref type="bibr">73]</ref> and by further grounding language learning in action <ref type="bibr">[74]</ref><ref type="bibr">[75]</ref><ref type="bibr">[76]</ref><ref type="bibr">[77]</ref> and emotion <ref type="bibr">[78]</ref>. This study also leaves open the question of whether grounding should occur at an early or late stage of natural language processing, which awaits further exploration and evaluation. For comprehensive modeling of language grounding, it is desirable to expose an agent to a naturalistic and multi-sensory environment and to let it engage in interactive actions so that it can acquire knowledge about the physical world, much as humans do when learning language.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>35th Conference on Neural Information Processing Systems (NeurIPS 2021).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>bert-base-uncased: https://huggingface.co/transformers/pretrained_models.html</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>41,600 images from the validation dataset of Open Images Dataset V6.</p></note>
		</body>
		</text>
</TEI>
