<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Tag-grounded Visual Instruction Tuning with Retrieval Augmentation</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</publisher>
				<date>11/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10615776</idno>
					<idno type="doi">10.18653/v1/2024.emnlp-main.120</idno>
					
					<author>Daiqing Qi</author><author>Handong Zhao</author><author>Zijun Wei</author><author>Sheng Li</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[GPT4:This photo features four mangosteens on a wooden table. Three of them are uncut, showing their round, dark purple exterior. The fourth mangosteen is cut open, revealing its white, fleshy interior, which is divided into segments. The table surface has a natural wood grain pattern and adds a rustic touch to the image.
LLaVA:The image features a close-up of three small, round, and purple fruits, possibly pomegranates, sitting on a wooden surface. The fruits are placed in a row, with one on the left, one in the middle, and one on the right. In addition to the fruits, there is a knife located on the right side of the image, likely used for cutting the fruits. The scene appears to be a still life composition, showcasing the vibrant color and texture of the fruits.
ShareGPT4V:The image captures a close-up view of four lychee fruits, their vibrant colors standing out against the wooden surface they rest on. The fruits are arranged in a square formation, with one at each corner. The top left and bottom right corners each hold a whole lychee, their green tops contrasting with the deep purple of the fruit. The top right corner features a lychee that has been cut open, revealing the white interior that is characteristic of this tropical fruit. The background, though blurred, hints at the presence of green leaves, adding a touch of nature to the scene.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ours:</head><p>The image features a close-up of several mangosteen fruits. There are four mangosteens in the scene, with one of them cut open, revealing the white flesh inside. The other three mangosteens are still intact, showcasing their unique texture and appearance. The fruits are placed on a wooden surface, which adds a natural and rustic touch to the scene. The close-up view highlights the vibrant colors and intricate details of the mangosteen fruits. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Abstract</head><p>Despite recent advances in the general visual instruction-following ability of Multimodal Large Language Models (MLLMs), they still struggle with critical problems when required to provide a precise and detailed response to a visual instruction: (1) failure to identify novel objects or entities, (2) mention of non-existent objects, and (3) neglect of objects' attribute details.</p><p>Intuitive solutions include improving the size and quality of data or using larger foundation models. These are effective in mitigating the issues, but at the expensive cost of collecting a vast amount of new data and introducing a significantly larger model. Standing at the intersection of these approaches, we examine the three object-oriented problems from the perspective of the image-to-text mapping process performed by the multimodal connector. In this paper, we first identify the limitations of multimodal connectors stemming from insufficient training data. Motivated by this, we propose to enhance the mapping with retrieval-augmented tag tokens, which contain rich object-aware information such as object names and attributes. With our Tag-grounded visual instruction tUNing with retrieval Augmentation (TUNA), we outperform baselines that share the same language model and training data on 12 benchmarks. Furthermore, we show the zero-shot capability of TUNA when provided with specific datastores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Multimodal Large Language Models (MLLMs) have witnessed remarkable progress recently <ref type="bibr">(Chen et al., 2023c;</ref><ref type="bibr">Liu et al., 2023a;</ref><ref type="bibr">Liu et al., 2024;</ref><ref type="bibr">Bai et al., 2023;</ref><ref type="bibr">Chen et al., 2023a;</ref><ref type="bibr">Dai et al., 2023;</ref><ref type="bibr">Ye et al., 2023;</ref><ref type="bibr">Zhu et al., 2023a;</ref><ref type="bibr">Zhang et al., 2023)</ref>, exhibiting superior ability in following vision-and-language instructions. Despite their effectiveness in providing general responses, their performance often degrades when they are required to give a detailed and accurate answer to a question about an image containing novel objects, named entities, or complex scenes with rich and subtle details. Specifically, they frequently encounter challenges (Fig. <ref type="figure">1</ref>) in: (1) identifying novel objects and named entities, (2) avoiding the generation of objects that do not appear in the target images, and (3) delivering a comprehensive description that covers the details of the target images.</p><p>Figure <ref type="figure">2</ref>: Top: the process of translating image embeddings to text embeddings (LLaVA <ref type="bibr">(Liu et al., 2024)</ref>). Bottom: image classification accuracy of CLIP <ref type="bibr">(Radford et al., 2021)</ref> and MLLMs built on it. Legend of the top panel: LLM text embedding space; CLIP embedding space; span of LLaVA data (1.2M); span of datastore (15M); span of LLM space; connector mapping; retrieval mapping; samples; retrieved samples.</p><p>We uncover some of the potential causes of the above challenges, starting from the commonly adopted two-branch structure and the two-stage training paradigm of MLLMs: first-stage pre-training and second-stage supervised fine-tuning (SFT). 
Most existing MLLMs such as LLaVA <ref type="bibr">(Liu et al., 2024)</ref> comprise two modules: (1) a vision branch consisting of a vision encoder and a multimodal connector, and (2) a Large Language Model (LLM). In the pre-training stage with large-scale image-text pairs, the multimodal connector learns to translate the outputs of the vision encoder to text embeddings; the SFT stage then enhances the multimodal instruction-following capabilities with instruction-format data.</p><p>Despite the promising zero-shot capability of the vision encoder, such as CLIP <ref type="bibr">(Radford et al., 2021)</ref>, which is pre-trained with over 400M image-text pairs, its generalizability is bottlenecked by the learnt mapping of the multimodal connector when integrated into the MLLM framework. For example, in the case of LLaVA <ref type="bibr">(Liu et al., 2024)</ref>, the two-stage training data is significantly smaller than the pre-training data of its vision encoder CLIP (1.2M vs. 400M). As a result, the connector often fails to effectively map out-of-distribution (OOD) images to the corresponding LLM text embeddings, and the LLM therefore fails to identify image contents. The degradation of MLLMs on image classification <ref type="bibr">(Zhai et al., 2023)</ref> is a simple illustration. In Fig. <ref type="figure">2</ref> (Bottom), an obvious classification performance gap between MLLMs and their frozen vision encoder (CLIP) is observed. 
The absence of similar classification objects in LLaVA's training data could be a critical factor, which makes it particularly hard for the multimodal connector to translate OOD CLIP embeddings of test images to LLM text embeddings.</p><p>One intuitive solution is to enrich the training datasets with more image-text pairs. However, as high-quality instruction-format data is particularly critical for visual instruction tuning <ref type="bibr">(Chen et al., 2023c)</ref>, it is very expensive to build high-quality training data from hundreds of millions of image-text pairs of varying quality. Furthermore, training at that scale could also become exceedingly burdensome.</p><p>Instead of directly improving the connector mapping with heavy training, could we build another lightweight mapping as a complement that effectively attends to objects, especially OOD ones? Motivated by retrieval-augmented generation (RAG) <ref type="bibr">(Ramos et al., 2023b,a;</ref><ref type="bibr">Yang et al., 2023;</ref><ref type="bibr">Hu et al., 2023;</ref><ref type="bibr">Lin et al., 2024;</ref><ref type="bibr">Li et al., 2023c;</ref><ref type="bibr">Yasunaga et al., 2022)</ref>, we propose a retrieval mapping. As shown in Fig. <ref type="figure">2</ref> (Top), while the connector fails to correctly map a sample outside the span of the LLaVA training data to its corresponding text embedding in the LLM embedding space (i.e., the blue triangle sample is incorrectly mapped to the yellow square sample), we introduce a large-scale external datastore with better coverage of novel objects, named entities, and attributes, from which useful knowledge about the input image is retrieved. In this way, a new retrieval mapping can be built from the input image to the corresponding LLM text embeddings (green dashed line in Fig. 
<ref type="figure">2</ref>).</p><p>While most existing works retrieve relevant captions as extra knowledge, this may not apply here because all three challenges mentioned above are object-oriented, so cleaner object-aware knowledge is needed instead of noisy captions. Therefore, we retrieve the tags of images that are similar to the input image as extra knowledge, and further enrich each tag representation with image region features and adaptive weights to fulfill the potential of useful tags. To this end, we introduce Tag-grounded visual instruction tUNing with retrieval Augmentation, termed TUNA, which performs knowledge-aware and tag-grounded generation. With grounded tags, TUNA is effective in identifying novel objects and named entities, and in generating tag-oriented responses that pay more attention to image details.</p><p>We summarize our contributions as follows: (i) We identify potential factors hindering MLLMs and are the first to propose tag-grounded visual instruction tuning with retrieval augmentation (TUNA), with enhanced knowledge of novel objects, more attention to details, and less mention of non-existent objects. (ii) To fulfill the potential of tags, we carefully design an image-aware tag encoder, which produces tag embeddings enhanced by image features with an adaptive weight. (iii) We evaluate TUNA on extensive benchmarks along with a series of qualitative results, and show its zero-shot capability when provided with specific datastores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Works</head><p>Multimodal Large Language Models. MLLMs are evolving rapidly. While earlier LLM-based works <ref type="bibr">(Li et al., 2022,</ref><ref type="bibr">2023d)</ref> enable basic visual tasks like visual question answering, more recent works <ref type="bibr">(Chen et al., 2023a;</ref><ref type="bibr">Liu et al., 2024)</ref> show proficiency in image-text dialogues through alignment and fine-tuning. Subsequent research <ref type="bibr">(Bai et al., 2023;</ref><ref type="bibr">Chen et al., 2023b;</ref><ref type="bibr">Dai et al., 2023;</ref><ref type="bibr">Li et al., 2023a;</ref><ref type="bibr">Peng et al., 2023;</ref><ref type="bibr">Ye et al., 2023;</ref><ref type="bibr">You et al., 2023)</ref> enhances LLMs by emphasizing data quality and diversity. With grounding data, a branch of works <ref type="bibr">(Ye et al., 2023;</ref><ref type="bibr">You et al., 2023;</ref><ref type="bibr">Chen et al., 2023b;</ref><ref type="bibr">Peng et al., 2023)</ref> improves LLMs' grounding capability. Despite this evolution, these models share a similar multimodal connector module that performs image-to-text translation, so a fundamental problem persists: out-of-distribution (OOD) images, such as those with novel objects, named entities, or new scenes, cannot be translated to text embeddings effectively, leading to misaligned answers, missing details, or mention of non-existent objects by the LLM.</p><p>Retrieval-Augmented Multimodal Learning. Retrieval-augmented language generation (RAG) consists of conditioning generation on additional information that is retrieved (e.g., with clustering <ref type="bibr">(Zhao et al., 2017)</ref>) from an external datastore. 
Recently, a branch of works <ref type="bibr">(Ramos et al., 2023b,a;</ref><ref type="bibr">Yang et al., 2023;</ref><ref type="bibr">Hu et al., 2023;</ref><ref type="bibr">Lin et al., 2024;</ref><ref type="bibr">Li et al., 2023c)</ref> integrates it into image captioning, where relevant captions are retrieved to guide the captioning. In contrast, visual instruction tuning often requires detailed and dense responses to multimodal instructions, so cleaner object-level information, such as the names and attributes of novel objects and named entities, is needed. We provide a more detailed discussion in Appendix A.</p><p>Multimodal Learning with Tags. Existing works <ref type="bibr">(Huang et al., 2023;</ref><ref type="bibr">Zhou et al., 2020;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Hu et al., 2021;</ref><ref type="bibr">Qi et al., 2024a;</ref><ref type="bibr">Huang et al., 2022)</ref> show the effectiveness of introducing object tags as anchor points to help the learning of semantic alignments between images and texts in the training data. In the context of Fig. <ref type="figure">2</ref>, they better align in-distribution data (yellow and purple samples) with tags. Our goal differs from theirs: we do not aim to learn better representations of training data. Instead, we want to (1) improve the tag-grounded generation capability of MLLMs and (2) acquire new knowledge with retrieved tags from an external datastore. Besides, as they treat object tags as anchor points for feature learning, their tags are commonly human-used ones <ref type="bibr">(Huang et al., 2023)</ref> that serve as guidance. For instance, Tag-to-Text <ref type="bibr">(Huang et al., 2023)</ref> collects 3,429 well-used tags filtered by human annotation. In our case, where large coverage is the priority, less frequently used tags (e.g., named entities) are also desired, resulting in a total of 3M tags (details in Appendix A).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Tag-grounded Visual Instruction Tuning</head><p>In this section, we first introduce how we extract tags from 15M captions from CC12M <ref type="bibr">(Changpinyo et al., 2021)</ref> and CC3M <ref type="bibr">(Sharma et al., 2018)</ref>. Then we present how we build and use the datastore, followed by an illustration of TUNA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Multimodal Retriever</head><p>From Captions to Tags. As introduced in Sec. 1, one of the fundamental challenges for MLLMs is to effectively translate image tokens to LLM text embeddings, especially for OOD images that contain novel objects. With better translation, LLMs would be less likely to be confused by such objects, which could improve object identification. Thus, in addition to the mapping learnt by the connector, we use a multimodal retriever to retrieve relevant information as an additional retrieval mapping (Fig. <ref type="figure">2</ref>) that enhances the translation process. The quality of the retrieval mapping is therefore critical, and object-oriented tags are a particularly helpful form of retrieved information. Additionally, with tag-grounded generation, retrieved tags also serve as groundings or hints that prompt the LLM to generate tag-aware content when a tag is relevant to the input image, which also helps alleviate missing objects or visual details.</p><p>Table 1: Extracted tags from CC3M and CC12M.</p><p>Towards this end, we use CLIP image embeddings from image-text paired datasets as keys and the corresponding tags as values. However, existing large-scale image-text datasets such as Conceptual Captions <ref type="bibr">(Sharma et al., 2018;</ref><ref type="bibr">Changpinyo et al., 2021)</ref> only contain captions. To mine tags from texts, we parse each caption into a set of tags with a combination of the FACTUAL scene graph parser <ref type="bibr">(Li et al., 2023f)</ref> and Named Entity Recognition (NER) with spaCy, yielding 3M tags extracted from 15M captions in CC3M <ref type="bibr">(Sharma et al., 2018)</ref> and CC12M <ref type="bibr">(Changpinyo et al., 2021)</ref>. We show several examples in Fig. <ref type="figure">3</ref>. Details of the mining process are available in Appendix B. We also provide statistics of the obtained tags in Tab. 1.</p><p>
Datastore and Cross-Modal Retrieval. With the processed image-tag pairs, our datastore is indexed with the FAISS library <ref type="bibr">(Johnson et al., 2019)</ref>, using image CLIP embeddings as keys and the associated tags as values. Given a query image, we perform a k-nearest-neighbor retrieval based on the cosine similarity between its embedding and those of the datastore images. The tags of the top-k retrieved images are input to TUNA as additional knowledge. In experiments, we use k=5.</p><p>We consider CC12M <ref type="bibr">(Changpinyo et al., 2021)</ref>, CC3M <ref type="bibr">(Sharma et al., 2018)</ref> and the COCO <ref type="bibr">(Lin et al., 2014)</ref> training set as our datastore, resulting in 15M image-text pairs. In experiments, we use the full combination, as well as parts of it, as our datastore to study how different datastores affect results.</p><p>For Fashion QA, we use a combination of fashion data as our retrieval datastore.</p></div>
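<div xmlns="http://www.tei-c.org/ns/1.0"><p>The retrieval step described in Sec. 3.1 can be illustrated with a minimal Python sketch. It performs brute-force k-nearest-neighbor retrieval with cosine similarity over a toy datastore of (image embedding, tags) pairs. This is a stand-in under stated assumptions rather than the actual implementation: the paper indexes CLIP embeddings with FAISS, whereas the three-dimensional vectors, the function names <code>cosine_similarity</code> and <code>retrieve_tags</code>, and the de-duplication of the mixed tags are hypothetical choices made for readability.</p>
<p><code lang="python"><![CDATA[
```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_tags(query_embedding, datastore, k=5):
    """Return the tags of the k most similar datastore images.

    datastore is a list of (embedding, tags) pairs. Tags are mixed together
    in rank order; dropping duplicates is an assumption of this sketch.
    """
    ranked = sorted(
        datastore,
        key=lambda entry: cosine_similarity(query_embedding, entry[0]),
        reverse=True,
    )
    tags = []
    for _, entry_tags in ranked[:k]:
        for tag in entry_tags:
            if tag not in tags:
                tags.append(tag)
    return tags


# Toy datastore: keys stand in for CLIP image embeddings, values are tags.
toy_datastore = [
    ([0.9, 0.1, 0.0], ["mangosteen", "fruit"]),
    ([0.8, 0.2, 0.1], ["mangosteen", "tropical"]),
    ([0.0, 1.0, 0.0], ["snowboard"]),
    ([0.1, 0.0, 1.0], ["building", "painting"]),
]
query = [1.0, 0.1, 0.0]  # stand-in for the query image embedding
retrieved = retrieve_tags(query, toy_datastore, k=2)
print(retrieved)  # ['mangosteen', 'fruit', 'tropical']
```
]]></code></p>
<p>With a datastore of 15M entries a linear scan like this would be slow, which is presumably why an approximate or optimized index such as FAISS is used in practice; the ranking criterion (cosine similarity, top-k) is the same.</p></div>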
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">TUNA</head><p>Architecture. The framework of TUNA is illustrated in Fig. <ref type="figure">4</ref>. Given a language instruction <formula notation="TeX">X_q</formula> and an input image <formula notation="TeX">X_v</formula>, a set of images with associated tags is retrieved from the datastore. Assuming there are <formula notation="TeX">M</formula> tags in total, they are mixed together and denoted as <formula notation="TeX">\{X_t^i\}_{i=1}^{M}</formula>. For the image, a frozen pre-trained CLIP vision encoder ViT-L/14 is employed to extract the visual feature <formula notation="TeX">Z_v = g(X_v) \in \mathbb{R}^{[H \times W] \times D}</formula>, followed by an MLP multimodal connector <formula notation="TeX">h(\cdot)</formula> that translates the CLIP vision feature to text embeddings: <formula notation="TeX">H_v = h(Z_v)</formula>.</p><p>Similar to LLaVA <ref type="bibr">(Liu et al., 2024)</ref>, the grid visual features before the last Transformer layer are considered in our experiments. The language instruction <formula notation="TeX">X_q</formula> is tokenized and projected to text embeddings <formula notation="TeX">H_q</formula> by the pre-trained LLM's tokenizer and embedding layer. Specifically, the tags <formula notation="TeX">\{X_t^i\}_{i=1}^{M}</formula> are encoded by our image-aware tag encoder.</p><p>Image-Aware Tag Encoder. Given a tag <formula notation="TeX">X_t^i</formula>, its tag representation <formula notation="TeX">H^i</formula>, which is encoded by our image-aware tag encoder, is a tuple of its text embedding <formula notation="TeX">H_t^i</formula> and its tag-aware image token (embedding) <formula notation="TeX">H_{vt}^i</formula>, which contains visual features of the input query image related to this tag. With this image token, the LLM can better attend to details of the tag-related object in the input image. As with <formula notation="TeX">X_q</formula>, the tag <formula notation="TeX">X_t^i</formula> is tokenized and projected to <formula notation="TeX">H_t^i</formula> with the LLM's tokenizer and embedding layer. To obtain the tag-aware image token, the tag-aware image feature <formula notation="TeX">Z_{vt}^i \in \mathbb{R}^{1 \times D}</formula> is first extracted from the grid visual features of the input image via the cross-attention module: <formula notation="TeX">Z_{vt}^i = \mathrm{CrossAttn}(Q_t^i, Z_v)</formula>, where <formula notation="TeX">Q_t^i</formula> is the tag feature extracted by the frozen CLIP text encoder. Then we obtain the tag-aware image token <formula notation="TeX">H_{vt}^i = h(Z_{vt}^i)</formula>. Finally, the tag representation <formula notation="TeX">H^i</formula> consists of the tuple <formula notation="TeX">(H_{vt}^i, H_t^i)</formula>. Iterating over all tags, we have <formula notation="TeX">\{H^i\}_{i=1}^{M}</formula>.</p><p>Adaptive Weight Tuner. 
As retrieved images may contain less relevant or irrelevant tags, e.g., the tag durian in Fig. 4, we apply an adaptive weight tuner over them to give more attention to highly relevant tags while ignoring less related ones. Specifically, the score of <formula notation="TeX">H^i</formula> is the cosine similarity between <formula notation="TeX">Q_t^i</formula> and the global CLIP visual feature (i.e., the &lt;CLS&gt; token) of the input image. The scores are normalized to [0,1] as the final weights, which are applied to <formula notation="TeX">H_{vt}^i</formula> and <formula notation="TeX">H_t^i</formula> before they are input to the LLM.</p><p>Supervised Fine-Tuning. We consider Vicuna-7B <ref type="bibr">(Chiang et al., 2023)</ref>, a decoder-only LLM instruction-tuned on top of LLaMA <ref type="bibr">(Touvron et al., 2023)</ref>, as our language model. We use both the image and text encoders from CLIP-ViT-L/14@336p.</p><p>We initialize the pre-trained multimodal connector from LLaVA-1.5 <ref type="bibr">(Liu et al., 2023a)</ref>. During instruction tuning, we always keep the weights of the vision encoder frozen, and update both the pre-trained weights of the connector and the LLM.</p><p>Figure 4 (framework of TUNA), labels recovered from the diagram: input image; language instruction ("Please describe this photo in detail."); datastore; retrieved images and tags (e.g., mangosteen, fruit, durian); vision encoder; connector; text encoder; image-aware tag encoder with cross-attention, tag-aware image features, tag-aware image tokens, tag text tokens, attention scores and adaptive weight tuner; LLM tokenizer and LLM embedding layer; language model (Vicuna-v1.5-7B); language response ("The image features a close-up of several mangosteen fruits. There are four mangosteens in the scene, with one of them cut open, revealing the white fles ...").</p></div>
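<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the image-aware tag encoder and the adaptive weight tuner of Sec. 3.2 more concrete, the following Python sketch computes a tag-aware image feature and per-tag weights on toy two-dimensional vectors. Several details are assumptions rather than facts from the paper: the cross-attention is reduced to single-query scaled dot-product attention without learned projections, the scores are mapped to [0,1] by min-max normalization (the text only states that they are normalized to [0,1]), the connector <code>h</code> is omitted, and all function names and numbers are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def tag_aware_image_feature(tag_query, grid_features):
    """Attend from one tag feature (query) over the grid visual features.

    Assumption: plain scaled dot-product attention, with the grid features
    used as both keys and values and no learned projections.
    """
    d = len(tag_query)
    scores = [dot(tag_query, patch) / math.sqrt(d) for patch in grid_features]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    return [
        sum(w * patch[j] for w, patch in zip(attention, grid_features))
        for j in range(d)
    ]


def adaptive_weights(tag_queries, cls_feature):
    """Cosine similarity of each tag feature with the global image feature,
    rescaled to [0, 1]. Min-max rescaling is an assumption of this sketch."""
    sims = [cosine(q, cls_feature) for q in tag_queries]
    low, high = min(sims), max(sims)
    if high == low:
        return [1.0] * len(sims)
    return [(s - low) / (high - low) for s in sims]


# Toy stand-ins for CLIP features (real features are high-dimensional).
tags = ["mangosteen", "fruit", "durian"]
tag_queries = [[0.9, 0.1], [0.7, 0.5], [0.1, 0.9]]
grid_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
cls_feature = [1.0, 0.0]

weights = adaptive_weights(tag_queries, cls_feature)
tag_image_features = [tag_aware_image_feature(q, grid_features) for q in tag_queries]
# Apply each tag's weight to its tag-aware image feature.
weighted_features = [
    [w * x for x in feature] for w, feature in zip(weights, tag_image_features)
]
for tag, weight in zip(tags, weights):
    print(tag, round(weight, 3))
```
]]></code></p>
<p>In this toy setup the most image-relevant tag (mangosteen) receives weight 1 and the least relevant one (durian) receives weight 0, so its contribution is suppressed before reaching the LLM. A different normalization, for example rescaling cosine similarity from [-1,1] to [0,1], would keep a nonzero weight for every tag.</p></div>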
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiment</head><p>In this section, we first present the training details of TUNA and the benchmarks. Then we introduce quantitative and qualitative comparisons with popular open-source models, followed by detailed analysis experiments and ablation studies.</p><p>Training Details. TUNA is fine-tuned on instruction data for one epoch, following existing works <ref type="bibr">(Liu et al., 2023a;</ref><ref type="bibr">Chen et al., 2023c)</ref>.</p><p>We fine-tune separately on two instruction-following datasets, LLaVA-665K <ref type="bibr">(Liu et al., 2023a)</ref> and ShareGPT4V-665K (Chen et al., 2023c), resulting in two versions of our model, TUNA and TUNA<hi rend="superscript">+</hi>, respectively. ShareGPT4V-665K contains instruction-following data of higher quality. Details on the datasets are available in Appendix C. We apply a learning rate of 2e-5 and a batch size of 128. Training takes 12 to 14 hours on 8 A100 GPUs with ZeRO3. Details are available in Appendix C.</p><p>Benchmarks. We compare TUNA with baselines on 12 benchmarks, including VQA benchmarks and multimodal benchmarks designed for LLMs. Details are available in Appendix G.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Comparison with Baselines</head><p>Main Results. In Tab. 2, we provide a quantitative comparison of TUNA with popular open-source MLLMs. On 12 benchmarks, TUNA consistently outperforms previous MLLMs that are fine-tuned on the same instruction-tuning datasets as ours with the same configuration of the vision encoder and language model (Vicuna-7B), with more notable improvements on recent multimodal benchmarks.</p><p>As the size of the LLM and the choice of instruction-following data can significantly affect model performance, we mark in gray the models that are equipped with a larger 13B language model or fine-tuned on currently unavailable datasets of higher quality and quantity. 
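Specifically, LLaVA-1.6 (or LLaVA-NeXT)<ref type="foot">foot_0</ref> is fine-tuned on larger instruction-following data of higher quality, with additional user instruct data. Besides, it is equipped with a better vision encoder with dynamic high resolution, known as AnyRes (AR).</p><p>Although it is not a fair comparison, we still outperform LLaVA-1.6 on MMB<hi rend="superscript">CN</hi>, MMB and POPE, and the corresponding 13B models on MMB<hi rend="superscript">CN</hi>, MMB, POPE and LLaVA<hi rend="superscript">W</hi>.</p><p>How Can TUNA Improve the Recognition of Novel Objects and Entities? As visualized in Fig. <ref type="figure">2</ref> (Top), with our 15M large-scale datastore, the new retrieval mapping can greatly compensate for the original LLaVA multimodal connector, which learns from around 1M data. With the additional mappings from retrieval data, TUNA is expected to show particular improvements on questions about novel objects or entities in the given input image. We show sub-tasks from MME <ref type="bibr">(Fu et al., 2023)</ref> and MMB <ref type="bibr">(Liu et al., 2023b)</ref> that consist of such questions in Tab. 3.</p>
<table><head>Table <ref type="table">2</ref>: Comparison with SoTA methods on 12 benchmarks. Our model achieves the best performance on 12 benchmarks compared with LLMs that are finetuned from the same instruction tuning (IT) datasets with the same configuration on the vision encoder (V-Enc.) and language model (Vicuna-7B). Best results are in bold.</head>
<row role="label"><cell>Method</cell><cell>LLM</cell><cell>V-Enc.</cell><cell>IT</cell><cell>VQA<hi rend="superscript">v2</hi></cell><cell>GQA</cell><cell>VizWiz</cell><cell>SQA<hi rend="superscript">I</hi></cell><cell>VQA<hi rend="superscript">T</hi></cell><cell>POPE</cell><cell>MME</cell><cell>MMB</cell><cell>MMB<hi rend="superscript">CN</hi></cell><cell>SEED</cell><cell>LLaVA<hi rend="superscript">W</hi></cell><cell>MM-Vet</cell></row>
<row><cell>BLIP-2</cell><cell>Vicuna-13B</cell><cell>-</cell><cell>-</cell><cell>41.0</cell><cell>41.0</cell><cell>19.6</cell><cell>61.0</cell><cell>42.5</cell><cell>85.3</cell><cell>1293.8</cell><cell>-</cell><cell>-</cell><cell>46.4</cell><cell>38.1</cell><cell>22.4</cell></row>
<row><cell>InstructBLIP</cell><cell>Vicuna-7B</cell><cell>-</cell><cell>1.2M</cell><cell>-</cell><cell>49.2</cell><cell>34.5</cell><cell>60.5</cell><cell>50.1</cell><cell>-</cell><cell>-</cell><cell>36</cell><cell>23.7</cell><cell>53.4</cell><cell>60.9</cell><cell>26.2</cell></row>
<row><cell>InstructBLIP</cell><cell>Vicuna-13B</cell><cell>-</cell><cell>1.2M</cell><cell>-</cell><cell>49.5</cell><cell>33.4</cell><cell>63.1</cell><cell>50.7</cell><cell>78.9</cell><cell>1212.8</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>58.2</cell><cell>25.6</cell></row>
<row><cell>Shikra</cell><cell>Vicuna-13B</cell><cell>-</cell><cell>5.5M</cell><cell>77.4</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>58.8</cell><cell>-</cell><cell>-</cell><cell>-</cell><cell>-</cell></row>
<row><cell>IDEFICS-9B</cell><cell>LLaMA-7B</cell><cell>-</cell><cell>1M</cell><cell>50.9</cell><cell>38.4</cell><cell>35.5</cell><cell>-</cell><cell>25.9</cell><cell>-</cell><cell>-</cell><cell>48.2</cell><cell>25.2</cell><cell>-</cell><cell>-</cell><cell>-</cell></row>
<row><cell>IDEFICS-80B</cell><cell>LLaMA-65B</cell><cell>-</cell><cell>1M</cell><cell>60.0</cell><cell>45.2</cell><cell>36.0</cell><cell>-</cell><cell>30.9</cell><cell>-</cell><cell>-</cell><cell>54.5</cell><cell>38.1</cell><cell>-</cell><cell>-</cell><cell>-</cell></row>
<row><cell>Qwen-VL</cell><cell>Qwen-7B</cell><cell>-</cell><cell>50M</cell><cell>78.8</cell><cell>59.3</cell><cell>35.2</cell><cell>67.1</cell><cell>63.8</cell><cell>-</cell><cell>-</cell><cell>38.2</cell><cell>7.4</cell><cell>56.3</cell><cell>-</cell><cell>-</cell></row>
<row><cell>Qwen-VL-Chat</cell><cell>Qwen-7B</cell><cell>-</cell><cell>50M</cell><cell>78.2</cell><cell>57.5</cell><cell>38.9</cell><cell>68.2</cell><cell>61.5</cell><cell>-</cell><cell>1487.5</cell><cell>60.6</cell><cell>56.7</cell><cell>58.2</cell><cell>-</cell><cell>-</cell></row>
<row><cell>ShareGPT4V</cell><cell>Vicuna-13B</cell><cell>CLIP V-L 336</cell><cell>665K(S)</cell><cell>81.0</cell><cell>63.4</cell><cell>55.6</cell><cell>71.2</cell><cell>62.2</cell><cell>85.9</cell><cell>1618.7</cell><cell>68.5</cell><cell>63.7</cell><cell>70.8</cell><cell>79.9</cell><cell>43.1</cell></row>
<row><cell>LLaVA-1.5</cell><cell>Vicuna-13B</cell><cell>CLIP V-L 336</cell><cell>665K(L)</cell><cell>80.0</cell><cell>63.3</cell><cell>53.6</cell><cell>71.6</cell><cell>61.3</cell><cell>85.9</cell><cell>1531.3</cell><cell>67.7</cell><cell>63.6</cell><cell>61.6</cell><cell>70.7</cell><cell>35.4</cell></row>
<row><cell>LLaVA-1.6/NeXT</cell><cell>Vicuna-7B</cell><cell>CLIP V-L AR</cell><cell>760K(N)</cell><cell>81.8</cell><cell>64.2</cell><cell>57.6</cell><cell>70.1</cell><cell>64.9</cell><cell>86.5</cell><cell>1519.0</cell><cell>67.4</cell><cell>60.6</cell><cell>70.2</cell><cell>81.6</cell><cell>43.9</cell></row>
<row><cell>ShareGPT4V</cell><cell>Vicuna-7B</cell><cell>CLIP V-L 336</cell><cell>665K(S)</cell><cell>80.6</cell><cell>63.3</cell><cell>57.2</cell><cell>68.4</cell><cell>60.4</cell><cell>85.3</cell><cell>1567.4</cell><cell>68.8</cell><cell>62.2</cell><cell>69.7</cell><cell>72.6</cell><cell>37.6</cell></row>
<row><cell>Ours<hi rend="superscript">+</hi></cell><cell>Vicuna-7B</cell><cell>CLIP V-L 336</cell><cell>665K(S)</cell><cell>81.1</cell><cell>63.4</cell><cell>57.4</cell><cell>70.8</cell><cell>60.4</cell><cell>89.6</cell><cell>1583.8</cell><cell>70.8</cell><cell>65.0</cell><cell>70.6</cell><cell>80.1</cell><cell>40.1</cell></row>
<row><cell>LLaVA-1.5</cell><cell>Vicuna-7B</cell><cell>CLIP V-L 336</cell><cell>665K(L)</cell><cell>78.5</cell><cell>62.0</cell><cell>50.0</cell><cell>66.8</cell><cell>58.2</cell><cell>85.9</cell><cell>1510.7</cell><cell>64.3</cell><cell>58.3</cell><cell>58.6</cell><cell>63.4</cell><cell>30.5</cell></row>
<row><cell>Ours</cell><cell>Vicuna-7B</cell><cell>CLIP V-L 336</cell><cell>665K(L)</cell><cell>79.7</cell><cell>62.6</cell><cell>50.0</cell><cell>68.3</cell><cell>58.4</cell><cell>89.5</cell><cell>1540.0</cell><cell>68.5</cell><cell>64.0</cell><cell>59.6</cell><cell>75.4</cell><cell>33.2</cell></row>
</table></div>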
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Input Image</head><p>Is there a snowboard in the image? Answer the question using a single word or phrase.</p><p>[...] for the connector to map it to somewhere close to the text embeddings of "mangosteen" in the LLM embedding space, as illustrated in Fig. 2. When the question about the given image is a little tricky, e.g., in Fig. 5 (a), where the MLLM is asked whether a painting of a building exists in the form of architecture, LLaVA-1.5 is confused about whether it is real architecture or a painting. However, TUNA easily distinguishes it from real architecture with additional knowledge from the retrieved tags of similar images in the datastore.</p><p>How Can TUNA Help to Identify the Existence of Objects? With an input image, the retrieved images are often similar to it or share a similar context. Intuitively, the retrieved images are very likely to contain elements or objects similar to those in the input image. Therefore, the tags can provide additional hints that prompt the LLM to pay special attention to whether these objects exist. We evaluate our model on POPE <ref type="bibr">(Li et al., 2023e)</ref>, a benchmark designed to test the existence of objects. Results are available in Tab. 4: we outperform competing baselines, including referring and grounding MLLMs such as Ferret <ref type="bibr">(You et al., 2023)</ref> and Shikra <ref type="bibr">(Chen et al., 2023b)</ref>.</p>
<table><head>Table 3: Results on sub-tasks of MME (Fu et al., 2023) and MMB (Liu et al., 2023b), where questions are about novel objects, entities or scenes in the image. Unless otherwise mentioned, the backbone LLM is Vicuna-7B.</head>
<row role="label"><cell>Model</cell><cell>Posters</cell><cell>Celebrity</cell><cell>Artwork</cell><cell>Landmark</cell><cell>Image Style</cell><cell>Celeb</cell></row>
<row><cell>LLaVA-1.5</cell><cell>146.6</cell><cell>137.1</cell><cell>119.5</cell><cell>163.8</cell><cell>69.1</cell><cell>83.8</cell></row>
<row><cell>Ours</cell><cell>155.9</cell><cell>154.7</cell><cell>128.7</cell><cell>166.3</cell><cell>81.1</cell><cell>85.8</cell></row>
</table>
<table><head>Table 4: Results on POPE. We show the most competitive baselines. The full table is available in Appendix F. TUNA outperforms Ferret (You et al., 2023), which is fine-tuned on grounding and referring data.</head>
<row role="label"><cell>Datasets</cell><cell>Metrics</cell><cell>Ours</cell><cell>Ferret</cell><cell>InstructBLIP</cell><cell>LLaVA</cell><cell>mPLUG-Owl</cell></row>
<row><cell>Random</cell><cell>Accuracy (&#8593;)</cell><cell>91.00</cell><cell>90.24</cell><cell>88.57</cell><cell>88.00</cell><cell>53.97</cell></row>
<row><cell>Random</cell><cell>Precision (&#8593;)</cell><cell>98.05</cell><cell>97.72</cell><cell>84.09</cell><cell>97.44</cell><cell>52.07</cell></row>
<row><cell>Random</cell><cell>Recall (&#8593;)</cell><cell>84.10</cell><cell>83.00</cell><cell>95.13</cell><cell>78.80</cell><cell>99.60</cell></row>
<row><cell>Random</cell><cell>F1 Score (&#8593;)</cell><cell>90.93</cell><cell>89.76</cell><cell>89.27</cell><cell>87.13</cell><cell>68.39</cell></row>
<row><cell>Popular</cell><cell>Accuracy (&#8593;)</cell><cell>90.16</cell><cell>84.90</cell><cell>82.77</cell><cell>87.43</cell><cell>50.90</cell></row>
<row><cell>Popular</cell><cell>Precision (&#8593;)</cell><cell>95.46</cell><cell>88.24</cell><cell>76.27</cell><cell>95.24</cell><cell>50.46</cell></row>
<row><cell>Popular</cell><cell>Recall (&#8593;)</cell><cell>84.20</cell><cell>80.53</cell><cell>95.13</cell><cell>78.80</cell><cell>99.40</cell></row>
<row><cell>Popular</cell><cell>F1 Score (&#8593;)</cell><cell>90.56</cell><cell>84.21</cell><cell>84.66</cell><cell>86.24</cell><cell>66.94</cell></row>
<row><cell>Adversarial</cell><cell>Accuracy (&#8593;)</cell><cell>88.43</cell><cell>82.36</cell><cell>72.10</cell><cell>85.50</cell><cell>50.67</cell></row>
<row><cell>Adversarial</cell><cell>Precision (&#8593;)</cell><cell>91.99</cell><cell>83.60</cell><cell>65.13</cell><cell>90.99</cell><cell>50.34</cell></row>
<row><cell>Adversarial</cell><cell>Recall (&#8593;)</cell><cell>84.20</cell><cell>80.53</cell><cell>95.13</cell><cell>78.80</cell><cell>99.33</cell></row>
<row><cell>Adversarial</cell><cell>F1 Score (&#8593;)</cell><cell>87.63</cell><cell>82.00</cell><cell>77.32</cell><cell>84.45</cell><cell>66.82</cell></row>
<row><cell>Average</cell><cell>F1</cell><cell>89.50</cell><cell>85.32</cell><cell>83.75</cell><cell>85.94</cell><cell>67.38</cell></row>
</table>
<p>[...] are available in Tab. 5. TUNA consistently outperforms baselines. We also provide one example in Fig. <ref type="figure">6</ref>. While LLaVA mentions non-existent boats and people, TUNA accurately describes the water body, the green vegetation, and, interestingly, the houses and buildings behind the mountain (zoom in for a better view). More interestingly, no retrieved noun tags are directly related to "houses" or "buildings". By removing tags one by one, we find that the tag "accessible" contributes to the description of houses and buildings. This suggests that not only nouns can remind the LLM of the existence of objects: relevant adjectives can also teach the LLM to pay attention to visual details. In this case, "accessible" means "humans can access this place", which might remind the LLM of the existence of houses and buildings.</p></div>
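<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the metrics reported on POPE, the following Python sketch shows how accuracy, precision, recall and F1 can be computed from yes/no answers to object-existence questions. It assumes that "yes" is treated as the positive class; the function name <code>pope_metrics</code> and the toy answers are hypothetical and are not taken from the benchmark's own evaluation code.</p>
<p><code lang="python"><![CDATA[
```python
def pope_metrics(predictions, labels):
    """Binary classification metrics with "yes" as the positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: six existence questions ("Is there a snowboard in the image?").
labels = ["yes", "yes", "yes", "no", "no", "no"]
predictions = ["yes", "yes", "no", "no", "no", "yes"]
metrics = pope_metrics(predictions, labels)
print(metrics)
```
]]></code></p>
<p>Under this convention, a model that hallucinates objects answers "yes" too often, which raises recall but lowers precision; this matches the pattern of the mPLUG-Owl column in Tab. 4, where recall is close to 100 while precision is close to 50.</p></div>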
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Ablation Study</head><p>Ablation of Adaptive Weight Tuner. Since TUNA is grounded on tags, the quality of tags is critical. However, retrieved tags can be noisy, e.g., the tag durian in Fig. 4. To this end, we apply an adaptive weight tuner in our image-aware tag encoder to allocate more weight to more relevant tags and less weight to less relevant ones. We first ablate the tuner module to show the effectiveness of this simple but critical component in alleviating tag noise. Without the adaptive weight tuner, all retrieved tags are equally important and their weights are set to the maximum value. The result is shown in Tab. 6 (w/o tuner).</p><p>A clear performance drop is observed compared to the full method. This is reasonable because, while related tags provide useful information to the LLM, irrelevant tags are misleading. Although it underperforms the full method, our model without the tuner is still comparable to or slightly better than LLaVA-1.5. This is favourable because it shows that our model itself is somewhat robust against less relevant tags even without the tuner.</p><p>Effectiveness of Instruction Tuning. Since MLLMs are naturally in-context learners, we are interested in the effectiveness of our tag-grounded fine-tuning compared to the vanilla LLaVA-1.5 with tags provided as in-context knowledge.</p><p>For a fair comparison, we apply the weight tuner to both models. We refer to this model as TUNA<hi rend="superscript">-</hi>. Results in Fig. <ref type="figure">6</ref> (w/o FT) indicate that the LLM without tag-grounded instruction tuning cannot make effective use of informative tags.</p><p>Are Tags More Effective than Sentences? We compare TUNA with sentence-level retrieval in Tab. 6 (w/ captions). Instead of tags, we fine-tune TUNA with the captions of retrieved images as additional knowledge. The image-aware tag encoder is also used, but the input tags are replaced by captions. 
Results show that sentence-level retrieval is not helpful. This is reasonable because tags provide cleaner and more object-related knowledge, such as names and attributes, while captions are noisy.</p><p>Would Irrelevant Tags Hurt the Backbone during Inference? Intuitively, a large-scale datastore often covers knowledge relevant to the input image and question, so useful tags can be retrieved. However, there might be corner cases where the retrieved tags are all irrelevant.</p><p>Table 7: Ablations on the choice of datastores.</p><p>LLaVA-1.5. It shows that our method notably improves the backbone performance with useful tags and does not hurt the backbone performance when only irrelevant tags are available.</p><p>Different Choices of Datastore. We also study how different choices of datastores affect the model performance. In the default setting, we use a combination of CC12M, CC3M and the COCO training set. In addition, we perform the tag-grounded instruction tuning with different datastores and use each of them for retrieval during inference. Results are available in Tab. 7. The default setting, which has the largest datastore, outperforms the other settings.</p><p>We provide a detailed analysis in Appendix H.</p></div>
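<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of how a datastore supplies tags at inference time, the sketch below ranks datastore entries by cosine similarity to the input image embedding and pools the tags of the top-k entries. It is a toy under stated assumptions, not the authors' implementation: the function names, the 3-dimensional embeddings and the three-entry datastore are hypothetical, whereas in the paper the embeddings come from CLIP and the datastore holds millions of image-text pairs.</p><ab><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_tags(query_emb, datastore, k=2):
    """Pool the tags of the k datastore entries most similar to query_emb.

    datastore is a list of (embedding, tags) pairs; this layout is a
    hypothetical choice made for the example.
    """
    ranked = sorted(
        datastore,
        key=lambda entry: cosine(query_emb, entry[0]),
        reverse=True,
    )
    pooled = []
    for _, tags in ranked[:k]:
        for tag in tags:
            if tag not in pooled:  # keep first occurrence, preserve order
                pooled.append(tag)
    return pooled


# Hypothetical toy datastore: (image embedding, tags mined from its caption).
toy_datastore = [
    ([1.0, 0.0, 0.0], ["mangosteen", "wooden table"]),
    ([0.9, 0.1, 0.0], ["mangosteen", "fruit"]),
    ([0.0, 1.0, 0.0], ["durian"]),
]
query = [1.0, 0.05, 0.0]
retrieved = retrieve_tags(query, toy_datastore, k=2)
print(retrieved)
```
]]></ab><p>With a larger k, the unrelated "durian" entry would also be pooled, which is the kind of noisy tag the adaptive weight tuner is meant to down-weight.</p></div>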
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Zero-shot Inference on Fashion Domain</head><p>Fashion-Bench. To study TUNA on OOD data from another specific domain, we further collect data from the FashionGen <ref type="bibr">(Rostamzadeh et al., 2018)</ref> validation set and create a benchmark to measure the model's instruction-following capability in the fashion domain, similar to LLaVA-Bench. Following LLaVA, we collect a set of 24 images from FashionGen, with one question associated with each image. Each question belongs to one of three types: (1) Conversation. We design a conversation between the assistant and a person asking questions about the product, including the product brands, categories, materials, etc. Only questions that have definite answers are considered. E.g., What is the brand of this product?</p><p>(2) Detailed Description. We ask the assistant to give a comprehensive and detailed description of the given product. E.g., Please describe the product in this image in detail.</p><p>(3) Complex Reasoning. The above two types focus on the visual content itself, based on which we further create reasoning questions. E.g., What occasions is this clothing suitable for?</p><p>Evaluation. We follow LLaVA to perform GPT-assisted evaluation. After obtaining the responses from the models, we feed the question, the ground-truth text information, and the generated responses to a judge (GPT-4). The judge evaluates the quality of the generated responses and gives an overall score on a scale of 1 to 9. We report the relative scores w.r.t. the GPT-4 model that has the textual ground truth description as input. Details are available in Appendix D.</p>
<table>
<row role="label"><cell>Model</cell><cell>Average</cell><cell>Reasoning</cell><cell>Conversation</cell><cell>Detail</cell></row>
<row><cell>LLaVA-v1.5-7B</cell><cell>57.9</cell><cell>73.2</cell><cell>62.8</cell><cell>55.4</cell></row>
<row><cell>LLaVA + sentence-level RAG</cell><cell>59.6</cell><cell>74.4</cell><cell>64.1</cell><cell>57.8</cell></row>
<row><cell>Ours</cell><cell>68.0</cell><cell>78.9</cell><cell>74.4</cell><cell>65.9</cell></row>
</table>
<p>Results. We use a combination of fashion data as our retrieval datastore, including the Fashion-Gen <ref type="bibr">(Rostamzadeh et al., 2018)</ref> training set, Fashion200k <ref type="bibr">(Han et al., 2017)</ref> and PolyvoreOutfits <ref type="bibr">(Vasileva et al., 2018)</ref>, resulting in a total of 546.5K image-text pairs. We extract the tags of a product from its captions. Results in Tab. 8 demonstrate the effectiveness of TUNA.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>[Figure: input image and retrieved images.]</p><p>Question: Please describe this product in detail, including its brand, category, etc.</p><p>LLaVA-1.5: The brand is called "H&amp;M." The product is a pair of black and white pants with a logo on the side. The design is casual and comfortable, making them suitable for everyday wear. The pants are made of a blend of materials, providing both style and functionality. The logo on the side adds a touch of personality and brand recognition to the pants.</p><p>Ours: The product is a pair of black and white pants with a rubber and terry cloth applique. The pants are designed with a three-pocket styling and an elasticized waistband with a drawstring closure at the interior. The brand is KTZ, and the pants are made of 100% cotton. The design and composition of the pants suggest a casual and comfortable style, making them suitable for various occasions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this paper, we discussed three challenges for MLLMs: (1) mention of non-existent objects, (2) neglect of visual details and (3) failure to identify novel objects and entities, along with one of the potential causes: the bottleneck in image-to-text translation. To alleviate these problems, we introduced TUNA, a tag-grounded visual instruction tuning framework with retrieval augmentation, which achieves competitive performance on 12 VQA and multimodal benchmarks compared to baselines with the same LLM and finetuning data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations</head><p>Being lightweight and effective, our model could easily be improved further with simple modifications to overcome existing limitations. Our model is bottlenecked by the capability of CLIP <ref type="bibr">(Radford et al., 2021)</ref>, which can affect its performance in two ways. First, the quality of the retrieved images depends heavily on it. As we use the tags associated with the retrieved images as additional information, the more relevant the images we retrieve, the more relevant the tags we obtain. Second, our adaptive weight tuner also relies on the knowledge of CLIP. For instance, even if we obtain a highly relevant tag, e.g., "Diamond Head", from the retrieved similar images, if image-text pairs containing "Diamond Head" do not exist in the 400M pre-training data of CLIP, CLIP cannot effectively align the text embedding of "Diamond Head" with a photo of Diamond Head. Consequently, a low weight would be assigned to the tag "Diamond Head" in our weight tuner, even though it is the ground truth. Fortunately, in most cases CLIP is capable of handling it. If not, we can easily replace CLIP with a more powerful vision-language model. Our current design of the retriever is also simple: we retrieve images regardless of the language instruction. A solution could be using Qformer <ref type="bibr">(Li et al., 2023d)</ref>, where instruction-aware visual features could be used for retrieval. We leave these for future work.</p></div>
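<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the dependence of tag weights on CLIP alignment concrete, here is a minimal sketch of relevance-based tag weighting. It is only an assumption about how such a tuner might behave, not the module used in TUNA: clipping the cosine similarity to the range 0 to 1, the use_tuner switch, and the 2-dimensional embeddings are all hypothetical. The only behaviour taken from the text is that, without the tuner, every tag receives the maximum weight.</p><ab><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


MAX_WEIGHT = 1.0  # hypothetical maximum weight


def tag_weights(image_emb, tag_embs, use_tuner=True):
    """Assign one weight per tag from image-tag similarity.

    With use_tuner=False every tag gets MAX_WEIGHT, mirroring the
    "w/o tuner" ablation. With use_tuner=True the weight is the cosine
    similarity clipped to [0, MAX_WEIGHT] (an assumed scoring rule).
    """
    if not use_tuner:
        return [MAX_WEIGHT] * len(tag_embs)
    return [min(MAX_WEIGHT, max(0.0, cosine(image_emb, t))) for t in tag_embs]


# Hypothetical embeddings: the image is close to "mangosteen", far from "durian".
image = [1.0, 0.0]
tags = {"mangosteen": [0.9, 0.1], "durian": [0.1, 0.9]}

weights = dict(zip(tags, tag_weights(image, list(tags.values()))))
flat = dict(zip(tags, tag_weights(image, list(tags.values()), use_tuner=False)))
print(weights)
print(flat)
```
]]></ab><p>Under this scoring rule, a correct tag whose text embedding is poorly aligned with the image (the "Diamond Head" case above) would receive a low weight for the same reason "durian" does here.</p></div>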
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Extended Related Works</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 Retrieval-Augmented Multimodal Learning</head><p>We are distinct from existing works on retrieval-augmented multimodal learning <ref type="bibr">(Ramos et al., 2023b,a;</ref><ref type="bibr">Yang et al., 2023;</ref><ref type="bibr">Hu et al., 2023;</ref><ref type="bibr">Lin et al., 2024;</ref><ref type="bibr">Li et al., 2023c)</ref> in that we are motivated by the object-oriented challenges in visual instruction tuning, which leads to notable differences in (1) target task, (2) motivations, (3) retrieved knowledge and (4) usage of additional information.</p><p>Most of the existing works above focus on image captioning, where short captions (usually one or two sentences) are generated given an input image. In our case, the model is asked to follow the given instruction, infer from the given image, and often provide a long and detailed response. The difference in tasks therefore leads to different challenges, so the motivation for using retrieval augmentation is also distinct. Existing models exploit retrieved captions for the general purpose of providing related content to help caption the current image (e.g., to better organize the language, or to provide additional knowledge on image content or context). In our scenario, the retrieved tags aim to provide rich object-aware information to enhance the attention to object details and to help with object or entity identification. Moreover, the capability of performing tag-grounded generation is enabled during our visual instruction tuning. In addition, we have carefully designed novel modules that enrich the representation of retrieved tags and adaptively reallocate the attention to them based on their relevance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Multimodal Learning with Tags</head><p>We are distinct from existing works <ref type="bibr">(Huang et al., 2023;</ref><ref type="bibr">Zhou et al., 2020;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Hu et al., 2021;</ref><ref type="bibr">Huang et al., 2022)</ref> that introduce object tags as anchor points to help the learning of semantic alignments between images and texts in (1) substantially different objectives, (2) the type of tags used and (3) how they are used.</p><p>Existing works <ref type="bibr">(Huang et al., 2023;</ref><ref type="bibr">Zhou et al., 2020;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Hu et al., 2021;</ref><ref type="bibr">Huang et al., 2022)</ref> use tags for the representation learning of semantic alignments between images and texts. For instance, OSCAR <ref type="bibr">(Li et al., 2020)</ref> proposes to use object tags to align the object-region features in the pre-trained linguistic semantic space. Wu et al. <ref type="bibr">(Wu et al., 2016)</ref> utilize solely the predicted object tags as input to an LSTM for image captioning, whereas You et al. <ref type="bibr">(You et al., 2016)</ref> incorporate both tags and region features. In contrast, Zhou et al. <ref type="bibr">(Zhou et al., 2020)</ref> augment region features with the object prediction probability vector, leveraging salient regions identified by object detectors, to enrich the visual input for pre-training. In our case, object-oriented tags are used as groundings to provide additional information on the given input image, thereby alleviating the neglect of object details and the failure to identify novel objects or entities. Besides, the capability of tag-grounded instruction-following in our model is also unique. The large number of annotation-free tags we have (around 3.2M) also makes our work distinct from the above. 
As we want to inform our model of more relevant object-oriented knowledge, such as object names and attributes, while ignoring less relevant tags, we also design new modules towards this end.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Continual Learning of Multimodal Large Language Models</head><p>Continual learning aims to continuously learn a model from new data in different manners, such as class-incremental <ref type="bibr">(Qi et al., 2023)</ref>, data-incremental <ref type="bibr">(Sheu et al., 2022;</ref><ref type="bibr">Hua et al., 2020)</ref> and domain-incremental <ref type="bibr">(Qi et al., 2024b;</ref><ref type="bibr">Zhu et al., 2023b</ref><ref type="bibr">, 2024)</ref>. Zhai et al. <ref type="bibr">(Zhai et al., 2023)</ref> study the continual learning of multimodal large language models in the context of object classification. They demonstrate that popular finetuned open-source MLLMs, such as LLaVA <ref type="bibr">(Liu et al., 2024)</ref>, exhibit degraded performance compared to their pretrained frozen vision encoders, such as CLIP <ref type="bibr">(Radford et al., 2021</ref>). This is an example of the problem caused by the misalignment between the CLIP embeddings of the input image and the LLM text embeddings, as we illustrated in the Introduction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Tag Mining</head><p>To mine tags from texts, we parse each caption into a set of tags with a combination of the FACTUAL scene graph parser <ref type="bibr">(Li et al., 2023f)</ref> and Named Entity Recognition (NER) with spaCy, yielding 3M tags extracted from 15M captions in CC3M <ref type="bibr">(Sharma et al., 2018)</ref> and CC12M <ref type="bibr">(Changpinyo et al., 2021)</ref>. We show several examples in    Given that the FACTUAL scene graph parser <ref type="bibr">(Li et al., 2023f)</ref> is built on a large language model, there is a slight probability that it may produce nonsensical lengthy sequences. We employ a filtering mechanism to exclude tags exceeding 30 characters in length.</p></div>
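<div xmlns="http://www.tei-c.org/ns/1.0"><p>The length filter described above can be sketched in a few lines. The 30-character threshold is taken from the text; the function name, the whitespace stripping and the order-preserving de-duplication are assumptions added for the example, and the sample tags are made up.</p><ab><![CDATA[
```python
MAX_TAG_CHARS = 30  # threshold stated in the text


def filter_tags(tags, max_chars=MAX_TAG_CHARS):
    """Drop empty tags and tags longer than max_chars.

    Stripping whitespace and removing duplicates (keeping the first
    occurrence) are extra steps assumed for this sketch.
    """
    kept = []
    for tag in tags:
        tag = tag.strip()
        if tag and len(tag) <= max_chars and tag not in kept:
            kept.append(tag)
    return kept


# Hypothetical parser output for one caption.
raw_tags = [
    "Diamond Head",
    "durian",
    "a very long nonsensical sequence produced by the parser",
    " durian ",
]
clean_tags = filter_tags(raw_tags)
print(clean_tags)
```
]]></ab></div>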
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Training Details</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Datasets</head><p>LLaVA-665K <ref type="bibr">(Liu et al., 2023a)</ref> is collected and built with a variety of datasets, containing VQA, OCR, region-level VQA, visual conversation and language conversation data. In ShareGPT4V <ref type="bibr">(Chen et al., 2023c)</ref>, the supervised fine-tuning captions were collected from GPT4-Vision. Following Chen et al. <ref type="bibr">(Chen et al., 2023c)</ref>, a corresponding portion of detailed captions in the Supervised Fine-Tuning (SFT) datasets (i.e., LLaVA-665K) is replaced with a selection from the 100K GPT4-Vision-generated captions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Hyperparameter</head><p>We follow the hyperparameter setting in LLaVA-1.5 <ref type="bibr">(You et al., 2016)</ref>. Details are summarized in Tab. 9.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Zero-Shot Inference on Fashion Data</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1 Fashion-Bench</head><p>To explore the effectiveness of TUNA on OOD data from another specific domain, we further collect data from the FashionGen <ref type="bibr">(Rostamzadeh et al., 2018)</ref> validation set and create a benchmark to measure the model's instruction-following capability in the fashion domain. Following LLaVA <ref type="bibr">(Liu et al., 2024)</ref>, we leverage GPT-4 to measure the quality of generated responses. Specifically, we create triplets consisting of an image, ground-truth textual descriptions, and a question. The candidate models (e.g., TUNA, LLaVA) predict the answers based on the question and the image. To provide an approximate upper bound, we build a reference prediction based on the question and the ground-truth textual descriptions, using the text-only GPT-4, following Liu et al. <ref type="bibr">(Liu et al., 2024)</ref>. After obtaining the responses from both models, we feed the question, the visual information (in the format of textual descriptions), and the generated responses from both assistants to the judge (i.e., text-only GPT-4). The text-only GPT-4 evaluates the helpfulness, relevance, accuracy, and level of detail of the responses from the assistants, and gives an overall score on a scale of 1 to 9, where a higher score indicates better overall performance. We report relative scores w.r.t. the text-only GPT-4 model that uses the textual ground truth description as visual input.</p><p>Similar to LLaVA-Bench (In-the-Wild) <ref type="bibr">(Liu et al., 2024)</ref>, we also collect a set of 24 images from the FashionGen <ref type="bibr">(Rostamzadeh et al., 2018)</ref> validation set, with one question associated with each image. Each question belongs to one of three types:</p><p>1. Conversation. We design a conversation between the assistant and a person asking questions about the product. A diverse set of questions is asked about the content of the image, including the product brands, categories, materials, etc. Only questions that have definite answers are considered. E.g., What is the brand of this product?</p><p>2. Detailed Description. We ask the assistant to give a comprehensive and detailed description of the given product. E.g., Please describe the product in this image in detail.</p><p>3. Complex Reasoning. The above two types focus on the visual content itself, based on which we further create reasoning questions. E.g., What occasions is this clothing suitable for?</p><p>We use a combination of fashion data as our retrieval datastore, including the Fashion-Gen <ref type="bibr">(Rostamzadeh et al., 2018)</ref> training set, Fashion200k <ref type="bibr">(Han et al., 2017)</ref> and PolyvoreOutfits <ref type="bibr">(Vasileva et al., 2018)</ref>, resulting in a total of 546.5K image-text pairs. To obtain the tags of a product, we extract them from the caption or the associated product specifications (e.g., brand) of the product.</p><p>Results in Tab. 10 demonstrate the effectiveness of TUNA, especially on 'Conversation' and 'Detail', where retrieved tags on product specifications help to identify the related details of the input product. Examples are available in Fig. <ref type="figure">9</ref> and Fig. <ref type="figure">10</ref>.</p></div>
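<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relative score reported for Fashion-Bench can be illustrated with a small calculation. This is a sketch under assumptions: the text only states that scores are reported relative to the text-only GPT-4 reference, so aggregating as the ratio of mean judge scores (times 100) is an assumed convention, and the per-question scores below are invented for the example.</p><ab><![CDATA[
```python
def relative_score(model_scores, reference_scores):
    """Return 100 * mean(model_scores) / mean(reference_scores).

    model_scores: judge scores for the candidate model, one per question.
    reference_scores: judge scores for the text-only reference answers
    to the same questions.
    """
    if not model_scores or len(model_scores) != len(reference_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    model_mean = sum(model_scores) / len(model_scores)
    reference_mean = sum(reference_scores) / len(reference_scores)
    return 100.0 * model_mean / reference_mean


# Invented judge scores for three questions.
candidate = [6, 7, 5]
reference = [8, 9, 8]
score = relative_score(candidate, reference)
print(round(score, 1))
```
]]></ab><p>A relative score of 100 would mean the candidate is judged as good as the reference that sees the ground-truth description.</p></div>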
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E More Examples</head><p>We present more examples with TUNA and LLaVA-1.5 in Fig. <ref type="figure">11</ref> and Fig. <ref type="figure">10</ref>. In Fig. <ref type="figure">11</ref>, we provide Out-of-Distribution (OOD) images of real-world products or television works, and ask TUNA and LLaVA-1.5 to answer the question. In Fig. <ref type="figure">10</ref>, we provide OOD images in the fashion domain, and ask the models to answer the question.</p><p>When provided with OOD images, where novel objects or entities often appear, LLaVA-1.5 fails to identify them correctly or precisely due to a limited number of training samples. Although the CLIP vision encoder, which is pre-trained with over 400M samples, can effectively extract their visual features, the multimodal connector cannot effectively map them to text embeddings input to the LLM. In contrast, TUNA is effective in identifying unseen objects or entities, as the input OOD image is directly mapped to a set of retrieved tags from a large-scale external datastore, which has a better coverage of OOD data.</p><p>In the examples in Fig. <ref type="figure">10</ref>, specific in-domain knowledge (here, of the fashion domain) is required to give a detailed and precise description of the given product, such as its brand, design, or composition (material). LLaVA fails to identify these correctly or to respond with detailed descriptions of them.</p><p>For instance, in the example in Fig. <ref type="figure">9</ref>, the only useful information LLaVA-1.5 gives about the product itself is "a black jacket with white polka dots", and it fails to precisely describe it as a "blazer". Moreover, LLaVA-1.5 does not mention its design and brand even when we explicitly ask for the brand of this product. 
In contrast, TUNA precisely describes its design details, style and brand, benefiting from the retrieved products, which are similar to the input product in design, brand, category or style. TUNA can effectively refer to the retrieved tags and learn from the useful ones with our tag encoder.</p><p>The examples in Fig. <ref type="figure">11</ref> are similar: TUNA correctly identifies the novel object in the input image with retrieved knowledge, whereas LLaVA-1.5 fails to identify the model of the Leica camera, the Porsche car, and the names of the character and anime in the input images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Full Experiment Results</head><p>We show the full results on POPE in Tab. 11.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G Benchmarks</head><p>We compare TUNA with SoTA methods on 12 benchmarks, including five VQA benchmarks: VQA v2 <ref type="bibr">(Goyal et al., 2017)</ref>, GQA (Hudson and Manning, 2019), VizWiz <ref type="bibr">(Gurari et al., 2018)</ref>, ScienceQA-Image (SQA I ) <ref type="bibr">(Lu et al., 2022)</ref>, TextVQA (VQA T ) <ref type="bibr">(Singh et al., 2019)</ref>, and seven more recent multimodal benchmarks designed for LLMs: POPE <ref type="bibr">(Li et al., 2023e)</ref>, MME <ref type="bibr">(Fu et al., 2023)</ref>, MMBench (MMB) <ref type="bibr">(Liu et al., 2023b)</ref>, MMBench-Chinese (MMB CN ) <ref type="bibr">(Liu et al., 2023b)</ref>, SEED <ref type="bibr">(Li et al., 2023b)</ref>, LLaVA-in-the-Wild (LLaVA W ) <ref type="bibr">(Liu et al., 2023a)</ref>, and MM-Vet <ref type="bibr">(Yu et al., 2023)</ref>.</p><p>VQA v2 <ref type="bibr">(Goyal et al., 2017)</ref> and VizWiz <ref type="bibr">(Gurari et al., 2018)</ref> are benchmarks for traditional Visual Question Answering (VQA) tasks. MME <ref type="bibr">(Fu et al., 2023)</ref> evaluates LLMs' perception and cognition capabilities through a wide range of carefully crafted questions across 14 sub-tasks. 
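MMBench (MMB) and MMBench-Chinese (MMB CN ) <ref type="bibr">(Liu et al., 2023b)</ref> benchmarks manually design questions to evaluate the LLM's visual reasoning and perception abilities in English and Chinese, respectively. SEED <ref type="bibr">(Li et al., 2023b</ref>) generated a dataset comprising around 19K questions with images and videos with GPT-4 assistance.</p>
<table>
<row role="label"><cell>Datasets</cell><cell>Metrics</cell><cell>Ours</cell><cell>Ferret</cell><cell>Shikra</cell><cell>InstructBLIP</cell><cell>MiniGPT4</cell><cell>LLaVA</cell><cell>MM-GPT</cell><cell>mPLUG-Owl</cell></row>
<row><cell rows="4">Random</cell><cell>Accuracy (&#8593;)</cell><cell>91.00</cell><cell>90.24</cell><cell>86.90</cell><cell>88.57</cell><cell>79.67</cell><cell>88.00</cell><cell>50.10</cell><cell>53.97</cell></row>
<row><cell>Precision (&#8593;)</cell><cell>98.05</cell><cell>97.72</cell><cell>94.40</cell><cell>84.09</cell><cell>78.24</cell><cell>97.44</cell><cell>50.05</cell><cell>52.07</cell></row>
<row><cell>Recall (&#8593;)</cell><cell>84.10</cell><cell>83.00</cell><cell>79.26</cell><cell>95.13</cell><cell>82.20</cell><cell>78.80</cell><cell>100.00</cell><cell>99.60</cell></row>
<row><cell>F1 Score (&#8593;)</cell><cell>90.93</cell><cell>89.76</cell><cell>86.19</cell><cell>89.27</cell><cell>80.17</cell><cell>87.13</cell><cell>66.71</cell><cell>68.39</cell></row>
<row><cell rows="4">Popular</cell><cell>Accuracy (&#8593;)</cell><cell>90.16</cell><cell>84.90</cell><cell>83.97</cell><cell>82.77</cell><cell>69.73</cell><cell>87.43</cell><cell>50.00</cell><cell>50.90</cell></row>
<row><cell>Precision (&#8593;)</cell><cell>95.46</cell><cell>88.24</cell><cell>87.55</cell><cell>76.27</cell><cell>65.86</cell><cell>95.24</cell><cell>50.00</cell><cell>50.46</cell></row>
<row><cell>Recall (&#8593;)</cell><cell>84.20</cell><cell>80.53</cell><cell>79.20</cell><cell>95.13</cell><cell>81.93</cell><cell>78.80</cell><cell>100.00</cell><cell>99.40</cell></row>
<row><cell>F1 Score (&#8593;)</cell><cell>90.56</cell><cell>84.21</cell><cell>83.16</cell><cell>84.66</cell><cell>73.02</cell><cell>86.24</cell><cell>66.67</cell><cell>66.94</cell></row>
<row><cell rows="4">Adversarial</cell><cell>Accuracy (&#8593;)</cell><cell>88.43</cell><cell>82.36</cell><cell>83.10</cell><cell>72.10</cell><cell>65.17</cell><cell>85.50</cell><cell>50.00</cell><cell>50.67</cell></row>
<row><cell>Precision (&#8593;)</cell><cell>91.99</cell><cell>83.60</cell><cell>85.60</cell><cell>65.13</cell><cell>61.19</cell><cell>90.99</cell><cell>50.00</cell><cell>50.34</cell></row>
<row><cell>Recall (&#8593;)</cell><cell>84.20</cell><cell>80.53</cell><cell>79.60</cell><cell>95.13</cell><cell>82.93</cell><cell>78.80</cell><cell>100.00</cell><cell>99.33</cell></row>
<row><cell>F1 Score (&#8593;)</cell><cell>87.63</cell><cell>82.00</cell><cell>82.49</cell><cell>77.32</cell><cell>70.42</cell><cell>84.45</cell><cell>66.67</cell><cell>66.82</cell></row>
<row><cell cols="2">Average F1</cell><cell>89.50</cell><cell>85.32</cell><cell>83.94</cell><cell>83.75</cell><cell>74.53</cell><cell>85.94</cell><cell>66.68</cell><cell>67.38</cell></row>
</table>
<p>Table <ref type="table">11</ref>: Results on POPE. We outperform competing baselines including Ferret <ref type="bibr">(You et al., 2023)</ref>, which is finetuned on grounding and referring data.</p></div>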
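<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers less familiar with the metrics in the POPE table, the sketch below shows the standard definitions of accuracy, precision, recall and F1 for yes/no object-existence questions. The function names and the confusion-matrix counts are hypothetical; the precision and recall fed to f1_from_percentages are Ferret's values on the Random split as listed in the table.</p><ab><![CDATA[
```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    A "positive" is a "yes, the object exists" answer.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def f1_from_percentages(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a tiny evaluation set.
toy = binary_metrics(tp=8, fp=2, tn=8, fn=2)

# Precision and recall of Ferret on the Random split, as listed in the table.
ferret_random_f1 = round(f1_from_percentages(97.72, 83.00), 2)
print(toy)
print(ferret_random_f1)
```
]]></ab><p>High precision with lower recall, as seen for several models in the table, corresponds to a model that rarely claims a non-existent object but misses some objects that are present.</p></div>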
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H Analysis on Choices of Datastores</head><p>From Tab. 6 and the previous analysis we know that the quality of retrieved tags is critical. Therefore, the datastore from which the images (with their corresponding tags) are retrieved is crucial. Here we study how different choices of datastores affect the model performance. In the default setting, we use a combination of CC12M <ref type="bibr">(Changpinyo et al., 2021)</ref>, CC3M <ref type="bibr">(Sharma et al., 2018)</ref> and the COCO training set <ref type="bibr">(Lin et al., 2014)</ref>. Two of the three retrieval datasets, CC3M and the COCO training set, overlap with the LLaVA training data. This is a frequent scenario in retrieval-augmented generation, where a datastore with full or partial overlap with the training data is common <ref type="bibr">(Ramos et al., 2023b,a;</ref><ref type="bibr">Yang et al., 2023;</ref><ref type="bibr">Hu et al., 2023;</ref><ref type="bibr">Lin et al., 2024;</ref><ref type="bibr">Li et al., 2023c)</ref>. While CC12M and CC3M differ in size but are similar in content style, COCO differs from them in both size and content. CC12M and CC3M consist of web image-text pairs, where the variance in caption quality and style is more significant. In COCO, captions are human-written and the language style is more coherent, usually a short and plain description of the image. Consequently, tags extracted from COCO captions are often commonly used words and phrases and are very general, for instance, "boy", "girl", "plane" and "train". Such tags can indicate the existence of objects in the image, which might help to alleviate the mention of non-existent objects. However, they hardly help to improve object or entity identification, as these commonly seen phrases are very likely already included in the LLaVA training data and no new retrieval mappings can be established. 
On the contrary, CC12M and CC3M provide an ocean of novel objects and entities, which could greatly improve the image-to-text translation process with additional new retrieval mappings built from them. We are curious to see how datastore size and datastore style influence our model performance. In addition to the default setting, we perform the tag-grounded instruction tuning with different datastores and use each of them for retrieval during inference. Results are available in Tab. 7.</p><p>It is not surprising that the default setting, which has the largest datastore, consistently outperforms the other settings. In most cases, the setting with CC12M is the second best while the one with the COCO training set performs worst, except on POPE. This is because POPE is built with the COCO validation set, which shares the style of the COCO training set. On the other multimodal benchmarks, the improvements with the COCO training set are smaller than those with CC12M and CC3M. In particular, on the LLaVA-in-the-Wild (LLaVA-W) benchmark, where none of the test images overlap with the COCO training and validation sets, using the COCO training set as the datastore does not help at all.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://llava-vl.github.io</p></note>
		</body>
		</text>
</TEI>
