A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match the description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which considers not only the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval tasks as well.
Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries
Visual content memorability has intrigued the scientific community for decades, with applications spanning from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the high cost of collecting memorability annotations from humans, which constrains both the diversity and scalability of available datasets. Existing datasets typically provide only aggregate memorability scores for visual content, overlooking the nuanced signals embedded in natural, open-ended recall descriptions. In this work, we introduce the first large-scale, unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos paired with descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that this unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. In addition, we employ a contrastive training strategy to create the first model capable of multimodal ToT retrieval. Our dataset and models present a new research direction and provide scalable tools for advancing work on visual content memorability.
- Award ID(s): 2234195
- PAR ID: 10665820
- Publisher / Repository: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
- Date Published:
- Format(s): Medium: X
- Location: Tucson, Arizona
- Sponsoring Org: National Science Foundation
More Like this
-
Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a misalignment between image regions and textual descriptions, which stems from CLIP's global alignment objective. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pretraining, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach.
-
Introduction. Generative artificial intelligence tools, like ChatGPT, are an increasingly utilised resource among computational social scientists. Nevertheless, there remains space for improved understanding of the performance of ChatGPT in complex tasks such as classifying and annotating datasets containing nuanced language. Method. In this paper, we measure the performance of GPT-4 on one such task and compare results to human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given rapid changes in technological advancement of large language models. We employ a dataset containing human-annotated comments from YouTube and X. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Analysis. Both quantitative and qualitative evaluations of results demonstrate that while including label definitions in prompts may help performance, overall GPT-4 has difficulty classifying nuanced language. Results. Qualitative analysis reveals four specific findings: 1) cultural euphemisms are too nuanced for GPT-4 to understand, 2) interpreting the type of 'internet speak' found on social media platforms is a challenge, 3) GPT-4 falters in determining who or what is the target of directed attacks (e.g., the content or the user), and 4) the rationale GPT-4 provides is inconsistent in logic. Conclusion. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.
-
The lack of high-quality ground truth datasets to train machine learning (ML) models impedes the potential of artificial intelligence for science (AI for science) to accelerate research. It is infeasible to manually review all relevant scientific publications and extract information from unstructured text to construct ground truth datasets. Scientific information extraction (SIE) from the literature using LLMs is emerging as a powerful approach to automate the creation of these datasets. However, existing LLM-based approaches and benchmarking studies for SIE focus on broad topics such as biomedicine and chemistry. These methods are limited to choice-based SIE tasks where the LLM is instructed to select the correct answer from several options provided in the prompt. Moreover, these tasks focus on extracting information from short and well-formatted text such as a sentence or an abstract. The potential of SIE methods in complex, open-ended tasks is considerably under-explored. In this study, we used a domain that has been virtually ignored in SIE, namely virology, to address these research gaps. We first designed a unique, open-ended SIE task of extracting mutations in a given virus that modify its interaction with the host. Next, we developed a new, multi-step retrieval-augmented generation (RAG) framework called VILLA for SIE. In parallel, we curated a novel dataset of 629 mutations in ten influenza A virus proteins obtained from 463 scientific publications to serve as ground truth for our mutation extraction task. Finally, we demonstrated VILLA's superior performance using a novel and comprehensive evaluation and comparison with vanilla RAG and other state-of-the-art RAG- and agent-based tools involving both open and closed LLMs for SIE.
-
Systems for knowledge-intensive tasks such as open-domain question answering (QA) usually consist of two stages: efficient retrieval of relevant documents from a large corpus and detailed reading of the selected documents. This is usually done through two separate models, a retriever that encodes the query and finds nearest neighbors, and a reader based on Transformers. These two components are usually modeled separately, which necessitates a cumbersome implementation and is awkward to optimize in an end-to-end fashion. In this paper, we revisit this design and eschew the separate architecture and training in favor of a single Transformer that performs retrieval as attention (RAA), and end-to-end training solely based on supervision from the end QA task. We demonstrate for the first time that an end-to-end trained single Transformer can achieve both competitive retrieval and QA performance on in-domain datasets, matching or even slightly outperforming state-of-the-art dense retrievers and readers. Moreover, end-to-end adaptation of our model significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings, making our model a simple and adaptable end-to-end solution for knowledge-intensive tasks.

