NSF PAR Search | NSF Public Access Repository

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

https://doi.org/10.1007/978-3-031-73404-5_20

Chatterjee, Agneet; Luo, Yiran; Gokhale, Tejas; Yang, Yezhou; Baral, Chitta (October 2024, Springer Nature Switzerland)

Full Text Available
On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation

https://doi.org/10.1109/CVPR52733.2024.00270

Chatterjee, Agneet; Gokhale, Tejas; Baral, Chitta; Yang, Yezhou (June 2024, IEEE)

Full Text Available
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations

https://doi.org/10.1109/CVPR52733.2024.00866

Patel, Maitreya; Kim, Changhoon; Cheng, Sheng; Baral, Chitta; Yang, Yezhou (June 2024, IEEE)

Full Text Available
Getting it Right: Improving Spatial Consistency in Text-to-Image Models

https://doi.org/10.1007/978-3-031-72670-5_12

Chatterjee, Agneet; Stan, Gabriela Ben Melech; Aflalo, Estelle; Paul, Sayak; Ghosh, Dhruba; Gokhale, Tejas; Schmidt, Ludwig; Hajishirzi, Hannaneh; Lal, Vasudev; Baral, Chitta; et al (September 2024, Springer Nature Switzerland)

Full Text Available
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models

https://doi.org/10.1609/aaai.v38i13.29371

Patel, Maitreya; Gokhale, Tejas; Baral, Chitta; Yang, Yezhou (March 2024, Proceedings of the AAAI Conference on Artificial Intelligence)

The ability to understand visual concepts and replicate and compose these concepts from images is a central goal for computer vision. Recent advances in text-to-image (T2I) models have led to high-definition, realistic image generation by learning from large databases of images and their descriptions. However, the evaluation of T2I models has focused on photorealism and limited qualitative measures of visual understanding. To quantify the ability of T2I models in learning and synthesizing novel visual concepts (a.k.a. personalized T2I), we introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts. Along with the dataset, we propose an evaluation metric, Concept Confidence Deviation (CCD), that uses the confidence of oracle concept classifiers to measure the alignment between concepts generated by T2I generators and concepts contained in target images. We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions. Our human study shows that CCD is highly correlated with human understanding of concepts. Our results point to a trade-off between learning the concepts and preserving the compositionality, which existing approaches struggle to overcome. The data, code, and interactive demo are available at: https://conceptbed.github.io/
Full Text Available
Collaborative large language models for automated data extraction in living systematic reviews

https://doi.org/10.1093/jamia/ocae325

Khan, Muhammad Ali; Ayub, Umair; Naqvi, Syed Arsalan Ahmed; Khakwani, Kaneez Zahra Rubab; Sipra, Zaryab bin Riaz; Raina, Ammad; Zhou, Sihan; He, Huan; Saeidi, Amir; Hasan, Bashar; et al (January 2025, Journal of the American Medical Informatics Association)

Objective: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process. Materials and Methods: A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and held-out test sets (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, ie, the total number of correct responses divided by the total number of responses, was computed to assess performance. Results: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76. Discussion: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy. Conclusion: Large language models, when simulated in a collaborative, 2-reviewer workflow, can extract data with reasonable performance, enabling truly “living” systematic reviews.
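The concordance and accuracy definitions in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, variable names, and toy values below are hypothetical, and the cross-critique step (which requires calling the LLMs) is left out.

```python
def accuracy(responses, gold):
    """Number of correct responses divided by the total number of responses."""
    if not responses:
        return 0.0
    correct = sum(1 for key, value in responses.items() if gold[key] == value)
    return correct / len(responses)


def split_by_concordance(resp_a, resp_b):
    """Separate variables where both extractors agree from those where they differ."""
    concordant = {k: v for k, v in resp_a.items() if resp_b.get(k) == v}
    discordant = sorted(k for k in resp_a if k not in concordant)
    return concordant, discordant


# Hypothetical toy data: three extracted variables for a single publication.
gold = {"trial_phase": "3", "sample_size": "120", "primary_outcome": "OS"}
model_a = {"trial_phase": "3", "sample_size": "120", "primary_outcome": "PFS"}
model_b = {"trial_phase": "3", "sample_size": "120", "primary_outcome": "OS"}

concordant, discordant = split_by_concordance(model_a, model_b)
concordant_accuracy = accuracy(concordant, gold)
print(discordant, concordant_accuracy)
```

In a workflow like the one described, the variables listed in `discordant` would be the ones passed to the other model for cross-critique.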
Free, publicly-accessible full text available January 21, 2026
"Len or index or count, anything but v1": Predicting Variable Names in Decompilation Output with Transfer Learning

Pal, Kuntal Kumar; Bajaj, Ati Priya; Banerjee, Pratyay; Dutcher, Audrey; Nakamura, Mutsumi; Basque, Zion Leonahenahe; Gupta, Himanshu; Sawant, Saurabh Arjun; Anantheswaran, Ujjwala; Shoshitaishvili, Yan; et al (May 2024, IEEE Computer Society)

Full Text Available
End-to-end Knowledge Retrieval with Multi-modal Queries

Luo, Man; Fang, Zhiyuan; Gokhale, Tejas; Yang, Yezhou; Baral, Chitta (July 2023, 61st Annual Meeting of the Association for Computational Linguistics)

We investigate knowledge retrieval with multi-modal queries, i.e. queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model “ReViz” that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion without being dependent on intermediate modules such as object detectors or caption generators. We introduce a new pretraining task that is effective for learning knowledge retrieval with multimodal queries and also improves performance on downstream tasks. We demonstrate superior performance in retrieval on two datasets (ReMuQ and OK-VQA) under zero-shot settings as well as further improvements when finetuned on these datasets.
Full Text Available
Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task

Sampat, Shailaja; Banerjee, Pratyay; Yang, Yezhou; Baral, Chitta (December 2022, Findings of EMNLP 2022)

‘Actions’ play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform ‘Reasoning about Actions & Change’ (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. CLEVR_HYP is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine this encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
Full Text Available
Improving Diversity with Adversarially Learned Transformations for Domain Generalization

https://doi.org/10.1109/WACV56688.2023.00051

Gokhale, Tejas; Anirudh, Rushil; Thiagarajan, Jayaraman J.; Kailkhura, Bhavya; Baral, Chitta; Yang, Yezhou (January 2023, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))

Full Text Available
