NSF PAR Search | NSF Public Access Repository


LoCoRe: Image Re-Ranking with Long-Context Sequence Modeling

https://doi.org/10.1109/CVPR52734.2025.00895

Xiao, Zilin; Suma, Pavel; Sachdeva, Ayush; Wang, Hao-Jen; Kordopatis-Zilos, Giorgos; Tolias, Giorgos; Ordonez, Vicente (June 2025, IEEE Computer Vision and Pattern Recognition (CVPR))

Full Text Available
PropTest: Automatic Property Testing for Improved Visual Programming

https://doi.org/10.18653/v1/2024.findings-emnlp.483

Koo, Jaywon; Yang, Ziyan; Cascante-Bonilla, Paola; Ray, Baishakhi; Ordonez, Vicente (November 2024, Findings of the Association for Computational Linguistics)

Full Text Available
ViC-MAE: Self-supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders

Hernandez, Jefferson; Villegas, Ruben; Ordonez, Vicente (September 2024, European Conference on Computer Vision (ECCV), Springer, Cham)

We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global representation obtained by pooling the local features learned under an MAE reconstruction loss, and this representation is then used under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When trained on videos and images from diverse datasets, our method maintains balanced transfer-learning performance between video and image classification benchmarks, finishing a close second to the best supervised method.
Full Text Available
Grounding Language Models for Visual Entity Recognition

Xiao, Z; Gong, M; Cascante-Bonilla, P; Zhang, X; Wu, J; Ordonez, V (September 2024, ECCV 2024. Lecture Notes in Computer Science, vol 15069. Springer, Cham.)

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multimodal Large Language Model by employing retrieval-augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling on queries that require visual reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective, without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits of the recently proposed Oven-Wiki benchmark, with accuracy on the Entity seen split rising from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin, while preserving the ability to transfer effectively to other generic visual question answering benchmarks without further training.
Full Text Available
Grounding Language Models for Visual Entity Recognition

Xiao, Zilin; Gong, Ming; Cascante-Bonilla, Paola; Zhang, Xingyao; Wu, Jie; Ordonez, Vicente (September 2024, European Conference on Computer Vision (ECCV))

Full Text Available
ElasticDiffusion: Training-Free Arbitrary Size Image Generation Through Global-Local Content Separation

https://doi.org/10.1109/CVPR52733.2024.00631

Haji-Ali, Moayed; Balakrishnan, Guha; Ordonez, Vicente (June 2024, IEEE Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
Improved Visual Grounding through Self-Consistent Explanations

https://doi.org/10.1109/CVPR52733.2024.01244

He, Ruozhen; Cascante-Bonilla, Paola; Yang, Ziyan; Berg, Alexander C; Ordonez, Vicente (June 2024, IEEE Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
SCoRD: Subject-Conditional Relation Detection with Text-Augmented Data

https://doi.org/10.1109/WACV57701.2024.00563

Yang, Ziyan; Kafle, Kushal; Lin, Zhe; Cohen, Scott; Ding, Zhihong; Ordonez, Vicente (January 2024, IEEE)

Full Text Available
Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations

https://doi.org/10.1109/CVPR52729.2023.01837

Yang, Ziyan; Kafle, Kushal; Dernoncourt, Franck; Ordonez, Vicente (June 2023, IEEE Conference on Computer Vision and Pattern Recognition)

Full Text Available
