NSF PAR Search | NSF Public Access Repository


BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Ge, Yunhao; Tang, Yihe; Xu, Jiashu; Gokmen, Cem; Li, Chengshu; Ai, Wensi; Martinez, Benjamin Jose; Aydin, Arman; Anvari, Mona; Chakravarthy, Ayush K; et al (June 2024, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Dou, Zi-Yi; Kamath, Aishwarya; Gan, Zhe; Zhang, Pengchuan; Wang, Jianfeng; Li, Linjie; Liu, Zicheng; Liu, Ce; LeCun, Yann; Peng, Nanyun; et al (October 2022, NeurIPS)

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using orders of magnitude more data. Code is released at https://github.com/microsoft/FIBER.
Full Text Available
Missingness Bias in Model Debugging

Jain, Saachi; Salman, Hadi; Wong, Eric; Zhang, Pengchuan; Vineet, Vibhav; Vemprala, Sai; Madry, Aleksander (January 2022, International Conference on Learning Representations)

Full Text Available
Multiscale Invertible Generative Networks for High-Dimensional Bayesian Inference

Zhang, Shumao; Zhang, Pengchuan; Hou, Thomas Y. (January 2021, Proceedings of the 38th International Conference on Machine Learning, PMLR)
Meila, Marina and (Ed.)
We propose a Multiscale Invertible Generative Network (MsIGN) and associated training algorithm that leverages multiscale structure to solve high-dimensional Bayesian inference. To address the curse of dimensionality, MsIGN exploits the low-dimensional nature of the posterior, and generates samples from coarse to fine scale (low to high dimension) by iteratively upsampling and refining samples. MsIGN is trained in a multistage manner to minimize the Jeffreys divergence, which avoids mode dropping in high-dimensional cases. On two high-dimensional Bayesian inverse problems, we show superior performance of MsIGN over previous approaches in posterior approximation and multiple mode capture. On the natural image synthesis task, MsIGN achieves superior performance in bits-per-dimension over baseline models and yields great interpretability of its neurons in intermediate layers.
Full Text Available
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Li, Chunyuan; Liu, Haotian; Li, Harold; Zhang, Pengchuan; Aneja, Jyoti; Yang, Jianwei; Jin, Ping; Hu, Houdong; Liu, Zicheng; Lee, Yong Jae; et al (January 2022, Neural Information Processing Systems (NeurIPS))

Full Text Available
