Search for: All records
Total Resources: 5
Author / Contributor:
- Dou, Zi-Yi (5)
- Anastasopoulos, Antonios (2)
- Gan, Zhe (2)
- Gao, Jianfeng (2)
- Li, Linjie (2)
- Peng, Nanyun (2)
- Wang, Jianfeng (2)
- Wang, Lijuan (2)
- Barman-Adhikari, Anamika (1)
- Behl, Harkirat (1)
- Dai, Xiyang (1)
- Fang, Fei (1)
- Hu, Junjie (1)
- Kamath, Aishwarya (1)
- LeCun, Yann (1)
- Lee, Yong Jae (1)
- Li, Chunyuan (1)
- Liu, Ce (1)
- Liu, Zicheng (1)
- Neubig, Graham (1)
- Dou, Zi-Yi; Kamath, Aishwarya; Gan, Zhe; Zhang, Pengchuan; Wang, Jianfeng; Li, Linjie; Liu, Zicheng; Liu, Ce; LeCun, Yann; Peng, Nanyun; et al. (NeurIPS)
  Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either aim only at VL tasks such as image-text retrieval, visual question answering (VQA), and image captioning that test high-level understanding of images, or target only region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, better capturing multimodal interactions. In addition, unlike previous work that is pre-trained either only on image-text data or only on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both kinds of data efficiently: (i) coarse-grained pre-training based on image-text data, followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, from VQA, image captioning, and retrieval to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use orders of magnitude more data. Code is released at https://github.com/microsoft/FIBER. (A rough sketch of the fusion-in-the-backbone idea follows this results list.)
- Dou, Zi-Yi; Barman-Adhikari, Anamika; Fang, Fei; Yadav, Amulya (Proceedings of the AAAI Conference on Artificial Intelligence)
- Dou, Zi-Yi; Yu, Keyi; Anastasopoulos, Antonios (Proceedings of the Conference on Empirical Methods in Natural Language Processing (Demo Track))
- Dou, Zi-Yi; Hu, Junjie; Anastasopoulos, Antonios; Neubig, Graham (Proceedings of the Conference on Empirical Methods in Natural Language Processing (Demo Track))
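The FIBER abstract above describes pushing fusion into the backbones via inserted cross-attention. The following is a minimal sketch of that idea, not the released FIBER implementation (which is at https://github.com/microsoft/FIBER): a gated cross-attention sublayer is added to an otherwise standard transformer block, so image and text tokens exchange information inside each backbone rather than in a separate fusion stack on top. The module names, dimensions, and zero-initialized gate below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """A standard transformer block with an inserted cross-attention branch.

    Illustrative sketch of fusion-in-the-backbone; not the official FIBER code.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Usual unimodal self-attention sublayer.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inserted cross-attention: queries come from this modality,
        # keys/values from the other modality.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_self = nn.LayerNorm(dim)
        self.norm_cross = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Zero-initialized gate: the inserted fusion path starts as a no-op,
        # so the pretrained unimodal backbone behavior is preserved at the
        # start of training (an assumed detail, used here for illustration).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x:     (batch, len_x, dim) tokens of this modality
        # other: (batch, len_y, dim) tokens of the other modality
        h = self.norm_self(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm_cross(x)
        x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        x = x + self.mlp(self.norm_mlp(x))
        return x

# Usage: image tokens attend to text tokens and vice versa inside each block.
image_tokens = torch.randn(2, 196, 768)  # e.g. patch tokens from a vision backbone
text_tokens = torch.randn(2, 32, 768)    # e.g. subword tokens from a text backbone
block = FusionBlock()
print(block(image_tokens, text_tokens).shape)  # torch.Size([2, 196, 768])
print(block(text_tokens, image_tokens).shape)  # torch.Size([2, 32, 768])
```

For brevity this sketch reuses one block for both directions; a real system would typically keep separate image-side and text-side blocks and stack many of them, consistent with the abstract's description of fusing throughout the backbones.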