Search for: All records
Total Resources: 5
Author / Contributor:
- Dou, Zi-Yi (5)
- Anastasopoulos, Antonios (2)
- Gan, Zhe (2)
- Gao, Jianfeng (2)
- Li, Linjie (2)
- Peng, Nanyun (2)
- Wang, Jianfeng (2)
- Wang, Lijuan (2)
- Barman-Adhikari, Anamika (1)
- Behl, Harkirat (1)
- Dai, Xiyang (1)
- Fang, Fei (1)
- Hu, Junjie (1)
- Kamath, Aishwarya (1)
- LeCun, Yann (1)
- Lee, Yong Jae (1)
- Li, Chunyuan (1)
- Liu, Ce (1)
- Liu, Zicheng (1)
- Neubig, Graham (1)
- Dou, Zi-Yi; Kamath, Aishwarya; Gan, Zhe; Zhang, Pengchuan; Wang, Jianfeng; Li, Linjie; Liu, Zicheng; Liu, Ce; LeCun, Yann; Peng, Nanyun; et al. (NeurIPS)
  Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either aim only at VL tasks such as image-text retrieval, visual question answering (VQA), and image captioning that test high-level understanding of images, or target only region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, better capturing multimodal interactions. In addition, unlike previous work that is pre-trained either only on image-text data or only on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both kinds of data efficiently: (i) coarse-grained pre-training based on image-text data, followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, from VQA, image captioning, and retrieval to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods that use orders of magnitude more data. Code is released at https://github.com/microsoft/FIBER. (A rough sketch of the fusion-in-the-backbone idea follows this results list.)
- Dou, Zi-Yi; Barman-Adhikari, Anamika; Fang, Fei; Yadav, Amulya (Proceedings of the AAAI Conference on Artificial Intelligence)
- Dou, Zi-Yi; Yu, Keyi; Anastasopoulos, Antonios (Proceedings of the Conference on Empirical Methods in Natural Language Processing (Demo Track))
- Dou, Zi-Yi; Hu, Junjie; Anastasopoulos, Antonios; Neubig, Graham (Proceedings of the Conference on Empirical Methods in Natural Language Processing (Demo Track))
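The FIBER abstract above describes pushing fusion into the backbones via inserted cross-attention. The following is a minimal sketch of that idea, not the released FIBER implementation (which is at https://github.com/microsoft/FIBER): a gated cross-attention sublayer is added to an otherwise standard transformer block, so image and text tokens exchange information inside each backbone rather than in a separate fusion stack on top. The module names, dimensions, and zero-initialized gate below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """A standard transformer block with an inserted cross-attention branch.

    Illustrative sketch of fusion-in-the-backbone; not the official FIBER code.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # Usual unimodal self-attention sublayer.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inserted cross-attention: queries come from this modality,
        # keys/values from the other modality.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_self = nn.LayerNorm(dim)
        self.norm_cross = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Zero-initialized gate: the inserted fusion path starts as a no-op,
        # so the pretrained unimodal backbone behavior is preserved at the
        # start of training (an assumed detail, used here for illustration).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x:     (batch, len_x, dim) tokens of this modality
        # other: (batch, len_y, dim) tokens of the other modality
        h = self.norm_self(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm_cross(x)
        x = x + self.gate * self.cross_attn(h, other, other, need_weights=False)[0]
        x = x + self.mlp(self.norm_mlp(x))
        return x

# Usage: image tokens attend to text tokens and vice versa inside each block.
image_tokens = torch.randn(2, 196, 768)  # e.g. patch tokens from a vision backbone
text_tokens = torch.randn(2, 32, 768)    # e.g. subword tokens from a text backbone
block = FusionBlock()
print(block(image_tokens, text_tokens).shape)  # torch.Size([2, 196, 768])
print(block(text_tokens, image_tokens).shape)  # torch.Size([2, 32, 768])
```

For brevity this sketch reuses one block for both directions; a real system would typically keep separate image-side and text-side blocks and stack many of them, consistent with the abstract's description of fusing throughout the backbones.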