NSF PAR Search | NSF Public Access Repository

TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

https://doi.org/10.1109/WACV56688.2023.00485

Chu, Peng; Wang, Jiang; You, Quanzeng; Ling, Haibin; Liu, Zicheng (January 2023, IEEE/CVF Winter Conference on Applications of Computer Vision)

Full Text Available
Injecting Semantic Concepts into End-to-End Image Captioning

https://doi.org/10.1109/CVPR52688.2022.01748

Fang, Zhiyuan; Wang, Jianfeng; Hu, Xiaowei; Liang, Lin; Gan, Zhe; Wang, Lijuan; Yang, Yezhou; Liu, Zicheng (June 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Dou, Zi-Yi; Kamath, Aishwarya; Gan, Zhe; Zhang, Pengchuan; Wang, Jianfeng; Li, Linjie; Liu, Zicheng; Liu, Ce; LeCun, Yann; Peng, Nanyun; et al. (October 2022, NeurIPS)

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER.
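
The fusion-in-the-backbone idea in the abstract above can be illustrated with a minimal NumPy sketch. This is not the released FIBER code: the function names, the single-head attention without learned projections, and the scalar `gate` on the cross-modal term are all assumptions made for illustration. It only shows cross-attention being inserted inside each modality's backbone block rather than in separate fusion layers stacked on top.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def fused_block(x, other, gate):
    """One backbone block with cross-attention inserted after self-attention.

    x:     tokens of this modality, shape (n, d)
    other: tokens of the other modality, shape (m, d)
    gate:  scalar weight on the cross-modal term (hypothetical parameter)
    """
    x = x + attention(x, x, x)                  # uni-modal self-attention
    x = x + gate * attention(x, other, other)   # cross-modal fusion inside the backbone
    return x

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(6, 8))   # 6 image patches, width 8 (toy sizes)
text_tokens = rng.normal(size=(4, 8))    # 4 text tokens, width 8

image_out = fused_block(image_tokens, text_tokens, gate=0.5)
text_out = fused_block(text_tokens, image_tokens, gate=0.5)
unimodal = fused_block(image_tokens, text_tokens, gate=0.0)
```

With `gate=0.0` the block reduces to a plain uni-modal layer, so in this sketch the same backbone can run with or without fusion. Each modality keeps its own sequence length, and only the token width has to match.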
Full Text Available
A novel GCN-based point cloud classification model robust to pose variances

https://doi.org/10.1016/j.patcog.2021.108251

Wang, Huafeng; Zhang, Yaming; Liu, Wanquan; Gu, Xianfeng; Jing, Xin; Liu, Zicheng (January 2022, Pattern Recognition)

Full Text Available
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Li, Chunyuan; Liu, Haotian; Li, Harold; Zhang, Pengchuan; Aneja, Jyoti; Yang, Jianwei; Jin, Ping; Hu, Houdong; Liu, Zicheng; Lee, Yong Jae; et al. (January 2022, Neural Information Processing Systems (NeurIPS))

Full Text Available
Human Action Image Generation with Differential Privacy

https://doi.org/10.1109/ICME46284.2020.9102767

Sun, Mingxuan; Wang, Qing; Liu, Zicheng (July 2020, IEEE International Conference on Multimedia and Expo (ICME))
Full Text Available
