NSF PAR Search | NSF Public Access Repository

SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Cheng, Zixu; Pu, Yujiang; Gong, Shaogang; Kordjamshidi, Parisa; Kong, Yu (October 2024, Springer)

Temporal grounding, also known as video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables the localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decompose-reconstruct manner to achieve compositional generalization. However, they only consider dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that have subtle differences from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5 Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at https://github.com/zxccade/SHINE.
Full Text Available
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning

Bao, Wentao; Chen, Lichang; Huang, Heng; Kong, Yu (August 2024, European Conference on Computer Vision (ECCV 2024))

Full Text Available
Facial Affective Behavior Analysis with Instruction Tuning

https://doi.org/10.1007/978-3-031-72649-1_10

Li, Yifan; Dao, Anh; Bao, Wentao; Tan, Zhen; Chen, Tianlong; Liu, Huan; Kong, Yu (September 2024, Springer Nature Switzerland)

Full Text Available
SHINE: Saliency-aware HIerarchical NEgative Ranking for Compositional Temporal Grounding

Cheng, Zixu; Pu, Yujiang; Gong, Shaogang; Kordjamshidi, Parisa; Kong, Yu (July 2024, arXiv)

Temporal grounding, a.k.a. video moment retrieval, aims at locating video segments corresponding to a given query sentence. The compositional nature of natural language enables the localization beyond predefined events, posing a certain challenge to the compositional generalizability of existing methods. Recent studies establish the correspondence between videos and queries through a decompose-reconstruct manner to achieve compositional generalization. However, they only consider dominant primitives and build negative queries through random sampling and recombination, resulting in semantically implausible negatives that hinder the models from learning rational compositions. In addition, recent DETR-based methods still underperform in compositional temporal grounding, showing irrational saliency responses when given negative queries that have subtle differences from positive queries. To address these limitations, we first propose a large language model-driven method for negative query construction, utilizing GPT-3.5-Turbo to generate semantically plausible hard negative queries. Subsequently, we introduce a coarse-to-fine saliency ranking strategy, which encourages the model to learn the multi-granularity semantic relationships between videos and hierarchical negative queries to boost compositional generalization. Extensive experiments on two challenging benchmarks validate the effectiveness and generalizability of our proposed method. Our code is available at this https URL.
Full Text Available
ATM: Action Temporality Modeling for Video Question Answering

Chen, Junwen; Zhu, Jie; Kong, Yu (October 2023, ACM)

Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal/temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporality reasoning via three-fold uniqueness: (1) rethinking the optical flow and realizing that optical flow is effective in capturing the long horizon temporality reasoning; (2) training the visual-text embedding by contrastive learning in an action-centric manner, leading to better action representations in both vision and text modalities; and (3) preventing the model from answering the question given the shuffled video in the fine-tuning stage, to avoid spurious correlation between appearance and motion and hence ensure faithful temporality reasoning. In the experiments, we show that ATM outperforms existing approaches in accuracy on multiple VideoQA benchmarks and exhibits better true temporality reasoning ability.
Full Text Available
GateHUB: Gated History Unit With Background Suppression for Online Action Detection

Chen, Junwen; Mittal, Gaurav; Yu, Ye; Kong, Yu; Chen, Mei (June 2022, IEEE Computer Society Conference on Computer Vision and Pattern Recognition)

Online action detection is the task of predicting the action as soon as it happens in a streaming video. A major challenge is that the model does not have access to the future and has to solely rely on the history, i.e., the frames observed so far, to make predictions. It is therefore important to accentuate parts of the history that are more informative to the prediction of the current frame. We present GateHUB, Gated History Unit with Background Suppression, that comprises a novel position-guided gated cross-attention mechanism to enhance or suppress parts of the history as per how informative they are for current frame prediction. GateHUB further proposes Future-augmented History (FaH) to make history features more informative by using subsequently observed frames when available. In a single unified framework, GateHUB integrates the transformer's ability of long-range temporal modeling and the recurrent model's capacity to selectively encode relevant information. GateHUB also introduces a background suppression objective to further mitigate false positive background frames that closely resemble the action frames. Extensive validation on three benchmark datasets, THUMOS, TVSeries, and HDD, demonstrates that GateHUB significantly outperforms all existing methods and is also more efficient than the existing best work. Furthermore, a flow-free version of GateHUB is able to achieve higher or close accuracy at 2.8x higher frame rate compared to all existing methods that require both RGB and optical flow information for prediction.
Full Text Available
A Dynamic Meta-Learning Model for Time-Sensitive Cold-Start Recommendations

Neupane, Krishna; Zheng, Ervine; Kong, Yu; Yu, Qi (February 2022, Proceedings of the AAAI Conference on Artificial Intelligence)

Full Text Available
Explainable Video Entailment with Grounded Visual Evidence

https://doi.org/10.1109/ICCV48922.2021.00203

Chen, Junwen; Kong, Yu (October 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))

Video entailment aims at determining if a hypothesis textual statement is entailed or contradicted by a premise video. The main challenge of video entailment is that it requires fine-grained reasoning to understand the complex and long story-based videos. To this end, we propose to incorporate visual grounding into the entailment by explicitly linking the entities described in the statement to the evidence in the video. If the entities are grounded in the video, we enhance the entailment judgment by focusing on the frames where the entities occur. Besides, in the entailment dataset, the entailed/contradictory (also called real/fake) statements are formed in pairs with subtle discrepancies, which allows an add-on explanation module to predict which words or phrases make the statement contradictory to the video and regularize the training of the entailment judgment. Experimental results demonstrate that our approach outperforms the state-of-the-art methods.
Full Text Available
Universal 3-Dimensional Perturbations for Black-Box Attacks on Video Recognition Systems

https://doi.org/10.1109/SP46214.2022.00025

Xie, Shangyu; Wang, Han; Kong, Yu; Hong, Yuan (January 2022, In Proceedings of the 43rd IEEE Symposium on Security and Privacy (Oakland'22))

Full Text Available
Deep Geo-Constrained Auto-Encoder for Non-Landmark GPS Estimation

https://doi.org/10.1109/TBDATA.2017.2773096

Jiang, Suhui; Kong, Yu; Fu, Yun (June 2019, IEEE Transactions on Big Data)

Full Text Available
