

Search for: All records

Creators/Authors contains: "Le, Ngan"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo (administrative interval).


  1. Video anomaly detection (VAD), commonly formulated as a weakly supervised multiple-instance learning problem due to its labor-intensive annotation, is a challenging problem in video surveillance, where the anomalous frames must be localized within an untrimmed video. In this paper, we first propose to utilize ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations in our novel technique. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA. (A minimal, illustrative sketch of self-attention snippet scoring under weak supervision is given after this results list.)
    Free, publicly-accessible full text available October 8, 2024
  2. Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations as a coherent story. Following the human perception process, in which a scene is effectively understood by decomposing it into visual (e.g., human, animal) and non-visual components (e.g., action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learned embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT. (A minimal, illustrative sketch of a symmetric VL contrastive loss is given after this results list.)
    Free, publicly-accessible full text available June 27, 2024
  3. In this paper, we present a novel Deformable Neural Articulations Network (DNA-Net), a template-free learning-based method for dynamic 3D human reconstruction from a single RGB-D sequence. Our proposed DNA-Net includes a Neural Articulation Prediction Network (NAP-Net), which represents non-rigid motions of a human by learning to predict a set of articulated bones that follow the movements of the human in the input sequence. Moreover, DNA-Net also includes a Signed Distance Field Network (SDF-Net) and an Appearance Network (Color-Net), which take advantage of powerful neural implicit functions for modeling 3D geometry and appearance. Finally, to avoid relying on external optical flow estimators to obtain deformation cues, as previous related works do, we propose a novel training loss, namely an Easy-to-Hard Geometric-based loss: a simple strategy that inherits the merits of the Chamfer distance to achieve good deformation guidance while avoiding its sensitivity to local mismatches. DNA-Net is trained end-to-end in a self-supervised manner directly on the input sequence to obtain 3D reconstructions of the input objects. Quantitative results on videos of the DeepDeform dataset show that DNA-Net outperforms related state-of-the-art methods by a clear margin; qualitative results additionally show that our method can reconstruct human shapes with high fidelity and detail. (A minimal, illustrative sketch of an easy-to-hard Chamfer-style loss is given after this results list.)
    Free, publicly-accessible full text available June 1, 2024
  4. Free, publicly-accessible full text available June 1, 2024
  5. Free, publicly-accessible full text available July 1, 2024
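The following sketch, for item 1, illustrates one way to score video snippets from pre-extracted CLIP ViT features with a self-attention layer over time and to train the scorer with a top-k multiple-instance-learning objective using video-level labels only. It is a minimal PyTorch sketch under assumed shapes and hyperparameters (feature dimension, number of heads, top-k value), not the authors' CLIP-TSA implementation.

import torch
import torch.nn as nn

class SnippetScorer(nn.Module):
    """Illustrative snippet scorer: self-attention over CLIP snippet features."""
    def __init__(self, feat_dim: int = 512, n_heads: int = 4):
        super().__init__()
        # Self-attention lets each snippet attend to the rest of the video,
        # modeling temporal dependencies before per-snippet scoring.
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)
        self.score = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, num_snippets, feat_dim) CLIP ViT embeddings.
        ctx, _ = self.attn(clip_feats, clip_feats, clip_feats)
        ctx = self.norm(clip_feats + ctx)
        return self.score(ctx).squeeze(-1)  # (batch, num_snippets) anomaly scores

def mil_loss(scores: torch.Tensor, video_labels: torch.Tensor, k: int = 3) -> torch.Tensor:
    # Weak supervision: only video-level labels exist, so the loss is applied
    # to the mean of each video's top-k snippet scores.
    topk = scores.topk(k, dim=1).values.mean(dim=1)
    return nn.functional.binary_cross_entropy(topk, video_labels.float())

At inference time, the per-snippet scores themselves would serve to localize the anomalous frames within the untrimmed video.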
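For item 2, the sketch below shows a symmetric visual-linguistic contrastive loss of the kind the abstract describes, pulling matched event and caption embeddings together and pushing mismatched pairs apart. The shared embedding space, the temperature, and the use of in-batch negatives are assumptions for illustration, not the exact VLTinT objective.

import torch
import torch.nn.functional as F

def vl_contrastive_loss(event_feats: torch.Tensor,
                        caption_feats: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # event_feats / caption_feats: (batch, dim) embeddings of matched
    # event clips and their ground-truth captions.
    v = F.normalize(event_feats, dim=-1)
    t = F.normalize(caption_feats, dim=-1)
    logits = v @ t.t() / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs lie on the diagonal; the rest of the batch acts as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))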
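For item 3, the sketch below gives one plausible reading of an easy-to-hard, Chamfer-style geometric loss: only the most reliable (closest) correspondences contribute early in training, and the kept fraction grows over time so that harder matches are added gradually. The curriculum schedule and the keep_ratio parameter are assumptions for illustration, not the paper's exact formulation.

import torch

def easy_to_hard_chamfer(pred_pts: torch.Tensor,
                         obs_pts: torch.Tensor,
                         keep_ratio: float) -> torch.Tensor:
    # pred_pts: (N, 3) points sampled from the reconstructed surface.
    # obs_pts:  (M, 3) points back-projected from the input depth frame.
    # keep_ratio in (0, 1]: fraction of closest matches kept, grown per epoch.
    d = torch.cdist(obs_pts, pred_pts)        # (M, N) pairwise distances
    nearest, _ = d.min(dim=1)                 # each observed point's nearest match
    k = max(1, int(keep_ratio * nearest.numel()))
    easy, _ = nearest.topk(k, largest=False)  # keep only the easiest correspondences
    return easy.mean()

Ramping keep_ratio toward 1.0 across training would let the loss guide the deformation without letting local mismatches dominate the early stages.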