skip to main content

Search for: All records

Creators/Authors contains: "Tang, Hao"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Free, publicly-accessible full text available September 1, 2024
  2. Free, publicly-accessible full text available June 1, 2024
  3. Free, publicly-accessible full text available June 18, 2024
  4. Free, publicly-accessible full text available June 18, 2024
  5. Online classes are typically conducted by using video conferencing software such as Zoom, Microsoft Teams, and Google Meet. Research has identified drawbacks of online learning, such as “Zoom fatigue”, characterized by distractions and lack of engagement. This study presents the CUNY Affective and Responsive Virtual Environment (CARVE) Hub, a novel virtual reality hub that uses a facial emotion classification model to generate emojis for affective and informal responsive interaction in a 3D virtual classroom setting. A web-based machine learning model is employed for facial emotion classification, enabling students to communicate four basic emotions live through automated web camera capture in a virtual classroom without activating their cameras. The experiment is conducted in undergraduate classes on both Zoom and CARVE, and the results of a survey indicate that students have a positive perception of interactions in the proposed virtual classroom compared with Zoom. Correlations between automated emojis and interactions are also observed. This study discusses potential explanations for the improved interactions, including a decrease in pressure on students when they are not showing faces. In addition, video panels in traditional remote classrooms may be useful for communication but not for interaction. Students favor features in virtual reality, such as spatial audio and the ability to move around, with collaboration being identified as the most helpful feature. 
    more » « less
    Free, publicly-accessible full text available March 1, 2024
  6. With the ever-increasing popularity of edge devices, it is necessary to implement real-time segmentation on the edge for autonomous driving and many other applications. Vision Transformers (ViTs) have shown considerably stronger results for many vision tasks. However, ViTs with the fullattention mechanism usually consume a large number of computational resources, leading to difficulties for realtime inference on edge devices. In this paper, we aim to derive ViTs with fewer computations and fast inference speed to facilitate the dense prediction of semantic segmentation on edge devices. To achieve this, we propose a pruning parameterization method to formulate the pruning problem of semantic segmentation. Then we adopt a bi-level optimization method to solve this problem with the help of implicit gradients. Our experimental results demonstrate that we can achieve 38.9 mIoU on ADE20K val with a speed of 56.5 FPS on Samsung S21, which is the highest mIoU under the same computation constraint with real-time inference. 
    more » « less
    Free, publicly-accessible full text available June 1, 2024
  7. Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization. Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference, while time-consuming training is still unavoidable. In contrast, this paper points out that the million-scale training data is redundant, which is the fundamental reason for the tedious training. To address the issue, this paper aims to introduce sparsity into data and proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy reduction scheme, by exploring the sparsity under three levels: number of training examples in the dataset, number of patches (tokens) in each example, and number of connections between tokens that lie in attention weights. With extensive experiments, we demonstrate that our proposed technique can noticeably accelerate training for various ViT architectures while maintaining accuracy. Remarkably, under certain ratios, we are able to improve the ViT accuracy rather than compromising it. For example, we can achieve 15.2% speedup with 72.6% (+0.4) Top-1 accuracy on Deit-T, and 15.7% speedup with 79.9% (+0.1) Top-1 accuracy on Deit-S. This proves the existence of data redundancy in ViT. Our code
is released at

    more » « less
    Free, publicly-accessible full text available June 27, 2024
  8. Contextual information has been widely used in many computer vision tasks. However, existing approaches design specific contextual information mechanisms for different tasks. In this work, we propose a general context learning and reasoning framework for object detection tasks with three components: local contextual labeling, contextual graph generation and spatial contextual reasoning. With simple user defined parameters, local contextual labeling automatically enlarge the small object labels to include more local contextual information. A Graph Convolutional Network learns over the generated contextual graph to build a semantic space. A general spatial relation is used in spatial contextual reasoning to optimize the detection results. All three components can be easily added and removed from a standard object detector. In addition, our approach also automates the training process to find the optimal combinations of user defined parameters. The general framework can be easily adapted to different tasks. In this paper we compare our framework with a previous multistage context learning framework specifically designed for storefront accessibility detection and a state of the art detector for pedestrian detection. Experimental results on two urban scene datasets demonstrate that our proposed general framework can achieve same performance as the specifically designed multistage framework on storefront accessibility detection, and with improved performance on pedestrian detection over the state of art detector. 
    more » « less