We propose TubeR: a simple solution for spatio-temporal video action detection. Unlike existing methods that depend on either an offline actor detector or hand-designed actor-positional hypotheses such as proposals or anchors, we propose to directly detect an action tubelet in a video by performing action localization and recognition simultaneously from a single representation. TubeR learns a set of tubelet queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively strengthens model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head that uses short-term and long-term context to strengthen action classification, and an action-switch regression head for detecting the precise temporal extent of an action. TubeR directly produces action tubelets of variable length and maintains good results even for long video clips. TubeR outperforms the previous state-of-the-art on the commonly used action detection datasets AVA, UCF101-24 and JHMDB51-21. Code will be available in GluonCV (https://cv.gluon.ai/).
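As a rough illustration of the tubelet-query idea sketched in this abstract, the following PyTorch snippet shows a DETR-style decoder in which learned queries attend to flattened spatio-temporal features and each query is decoded into one box per frame (a tubelet) plus a clip-level action score. All module names, layer counts, and dimensions are illustrative assumptions, not the authors' released code.

# Minimal sketch of tubelet-query decoding, assuming a DETR-style decoder.
import torch
import torch.nn as nn

class TubeletDecoderSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=15, num_frames=8, num_classes=80):
        super().__init__()
        self.num_frames = num_frames
        # One learned embedding per tubelet query.
        self.tubelet_queries = nn.Embedding(num_queries, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)
        # Per-frame box regression: (cx, cy, w, h) for each frame in the clip.
        self.box_head = nn.Linear(d_model, num_frames * 4)
        # Clip-level action classification per tubelet (+1 for "no action").
        self.cls_head = nn.Linear(d_model, num_classes + 1)

    def forward(self, video_features):
        # video_features: (batch, T*H*W, d_model) flattened spatio-temporal tokens.
        b = video_features.size(0)
        queries = self.tubelet_queries.weight.unsqueeze(0).expand(b, -1, -1)
        decoded = self.decoder(queries, video_features)            # (b, num_queries, d_model)
        boxes = self.box_head(decoded).view(b, -1, self.num_frames, 4).sigmoid()
        logits = self.cls_head(decoded)                            # (b, num_queries, classes+1)
        return boxes, logits

# Example: an 8-frame clip encoded into 392 spatio-temporal tokens.
feats = torch.randn(2, 392, 256)
boxes, logits = TubeletDecoderSketch()(feats)
print(boxes.shape, logits.shape)  # torch.Size([2, 15, 8, 4]) torch.Size([2, 15, 81])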
-
In clinical settings, most automatic recognition systems use visual or sensory data to recognize activities. These systems cannot recognize activities that rely on verbal assessment, lack visual cues, or do not use medical devices. We examined speech-based activity and activity-stage recognition in a clinical domain, making the following contributions. (1) We collected a high-quality dataset representing common activities and activity stages during actual trauma resuscitation events, the initial evaluation and treatment of critically injured patients. (2) We introduced a novel multimodal network based on the audio signal and a set of keywords that does not require a high-performing automatic speech recognition (ASR) engine. (3) We designed novel contextual modules to capture dynamic dependencies in team conversations about activities and stages during a complex workflow. (4) We introduced a data augmentation method that simulates team communication by combining selected utterances and their audio clips, and showed that this method improved performance in our data-limited scenario. In offline experiments, our proposed context-aware multimodal model achieved F1-scores of 73.2±0.8% and 78.1±1.1% for activity and activity-stage recognition, respectively. In online experiments, performance declined by about 10% for both recognition types when using utterance-level segmentation of the ASR output, and by about 15% when we omitted the utterance-level segmentation. Our experiments showed the feasibility of speech-based activity and activity-stage recognition during dynamic clinical events.
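A minimal sketch of the utterance-combining augmentation described in contribution (4): utterances that share an activity label are sampled, and their audio clips and keyword sets are concatenated into a simulated team-communication sample. The field names and sampling scheme here are assumptions for illustration, not the paper's exact recipe.

# Illustrative augmentation: combine utterances with the same activity label.
import random
import numpy as np

def augment_activity_sample(utterances, label, k=3, seed=None):
    """utterances: list of dicts like {"audio": np.ndarray, "keywords": set, "label": str}."""
    rng = random.Random(seed)
    pool = [u for u in utterances if u["label"] == label]
    chosen = rng.sample(pool, min(k, len(pool)))
    combined_audio = np.concatenate([u["audio"] for u in chosen])
    combined_keywords = set().union(*(u["keywords"] for u in chosen))
    return {"audio": combined_audio, "keywords": combined_keywords, "label": label}

# Example with toy 16 kHz waveforms.
toy = [{"audio": np.zeros(16000), "keywords": {"pulse"}, "label": "circulation"},
       {"audio": np.ones(8000), "keywords": {"bp", "cuff"}, "label": "circulation"}]
sample = augment_activity_sample(toy, "circulation", k=2, seed=0)
print(sample["audio"].shape, sample["keywords"])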
-
We introduce Video Transformer (VidTr) with separable attention for video classification. Compared with commonly used 3D networks, VidTr is able to aggregate spatio-temporal information via stacked attentions and provides better performance with higher efficiency. We first introduce the vanilla video transformer and show that the transformer module is able to perform spatio-temporal modeling from raw pixels, but with heavy memory usage. We then present VidTr, which reduces the memory cost by 3.3× while keeping the same performance. To further optimize the model, we propose standard-deviation-based topK pooling for attention (pool topK_std), which reduces the computation by dropping non-informative features along the temporal dimension. VidTr achieves state-of-the-art performance on five commonly used datasets with lower computational requirements, showing both the efficiency and effectiveness of our design. Finally, error analysis and visualization show that VidTr is especially good at predicting actions that require long-term temporal reasoning.
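As a simplified illustration of dropping non-informative temporal positions, the sketch below scores each temporal position by the standard deviation of its features and keeps only the top-K positions. The paper applies this pooling to attention rather than raw features, so the tensor layout and scoring rule here are assumptions for illustration only.

# Illustrative std-based topK pooling along the temporal dimension.
import torch

def topk_std_temporal_pool(x, k):
    """x: (batch, T, N, C) tokens with T temporal positions; returns (batch, k, N, C)."""
    # Informativeness score per temporal position: std over spatial tokens and channels.
    scores = x.flatten(2).std(dim=-1)                        # (batch, T)
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep temporal order
    idx = idx[:, :, None, None].expand(-1, -1, x.size(2), x.size(3))
    return torch.gather(x, 1, idx)

x = torch.randn(2, 16, 49, 256)      # 16 frames, 49 spatial tokens, 256 channels
pooled = topk_std_temporal_pool(x, k=8)
print(pooled.shape)                  # torch.Size([2, 8, 49, 256])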