NSF PAR Search | NSF Public Access Repository

POTLoc: Pseudo-label Oriented Transformer for point-supervised temporal Action Localization

https://doi.org/10.1016/j.cviu.2024.104044

Vahdani, Elahe; Tian, Yingli (September 2024, Computer Vision and Image Understanding)

This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most of the current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn merely the most distinctive segments of actions, leading to the creation of incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently results in the production of ‘pseudo-labels’ to serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer for enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on THUMOS’14 and ActivityNet-v1.2 datasets.
Full Text Available
Sequential Point Clouds: A Survey

https://doi.org/10.1109/TPAMI.2024.3365970

Wang, Haiyan; Tian, Yingli (August 2024, IEEE Transactions on Pattern Analysis and Machine Intelligence)

Full Text Available
Multi-Modal Multi-Channel American Sign Language Recognition

https://doi.org/10.1142/S2972335324500017

Vahdani, Elahe; Jing, Longlong; Huenerfauth, Matt; Tian, Yingli (March 2024, International Journal of Artificial Intelligence and Robotics Research)

In this paper, we propose a machine learning-based multi-stream framework to recognize American Sign Language (ASL) manual signs and nonmanual gestures (face and head movements) in real time from RGB-D videos. Our approach is based on 3D Convolutional Neural Networks (3D CNNs) and fuses multi-modal features, including hand gestures, facial expressions, and body poses, from multiple channels (RGB, Depth, Motion, and Skeleton joints). To learn the overall temporal dynamics in a video, a proxy video is generated by selecting a subset of frames from each video; these proxy videos are then used to train the proposed 3D CNN model. We collected a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera. Each video consists of 100 ASL manual signs, along with RGB channel, Depth maps, Skeleton joints, Face features, and HD face. The dataset is fully annotated for each semantic region (i.e., the time duration of each sign that the human signer performs). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL sign glosses in our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on a large-scale dataset, ChaLearn IsoGD, achieving state-of-the-art results.
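The abstract above says a proxy video is built from a subset of frames but does not say how the subset is chosen. As a rough illustration only, the sketch below assumes evenly spaced sampling; the function name `proxy_frame_indices` and the parameter `proxy_len` are hypothetical and not taken from the paper.

```python
def proxy_frame_indices(num_frames: int, proxy_len: int) -> list[int]:
    """Pick `proxy_len` evenly spaced frame indices from a video.

    Assumption: uniform sampling. The paper's actual selection rule
    is not given in the abstract and may differ.
    """
    if num_frames <= 0 or proxy_len <= 0:
        raise ValueError("num_frames and proxy_len must be positive")
    if proxy_len >= num_frames:
        # Video is already short enough: keep every frame.
        return list(range(num_frames))
    step = num_frames / proxy_len
    # Take the frame at the centre of each of the proxy_len equal bins.
    return [int(i * step + step / 2) for i in range(proxy_len)]


if __name__ == "__main__":
    print(proxy_frame_indices(10, 5))
```

The indices returned this way are strictly increasing, so the temporal order of the original video is preserved in the proxy.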
Full Text Available
Nonverbal Communication Cue Recognition: A Pathway to More Accessible Communication

Shafique, Zoya; Wang, Haiyan; Tian, Yingli (June 2023, Proceedings of the Women in Computer Vision Workshop, in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR))

Nonverbal communication, such as body language, facial expressions, and hand gestures, is crucial to human communication as it conveys more information about emotions and attitudes than spoken words. However, individuals who are blind or have low vision (BLV) may not have access to this method of communication, leading to asymmetry in conversations. Developing systems to recognize nonverbal communication cues (NVCs) for the BLV community would enhance communication and understanding for both parties. This paper focuses on developing a multimodal computer vision system to recognize and detect NVCs. To accomplish our objective, we are collecting a dataset focused on nonverbal communication cues. Here, we propose a baseline model for recognizing NVCs and present initial results on the Aff-Wild2 dataset. Our baseline model achieved an accuracy of 68% and an F1-score of 64% on the Aff-Wild2 validation set, making it comparable with previous state-of-the-art results. Furthermore, we discuss the various challenges associated with NVC recognition as well as the limitations of our current work.
Full Text Available
Deep Learning-based Action Detection in Untrimmed Videos: A Survey

https://doi.org/10.1109/TPAMI.2022.3193611

Vahdani, Elahe; Tian, Yingli (April 2023, IEEE Transactions on Pattern Analysis and Machine Intelligence)

Understanding human behavior and activity facilitates the advancement of numerous real-world applications and is critical for video analysis. Despite the progress of action recognition algorithms in trimmed videos, the majority of real-world videos are lengthy and untrimmed with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundaries of actions and classify the action categories. The temporal activity detection task has been investigated in full and limited supervision settings, depending on the availability of action annotations. This paper provides an extensive overview of deep learning-based algorithms that tackle temporal action detection in untrimmed videos with different supervision levels, including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, this paper reviews advances in spatio-temporal action detection, where actions are localized in both temporal and spatial dimensions. Action detection in the online setting is also reviewed, where the goal is to detect actions in each frame of a live video stream without considering any future context. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of the state-of-the-art methods is compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.
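The abstract above mentions evaluation metrics without naming them. One quantity widely used when scoring temporal action detections is the temporal Intersection over Union (tIoU) between a predicted segment and a ground-truth segment. The sketch below is a generic illustration under that assumption, not code from the survey; the function name `temporal_iou` and the `(start, end)` tuple convention are hypothetical.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU of two segments given as (start, end) in seconds."""
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Prediction covers 2s-6s, ground truth covers 4s-8s.
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

A detection is typically counted as correct when its tIoU with a ground-truth instance of the same class exceeds a chosen threshold, and results are then reported at one or more thresholds.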
Full Text Available
AI-Driven Robust Kidney and Renal Mass Segmentation and Classification on 3D CT Images

https://doi.org/10.3390/bioengineering10010116

Liu, Jingya; Yildirim, Onur; Akin, Oguz; Tian, Yingli (January 2023, Bioengineering)

Early intervention in kidney cancer helps to improve survival rates. Abdominal computed tomography (CT) is often used to diagnose renal masses. In clinical practice, the manual segmentation and quantification of organs and tumors are expensive and time-consuming. Artificial intelligence (AI) has shown a significant advantage in assisting cancer diagnosis. To reduce the workload of manual segmentation and avoid unnecessary biopsies or surgeries, in this paper, we propose a novel end-to-end AI-driven automatic kidney and renal mass diagnosis framework to identify the abnormal areas of the kidney and diagnose the histological subtypes of renal cell carcinoma (RCC). The proposed framework first segments the kidney and renal mass regions by a 3D deep learning architecture (Res-UNet), followed by a dual-path classification network utilizing local and global features for the subtype prediction of the most common RCCs: clear cell, chromophobe, oncocytoma, papillary, and other RCC subtypes. To improve the robustness of the proposed framework on the dataset collected from various institutions, a weakly supervised learning schema is proposed to leverage the domain gap between various vendors via very few CT slice annotations. Our proposed diagnosis system can accurately segment the kidney and renal mass regions and predict tumor subtypes, outperforming existing methods on the KiTS19 dataset. Furthermore, cross-dataset validation results demonstrate the robustness of datasets collected from different institutions trained via the weakly supervised learning schema.
Full Text Available
Robust and accurate pulmonary nodule detection with self-supervised feature learning on domain adaptation

https://doi.org/10.3389/fradi.2022.1041518

Liu, Jingya; Cao, Liangliang; Akin, Oguz; Tian, Yingli (December 2022, Frontiers in Radiology)

Medical imaging data annotation is expensive and time-consuming. Supervised deep learning approaches may encounter overfitting if trained with limited medical data, and further affect the robustness of computer-aided diagnosis (CAD) on CT scans collected by various scanner vendors. Additionally, the high false-positive rate in automatic lung nodule detection methods prevents their applications in daily clinical routine diagnosis. To tackle these issues, we first introduce a novel self-learning schema to train a pre-trained model by learning rich feature representatives from large-scale unlabeled data without extra annotation, which guarantees a consistent detection performance over novel datasets. Then, a 3D feature pyramid network (3DFPN) is proposed for high-sensitivity nodule detection by extracting multi-scale features, where the weights of the backbone network are initialized by the pre-trained model and then fine-tuned in a supervised manner. Further, a High Sensitivity and Specificity (HS2) network is proposed to reduce false positives by tracking the appearance changes among continuous CT slices on Location History Images (LHI) for the detected nodule candidates. The proposed method’s performance and robustness are evaluated on several publicly available datasets, including LUNA16, SPIE-AAPM, LungTIME, and HMS. Our proposed detector achieves the state-of-the-art result of 90.6% sensitivity at 1/8 false positive per scan on the LUNA16 dataset. The proposed framework’s generalizability has been evaluated on three additional datasets (i.e., SPIE-AAPM, LungTIME, and HMS) captured by different types of CT scanners.
Full Text Available
Virtual contrast enhancement for CT scans of abdomen and pelvis

https://doi.org/10.1016/j.compmedimag.2022.102094

Liu, Jingya; Tian, Yingli; Duzgol, Cihan; Akin, Oguz; Ağıldere, A. Muhteşem; Haberal, K. Murat; Coşkun, Mehmet (September 2022, Computerized Medical Imaging and Graphics)

Full Text Available
PSMNet: Position-aware Stereo Merging Network for Room Layout Estimation

Wang, Haiyan; Hutchcroft, Will (June 2022, IEEE Conference on Computer Vision and Pattern Recognition (CVPR))

In this paper, we propose a new deep learning-based method for estimating room layout given a pair of 360° panoramas. Our system, called Position-aware Stereo Merging Network or PSMNet, is an end-to-end joint layout-pose estimator. PSMNet consists of a Stereo Pano Pose (SP2) transformer and a novel Cross-Perspective Projection (CP2) layer. The stereo-view SP2 transformer is used to implicitly infer correspondences between views, and can handle noisy poses. The pose-aware CP2 layer is designed to render features from the adjacent view to the anchor (reference) view, in order to perform view fusion and estimate the visible layout. Our experiments and analysis validate our method, which significantly outperforms the state-of-the-art layout estimators, especially for large and complex room spaces.
Full Text Available
ASL-Homework-RGBD Dataset: An Annotated Dataset of 45 fluent and non-fluent Signers Performing American Sign Language Homeworks

Hassan, Saad; Seita, Matthew (June 2022, 10th Workshop on the Representation and Processing of Sign Languages: Multilingual Sign Language Resources)

We are releasing a dataset containing videos of both fluent and non-fluent signers using American Sign Language (ASL), which were collected using a Kinect v2 sensor. This dataset was collected as a part of a project to develop and evaluate computer vision algorithms to support new technologies for automatic detection of ASL fluency attributes. A total of 45 fluent and non-fluent participants were asked to perform signing homework assignments that are similar to the assignments used in introductory or intermediate level ASL courses. The data is annotated to identify several aspects of signing including grammatical features and non-manual markers. Sign language recognition is currently very data-driven and this dataset can support the design of recognition technologies, especially technologies that can benefit ASL learners. This dataset might also be interesting to ASL education researchers who want to contrast fluent and non-fluent signing.
Full Text Available
