In this paper, we propose a machine learning-based multi-stream framework to recognize American Sign Language (ASL) manual signs and nonmanual gestures (face and head movements) in real time from RGB-D videos. Our approach is based on 3D Convolutional Neural Networks (3D CNNs) and fuses multi-modal features, including hand gestures, facial expressions, and body poses, from multiple channels (RGB, Depth, Motion, and Skeleton joints). To learn the overall temporal dynamics of a video, a proxy video is generated by selecting a subset of frames from each video; these proxy videos are then used to train the proposed 3D CNN model. We collected a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera. Each video contains 100 ASL manual signs and provides the RGB channel, depth maps, skeleton joints, face features, and HD face data. The dataset is fully annotated for each semantic region (i.e., the time duration of each sign that the human signer performs). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL sign glosses on our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on a large-scale dataset, ChaLearn IsoGD, where it achieves state-of-the-art results.
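A minimal sketch of the proxy-video idea described in the abstract above: a fixed-length subset of frames is sampled from each clip so that the 3D CNN sees the whole temporal extent of a sign at a constant input size. The frame count (16) and the uniform-sampling strategy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def make_proxy_video(frames: np.ndarray, num_proxy_frames: int = 16) -> np.ndarray:
    """Select a temporally uniform subset of frames from a (T, H, W, C) clip."""
    total = frames.shape[0]
    # Evenly spaced indices spanning the full clip; short clips repeat frames.
    idx = np.linspace(0, total - 1, num_proxy_frames).round().astype(int)
    return frames[idx]

# Usage: build one proxy clip per modality (e.g. RGB, depth, motion, rendered skeleton)
# and feed the stack of proxy clips to the multi-stream 3D CNN.
rgb_clip = np.random.rand(120, 112, 112, 3)   # stand-in for a decoded RGB video
rgb_proxy = make_proxy_video(rgb_clip)        # shape: (16, 112, 112, 3)
```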
Creating questionnaires that align with ASL linguistic principles and cultural practices within the Deaf community
Conducting human-centered research by, with, and for the ASL-signing Deaf community requires rethinking current human-computer interaction processes in order to meet the community's linguistic and cultural needs and expectations. This paper highlights some key considerations that emerged in our work creating an ASL-based questionnaire and offers our recommendations for handling them.
- Award ID(s): 1901026
- PAR ID: 10587835
- Publisher / Repository: ACM
- Date Published:
- ISBN: 9781450371032
- Page Range / eLocation ID: 1 to 4
- Subject(s) / Keyword(s): questionnaires; American Sign Language; ASL
- Format(s): Medium: X
- Location: Virtual Event, Greece
- Sponsoring Org: National Science Foundation
More Like this
-
Sign language is a complex visual language, and automatic interpretation of sign language can facilitate communication involving deaf individuals. As an essential component of sign language, fingerspelling connects natural spoken languages to sign language and expands the sign language vocabulary. In practice, it is challenging to analyze fingerspelling alphabets due to their signing speed and small motion range. The use of synthetic data has the potential to further improve fingerspelling analysis at scale. In this paper, we evaluate how different video-based human representations perform in a framework for alphabet generation for American Sign Language (ASL). We tested three mainstream video-based human representations: two-stream inflated 3D ConvNet, 3D landmarks of body joints, and rotation matrices of body joints. We also evaluated the effect of different skeleton graphs and selected body joints. The generation process for ASL fingerspelling used a transformer-based Conditional Variational Autoencoder. To train the model, we collected ASL alphabet signing videos from 17 signers with dynamic alphabet signing. The generated alphabets were evaluated using automatic quality metrics such as FID, as well as supervised metrics obtained by recognizing the generated entries with Spatio-Temporal Graph Convolutional Networks. Our experiments show that using the rotation matrices of the upper-body joints and the signing hand gives the best results for generating ASL alphabet signing. Going forward, our goal is to produce articulated fingerspelled words by combining the individual alphabets learned in this work.
-
Previous research underscored the potential of danmaku, a text-based commenting feature overlaid on videos, in engaging hearing audiences. Yet, for many Deaf and hard-of-hearing (DHH) individuals, American Sign Language (ASL) takes precedence over English. To improve inclusivity, we introduce "Signmaku," a new commenting mechanism that uses ASL, serving as a sign language counterpart to danmaku. Through a need-finding study (N=12) and a within-subject experiment (N=20), we evaluated three design styles: real human faces, cartoon-like figures, and robotic representations. The results showed that cartoon-like signmaku not only entertained but also encouraged participants to create and share ASL comments, with fewer privacy concerns than the other designs. Conversely, the robotic representations faced challenges in accurately depicting hand movements and facial expressions, resulting in higher cognitive demands on users. Signmaku featuring real human faces elicited the lowest cognitive load and was the most comprehensible of the three types. Our findings offer novel design implications for leveraging generative AI to create signmaku comments, enriching co-learning experiences for DHH individuals.
-
Over the past decade, there have been great advancements in radio frequency sensor technology for human-computer interaction applications, such as gesture recognition and, more broadly, human activity recognition. While there is a significant amount of research on these topics, in most cases experimental data are acquired in controlled settings by directing participants on what motion to articulate. However, especially for communicative motions such as sign language, such directed datasets do not accurately capture natural, in situ articulations. This results in a difference between the distributions of directed American Sign Language (ASL) and natural ASL, which severely degrades natural sign language recognition in real-world scenarios. To overcome these challenges and acquire more representative data for training deep models, the authors develop an interactive gaming environment, ChessSIGN, which records video and radar data of participants as they play the game without any external direction. The authors investigate various ways of generating synthetic samples from directed ASL data but show that ultimately such data do not offer much improvement over simply initialising with imagery from ImageNet. In contrast, the authors propose an interactive learning paradigm in which model training is shown to improve as more and more natural ASL samples are acquired and augmented with synthetic samples generated by a physics-aware generative adversarial network. The authors show that the proposed approach enables the recognition of natural ASL in a real-world setting, achieving an accuracy of 69% for 29 ASL signs, a 60% improvement over conventional training with directed ASL data.
-
Sign words are the building blocks of any sign language. In this work, we present wSignGen, a word-conditioned 3D American Sign Language (ASL) generation model dedicated to synthesizing realistic and grammatically accurate motion sequences for sign words. Our approach leverages a transformer-based diffusion model trained on a curated dataset of 3D motion meshes from word-level ASL videos. By integrating CLIP, wSignGen offers two advantages: image-based generation, which is particularly useful for children who are learning sign language but not yet able to read, and the ability to generalize to unseen synonyms. Experiments demonstrate that wSignGen significantly outperforms the baseline model on the task of sign word generation. Moreover, human evaluation shows that wSignGen can generate high-quality, grammatically correct ASL signs effectively conveyed through 3D avatars.
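A minimal sketch of why CLIP conditioning gives the two benefits named in the last abstract above (wSignGen): a word or a picture can be mapped into the same embedding space, so either can serve as the conditioning signal for the motion generator, and embeddings of synonyms land near each other. The model name and calls below are standard Hugging Face CLIP usage, assumed for illustration rather than taken from the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Condition from a word (e.g. the sign gloss) ...
text_inputs = processor(text=["book"], return_tensors="pt", padding=True)
text_cond = model.get_text_features(**text_inputs)      # shape: (1, 512)

# ... or from a picture, useful for pre-literate learners as the abstract notes.
image = Image.open("book_photo.jpg")                     # hypothetical input image
image_inputs = processor(images=image, return_tensors="pt")
image_cond = model.get_image_features(**image_inputs)   # shape: (1, 512)

# Either vector could then be fed as the condition to a transformer-based
# diffusion model that produces the 3D sign motion sequence.
```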