skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A Comparative Study of Video-Based Human Representations for American Sign Language Alphabet Generation
Sign language is a complex visual language, and automatic interpretations of sign language can facilitate communication involving deaf individuals. As one of the essential components of sign language, fingerspelling connects the natural spoken languages to the sign language and expands the scale of sign language vocabulary. In practice, it is challenging to analyze fingerspelling alphabets due to their signing speed and small motion range. The usage of synthetic data has the potential of further improving fingerspelling alphabets analysis at scale. In this paper, we evaluate how different video-based human representations perform in a framework for Alphabet Generation for American Sign Language (ASL). We tested three mainstream video-based human representations: twostream inflated 3D ConvNet, 3D landmarks of body joints, and rotation matrices of body joints. We also evaluated the effect of different skeleton graphs and selected body joints. The generation process of ASL fingerspelling used a transformerbased Conditional Variational Autoencoder. To train the model, we collected ASL alphabet signing videos from 17 signers with dynamic alphabet signing. The generated alphabets were evaluated using automatic metrics of quality such as FID, and we also considered supervised metrics by recognizing the generated entries using Spatio-Temporal Graph Convolutional Networks. Our experiments show that using the rotation matrices of the upper body joints and the signing hand give the best results for the generation of ASL alphabet signing. Going forward, our goal is to produce articulated fingerspelling words by combining individual alphabets learned in this work.  more » « less
Award ID(s):
2223507
PAR ID:
10569916
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-9494-8
Page Range / eLocation ID:
1 to 6
Format(s):
Medium: X
Location:
Istanbul, Turkiye
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose a machine learning-based multi-stream framework to recognize American Sign Language (ASL) manual signs and nonmanual gestures (face and head movements) in real time from RGB-D videos. Our approach is based on 3D Convolutional Neural Networks (3D CNNs) by fusing the multi-modal features including hand gestures, facial expressions, and body poses from multiple channels (RGB, Depth, Motion, and Skeleton joints). To learn the overall temporal dynamics in a video, a proxy video is generated by selecting a subset of frames for each video which are then used to train the proposed 3D CNN model. We collected a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera. Each video consists of 100 ASL manual signs, along with RGB channel, Depth maps, Skeleton joints, Face features, and HD face. The dataset is fully annotated for each semantic region (i.e. the time duration of each sign that the human signer performs). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL sign glosses in our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on a large-scale dataset, ChaLearn IsoGD, achieving the state-of-the-art results. 
    more » « less
  2. Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model's robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page 
    more » « less
  3. null (Ed.)
    Over the years, there has been much research in both wearable and video-based American Sign Language (ASL) recognition systems. However, the restrictive and invasive nature of these sensing modalities remains a significant disadvantage in the context of Deaf-centric smart environments or devices that are responsive to ASL. This paper investigates the efficacy of RF sensors for word-level ASL recognition in support of human-computer interfaces designed for deaf or hard-of-hearing individuals. A principal challenge is the training of deep neural networks given the difficulty in acquiring native ASL signing data. In this paper, adversarial domain adaptation is exploited to bridge the physical/kinematic differences between the copysigning of hearing individuals (repetition of sign motion after viewing a video), and native signing of Deaf individuals who are fluent in sign language. Domain adaptation results are compared with those attained by directly synthesizing ASL signs using generative adversarial networks (GANs). Kinematic improvements to the GAN architecture, such as the insertion of micro-Doppler signature envelopes in a secondary branch of the GAN, are utilized to boost performance. Word-level classification accuracy of 91.3% is achieved for 20 ASL words. 
    more » « less
  4. Many technologies for human-computer interaction have been designed for hearing individuals and depend upon vocalized speech, precluding users of American Sign Language (ASL) in the Deaf community from benefiting from these advancements. While great strides have been made in ASL recognition with video or wearable gloves, the use of video in homes has raised privacy concerns, while wearable gloves severely restrict movement and infringe on daily life. Methods: This paper proposes the use of RF sensors for HCI applications serving the Deaf community. A multi-frequency RF sensor network is used to acquire non-invasive, non-contact measurements of ASL signing irrespective of lighting conditions. The unique patterns of motion present in the RF data due to the micro-Doppler effect are revealed using time-frequency analysis with the Short-Time Fourier Transform. Linguistic properties of RF ASL data are investigated using machine learning (ML). Results: The information content, measured by fractal complexity, of ASL signing is shown to be greater than that of other upper body activities encountered in daily living. This can be used to differentiate daily activities from signing, while features from RF data show that imitation signing by non-signers is 99% differentiable from native ASL signing. Feature-level fusion of RF sensor network data is used to achieve 72.5% accuracy in classification of 20 native ASL signs. Implications: RF sensing can be used to study dynamic linguistic properties of ASL and design Deaf-centric smart environments for non-invasive, remote recognition of ASL. ML algorithms should be benchmarked on native, not imitation, ASL data. 
    more » « less
  5. Abstract The lexical quality hypothesis proposes that the quality of phonological, orthographic, and semantic representations impacts reading comprehension. In Study 1, we evaluated the contributions of lexical quality to reading comprehension in 97 deaf and 98 hearing adults matched for reading ability. While phonological awareness was a strong predictor for hearing readers, for deaf readers, orthographic precision and semantic knowledge, not phonology, predicted reading comprehension (assessed by two different tests). For deaf readers, the architecture of the reading system adapts by shifting reliance from (coarse-grained) phonological representations to high-quality orthographic and semantic representations. In Study 2, we examined the contribution of American Sign Language (ASL) variables to reading comprehension in 83 deaf adults. Fingerspelling (FS) and ASL comprehension skills predicted reading comprehension. We suggest that FS might reinforce orthographic-to-semantic mappings and that sign language comprehension may serve as a linguistic basis for the development of skilled reading in deaf signers. 
    more » « less