Title: Multi-Modal Multi-Channel American Sign Language Recognition
In this paper, we propose a machine learning-based multi-stream framework to recognize American Sign Language (ASL) manual signs and nonmanual gestures (face and head movements) in real time from RGB-D videos. Our approach is based on 3D Convolutional Neural Networks (3D CNNs) that fuse multi-modal features, including hand gestures, facial expressions, and body poses, from multiple channels (RGB, depth, motion, and skeleton joints). To learn the overall temporal dynamics of a video, a proxy video is generated for each video by selecting a subset of its frames; these proxy videos are then used to train the proposed 3D CNN model. We collected a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera. Each video contains 100 ASL manual signs and includes the RGB channel, depth maps, skeleton joints, face features, and HD face points. The dataset is fully annotated for each semantic region (i.e., the time duration of each sign performed by the human signer). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL sign glosses on our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on the large-scale ChaLearn IsoGD dataset, achieving state-of-the-art results.
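The abstract above does not come with code; the sketch below only illustrates, in general terms, the two ideas it describes: building a proxy video by sampling a subset of frames, and late fusion of per-channel 3D CNN features. The uniform sampling scheme, the tiny network, and fusion by feature concatenation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of proxy-video sampling and multi-channel late fusion.
# Frame count, channel names, and fusion-by-concatenation are illustrative
# assumptions, not the paper's exact design.
import torch
import torch.nn as nn

def make_proxy_video(frames: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """Uniformly sample `num_frames` frames from a (T, C, H, W) clip."""
    t = frames.shape[0]
    idx = torch.linspace(0, t - 1, num_frames).long()
    return frames[idx]                       # (num_frames, C, H, W)

class MultiChannel3DCNN(nn.Module):
    """One small 3D CNN per channel (RGB, depth, ...), fused by concatenation."""
    def __init__(self, channels=("rgb", "depth"), in_ch=(3, 1), num_classes=100):
        super().__init__()
        self.streams = nn.ModuleDict({
            name: nn.Sequential(
                nn.Conv3d(c, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),     # global spatio-temporal pooling
                nn.Flatten(),
            )
            for name, c in zip(channels, in_ch)
        })
        self.classifier = nn.Linear(32 * len(channels), num_classes)

    def forward(self, inputs: dict) -> torch.Tensor:
        # Each input tensor is (B, C, T, H, W); per-stream features are
        # concatenated before the classification head (late fusion).
        feats = [self.streams[name](x) for name, x in inputs.items()]
        return self.classifier(torch.cat(feats, dim=1))

# Usage sketch: proxy frames must be permuted to (B, C, T, H, W), e.g.
# rgb of shape (B, 3, 16, 112, 112) and depth of shape (B, 1, 16, 112, 112).
```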
Award ID(s):
2041307
PAR ID:
10539424
Author(s) / Creator(s):
Publisher / Repository:
Galaxy Artificial Intelligence and Robotics Research Institute and World Scientific Publishing Company
Date Published:
Journal Name:
International Journal of Artificial Intelligence and Robotics Research
Volume:
01
Issue:
01
ISSN:
2972-3353
Subject(s) / Keyword(s):
American Sign Language recognition hand gesture recognition RGB-D video analysis multi-modality 3D Convolutional Neural Networks proxy video
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. {"Abstract":["The PoseASL dataset consists of color and depth videos collected from ASL signers at the Linguistic and Assistive Technologies Laboratory under the direction of Matt Huenerfauth, as part of a collaborative research project with researchers at the Rochester Institute of Technology, Boston University, and the University of Pennsylvania.\n\nAccess: After becoming an authorized user of Databrary, please contact Matt Huenerfauth if you have difficulty accessing this volume. \n\nWe have collected a new dataset consisting of color and depth videos of fluent American Sign Language signers performing sequences ASL signs and sentences. Given interest among sign-recognition and other computer-vision researchers in red-green-blue-depth (RBGD) video, we release this dataset for use by the research community. In addition to the video files, we share depth data files from a Kinect v2 sensor, as well as additional motion-tracking files produced through post-processing of this data.\n\nOrganization of the Dataset: The dataset is organized into sub-folders, with codenames such as "P01" or "P16" etc. These codenames refer to specific human signers who were recorded in this dataset. Please note that there was no participant P11 nor P14; those numbers were accidentally skipped during the process of making appointments to collect video stimuli.\n\nTask: During the recording session, the participant was met by a member of our research team who was a native ASL signer. No other individuals were present during the data collection session. After signing the informed consent and video release document, participants responded to a demographic questionnaire. Next, the data-collection session consisted of English word stimuli and cartoon videos. The recording session began with showing participants stimuli consisting of slides that displayed English word and photos of items, and participants were asked to produce the sign for each (PDF included in materials subfolder). Next, participants viewed three videos of short animated cartoons, which they were asked to recount in ASL:\n- Canary Row, Warner Brothers Merrie Melodies 1950 (the 7-minute video divided into seven parts)\n- Mr. Koumal Flies Like a Bird, Studio Animovaneho Filmu 1969\n- Mr. Koumal Battles his Conscience, Studio Animovaneho Filmu 1971\nThe word list and cartoons were selected as they are identical to the stimuli used in the collection of the Nicaraguan Sign Language video corpora - see: Senghas, A. (1995). Children\u2019s Contribution to the Birth of Nicaraguan Sign Language. Doctoral dissertation, Department of Brain and Cognitive Sciences, MIT.\n\nDemographics: All 14 of our participants were fluent ASL signers. As screening, we asked our participants: Did you use ASL at home growing up, or did you attend a school as a very young child where you used ASL? All the participants responded affirmatively to this question. A total of 14 DHH participants were recruited on the Rochester Institute of Technology campus. Participants included 7 men and 7 women, aged 21 to 35 (median = 23.5). All of our participants reported that they began using ASL when they were 5 years old or younger, with 8 reporting ASL use since birth, and 3 others reporting ASL use since age 18 months. \n\nFiletypes:\n\n*.avi, *_dep.bin: The PoseASL dataset has been captured by using a Kinect 2.0 RGBD camera. The output of this camera system includes multiple channels which include RGB, depth, skeleton joints (25 joints for every video frame), and HD face (1,347 points). 
The video resolution produced in 1920 x 1080 pixels for the RGB channel and 512 x 424 pixels for the depth channels respectively. Due to limitations in the acceptable filetypes for sharing on Databrary, it was not permitted to share binary *_dep.bin files directly produced by the Kinect v2 camera system on the Databrary platform. If your research requires the original binary *_dep.bin files, then please contact Matt Huenerfauth.\n\n*_face.txt, *_HDface.txt, *_skl.txt: To make it easier for future researchers to make use of this dataset, we have also performed some post-processing of the Kinect data. To extract the skeleton coordinates of the RGB videos, we used the Openpose system, which is capable of detecting body, hand, facial, and foot keypoints of multiple people on single images in real time. The output of Openpose includes estimation of 70 keypoints for the face including eyes, eyebrows, nose, mouth and face contour. The software also estimates 21 keypoints for each of the hands (Simon et al, 2017), including 3 keypoints for each finger, as shown in Figure 2. Additionally, there are 25 keypoints estimated for the body pose (and feet) (Cao et al, 2017; Wei et al, 2016).\n\nReporting Bugs or Errors:\n\nPlease contact Matt Huenerfauth to report any bugs or errors that you identify in the corpus. We appreciate your help in improving the quality of the corpus over time by identifying any errors.\n\nAcknowledgement: This material is based upon work supported by the National Science Foundation under award 1749376: "Collaborative Research: Multimethod Investigation of Articulatory and Perceptual Constraints on Natural Language Evolution.""]} 
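The per-frame keypoint files described above (*_face.txt, *_HDface.txt, *_skl.txt) lend themselves to simple array loading. The reader below is only a hypothetical sketch: it assumes one frame per line with whitespace-separated x, y, confidence triples, which may not match the corpus's actual layout; the point counts are the OpenPose counts quoted in the description.

```python
# Hypothetical reader for per-frame keypoint text files.
# The on-disk layout (one frame per line, whitespace-separated x, y, confidence
# triples) is an assumption for illustration; consult the corpus documentation
# for the actual format.
import numpy as np

def load_keypoints(path: str, num_points: int) -> np.ndarray:
    """Return an array of shape (num_frames, num_points, 3) = (x, y, confidence)."""
    frames = []
    with open(path) as f:
        for line in f:
            values = [float(v) for v in line.split()]
            if len(values) != num_points * 3:
                continue  # skip malformed or header lines
            frames.append(np.asarray(values).reshape(num_points, 3))
    return np.stack(frames) if frames else np.empty((0, num_points, 3))

# OpenPose point counts mentioned in the description:
FACE_POINTS, HAND_POINTS, BODY_POINTS = 70, 21, 25
# e.g. face = load_keypoints("P01_face.txt", FACE_POINTS)
```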
  2. Sign language is a complex visual language, and automatic interpretation of sign language can facilitate communication involving deaf individuals. As one of the essential components of sign language, fingerspelling connects natural spoken languages to sign language and expands the scale of the sign language vocabulary. In practice, it is challenging to analyze fingerspelling alphabets due to their signing speed and small motion range. The use of synthetic data has the potential to further improve the analysis of fingerspelling alphabets at scale. In this paper, we evaluate how different video-based human representations perform in a framework for Alphabet Generation for American Sign Language (ASL). We tested three mainstream video-based human representations: two-stream inflated 3D ConvNet, 3D landmarks of body joints, and rotation matrices of body joints. We also evaluated the effect of different skeleton graphs and selected body joints. The generation process of ASL fingerspelling used a transformer-based Conditional Variational Autoencoder. To train the model, we collected dynamic alphabet signing videos from 17 ASL signers. The generated alphabets were evaluated using automatic quality metrics such as FID, and we also considered supervised metrics by recognizing the generated entries using Spatio-Temporal Graph Convolutional Networks. Our experiments show that using the rotation matrices of the upper body joints and the signing hand gives the best results for the generation of ASL alphabet signing. Going forward, our goal is to produce articulated fingerspelled words by combining the individual alphabets learned in this work.
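As a rough illustration of the generative setup described above, the sketch below shows a minimal conditional VAE objective for poses represented as per-joint rotation matrices. The encoder and decoder are simple linear stand-ins (the paper uses transformer-based networks) operating per frame rather than per sequence, and the joint count, latent size, and KL weight are arbitrary illustrative choices.

```python
# Minimal sketch of a conditional VAE objective for pose generation,
# assuming rotation matrices flattened to 9 values per joint. All sizes and
# the KL weight are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseCVAE(nn.Module):
    def __init__(self, num_joints=12, num_letters=26, latent_dim=64, hidden=256):
        super().__init__()
        pose_dim = num_joints * 9                       # 3x3 rotation matrix per joint
        self.letter_emb = nn.Embedding(num_letters, hidden)
        self.encoder = nn.Linear(pose_dim + hidden, hidden)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, pose: torch.Tensor, letter: torch.Tensor) -> torch.Tensor:
        """pose: (B, num_joints*9); letter: (B,) class index. Returns training loss."""
        cond = self.letter_emb(letter)                   # (B, hidden) condition vector
        h = torch.relu(self.encoder(torch.cat([pose, cond], dim=1)))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = self.decoder(torch.cat([z, cond], dim=1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return F.mse_loss(recon, pose) + 0.1 * kl        # reconstruction + weighted KL
```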
  3. We describe our methodology for classifying ASL (American Sign Language) gestures. Rather than operate directly on raw images of hand gestures, we extract coordinates and render wireframes from individual images to construct a curated training dataset. This dataset is then used in a classifier that is memory efficient and provides effective performance (94% accuracy). Because we construct wireframes that contain information about several angles in the joints that comprise hands, our methodology is amenable to training those interested in learning ASL by identifying targeted errors in their hand gestures.
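To make the joint-angle idea concrete, here is a small hypothetical sketch that derives angle features from 2D hand keypoints and feeds them to an off-the-shelf classifier. The 21-point hand layout follows the common OpenPose/MediaPipe convention, and the specific angle set and classifier choice are assumptions, not the authors' pipeline.

```python
# Illustrative angle-based features from 2D hand keypoints plus a small
# classifier; the angle set and classifier are assumptions for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

def joint_angle(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> float:
    """Angle (radians) at point b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def hand_angle_features(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (21, 2) hand landmarks -> angles at the inner joints of each finger."""
    angles = []
    for base in (1, 5, 9, 13, 17):                      # thumb .. pinky base indices
        for j in (base + 1, base + 2):                  # two inner joints per finger
            angles.append(joint_angle(keypoints[j - 1], keypoints[j], keypoints[j + 1]))
    return np.asarray(angles)                           # 10 angle features

# Usage sketch (X_train_angles: stacked feature rows, y_train: gesture labels):
# clf = MLPClassifier(hidden_layer_sizes=(64,)).fit(X_train_angles, y_train)
```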
  4. Efthimiou, Eleni; Fotinea, Stavroula-Evita; Hanke, Thomas; Hochgesang, Julie A; Mesch, Johanna; Schulder, Marc (Ed.)
    We propose a multimodal network using skeletons and handshapes as input to recognize individual signs and detect their boundaries in American Sign Language (ASL) videos. Our method integrates a spatio-temporal Graph Convolutional Network (GCN) architecture to estimate human skeleton keypoints; it uses a late-fusion approach for both forward and backward processing of video streams. Our core method is designed for the extraction and analysis of features from ASL videos, to enhance the accuracy and efficiency of recognition of individual signs. A gating module based on per-channel multi-layer convolutions is employed to identify the frames most significant for recognition of isolated signs. Additionally, an auxiliary multimodal branch network, integrated with a transformer, is designed to estimate the linguistic start and end frames of an isolated sign within a video clip. We evaluated the performance of our approach on multiple datasets that include isolated, citation-form signs and signs pre-segmented from continuous signing based on linguistic annotations of the start and end points of signs within sentences. We achieved very promising results when using both types of sign videos combined for training, with overall sign recognition accuracy of 80.8% Top-1 and 95.2% Top-5 for citation-form signs, and 80.4% Top-1 and 93.0% Top-5 for signs pre-segmented from continuous signing.
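The sketch below gives a rough picture of the bidirectional late-fusion and frame-gating ideas mentioned above: the same backbone scores the original and the time-reversed skeleton sequence, and the two predictions are averaged. The backbone is a placeholder (the paper uses a spatio-temporal GCN), and the simple averaging and the small convolutional gate are illustrative assumptions rather than the paper's exact modules.

```python
# Illustrative forward/backward late fusion with a per-channel frame gate.
# The backbone, the gate design, and averaging fusion are assumptions.
import torch
import torch.nn as nn

class GatedBidirectionalRecognizer(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                 # maps (B, T, D) -> (B, T, feat_dim)
        # Per-channel gating over frames via 1D convolutions along time.
        self.gate = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def _predict(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)                 # (B, T, feat_dim)
        gates = self.gate(feats.transpose(1, 2)).transpose(1, 2)
        pooled = (feats * gates).mean(dim=1)     # weight informative frames
        return self.head(pooled)

    def forward(self, skel: torch.Tensor) -> torch.Tensor:
        forward_logits = self._predict(skel)
        backward_logits = self._predict(torch.flip(skel, dims=[1]))  # reversed time
        return (forward_logits + backward_logits) / 2                # late fusion
```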
  5. The American Sign Language Linguistic Research Project (ASLLRP) provides Internet access to high-quality ASL video data, generally including front and side views and a close-up of the face. The manual and non-manual components of the signing have been linguistically annotated using SignStream(R). The recently expanded video corpora can be browsed and searched through the Data Access Interface (DAI 2) we have designed; it is possible to carry out complex searches. The data from our corpora can also be downloaded; annotations are available in an XML export format. We have also developed the ASLLRP Sign Bank, which contains almost 6,000 sign entries for lexical signs, each with a distinct English-based gloss, and a total of 41,830 examples of lexical signs (in addition to about 300 gestures, over 1,000 fingerspelled signs, and 475 classifier examples). The Sign Bank is likewise accessible and searchable on the Internet; it can also be accessed from within SignStream(R) (software to facilitate linguistic annotation and analysis of visual language data) to make annotations more accurate and efficient. Here we describe the available resources. These data have been used for many types of research in linguistics and in computer-based sign language recognition from video; examples of such research are provided in the latter part of this article.