
Title: Deep Learning Methods for Sign Language Translation
Many sign languages are bona fide natural languages with their own grammatical rules and lexicons, and hence can benefit from machine translation methods. Because sign language is also a visual-spatial language, it can likewise benefit from computer vision methods for encoding it. With the advent of deep learning in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision (specifically image and video captioning). Researchers have therefore begun extending these learning methods to sign language understanding. Sign language interpretation is especially challenging because it involves a continuous visual-spatial modality in which meaning is often derived from context. The focus of this article is therefore to examine various deep learning–based methods for encoding sign language as input, and to analyze the efficacy of several machine translation methods over three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information. To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. These input features include body and finger joints, facial points, and vector representations/embeddings from convolutional neural networks. The machine translation models explored include several baseline sequence-to-sequence approaches as well as more complex networks using attention, reinforcement learning, and the transformer model. We implement the translation methods over multiple sign languages: German (GSL), American (ASL), and Chinese (CSL). From our analysis, the transformer model combined with input embeddings from ResNet50 or pose-based landmark features outperformed all the other sequence-to-sequence models, achieving higher BLEU-2 through BLEU-4 scores on the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the less controlled ASL and CSL datasets.
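To make the pipeline concrete, the following is a minimal PyTorch sketch of the kind of model the abstract describes: per-frame pose landmarks or ResNet50 embeddings projected into a transformer encoder-decoder that emits spoken-language tokens. The feature dimension, vocabulary size, layer counts, and class name are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn

class SignTranslationTransformer(nn.Module):
    """Sketch only: per-frame visual features (pose keypoints or ResNet50 embeddings) -> text tokens."""
    def __init__(self, feat_dim=274, d_model=256, vocab_size=3000):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)         # project each frame's features to model width
        self.tgt_embed = nn.Embedding(vocab_size, d_model)     # spoken-language token embeddings
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)              # logits over the target vocabulary

    def forward(self, frame_feats, tgt_tokens):
        # frame_feats: (batch, num_frames, feat_dim); tgt_tokens: (batch, tgt_len)
        src = self.frame_proj(frame_feats)
        tgt = self.tgt_embed(tgt_tokens)
        causal = self.transformer.generate_square_subsequent_mask(tgt_tokens.size(1))
        dec = self.transformer(src, tgt, tgt_mask=causal)      # autoregressive decoding via causal mask
        return self.out(dec)

# Example: a batch of 2 clips, 100 frames each, with 274-dimensional keypoint vectors (assumed sizes).
model = SignTranslationTransformer()
logits = model(torch.randn(2, 100, 274), torch.randint(0, 3000, (2, 12)))   # (2, 12, 3000)

Swapping keypoint features for ResNet50 embeddings only changes feat_dim; the translation head stays the same, which is what makes feature ablations of the kind described in the abstract straightforward.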
Award ID(s):
1846076
PAR ID:
10321194
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM Transactions on Accessible Computing
Volume:
14
Issue:
4
ISSN:
1936-7228
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. While a significant amount of work has been done on the commonly used, tightly-constrained, weather-based German Sign Language (GSL) dataset, little has been done for continuous sign language translation (SLT) in more realistic settings, including American Sign Language (ASL) translation. Also, while CNN-based features have consistently been shown to work well on the GSL dataset, it is not clear whether such features will work as well in more realistic settings with more heterogeneous signers and non-uniform backgrounds. To this end, in this work, we introduce a new, realistic phrase-level ASL dataset (ASLing) and explore the role of different types of visual features (CNN embeddings, human body keypoints, and optical flow vectors) in translating it to spoken American English. We propose a novel Transformer-based visual feature learning method for ASL translation. We demonstrate the explainability of our proposed learning methods by visualizing activation weights under various input conditions and discover that the body keypoints are consistently the most reliable set of input features. Using our model, we successfully transfer-learn from the larger GSL dataset to ASLing, resulting in significant BLEU score improvements. In summary, this work goes a long way toward bringing together the AI resources required for automated ASL translation in unconstrained environments.
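As a concrete illustration of the GSL-to-ASLing transfer learning this abstract mentions, here is a hedged sketch that reuses the SignTranslationTransformer sketch above: GSL-pretrained weights initialize the model, the visual projection layer is frozen, and the remaining layers are fine-tuned on ASLing. The checkpoint filename, learning rate, padding-token id, and freezing choice are illustrative assumptions.

import torch
import torch.nn as nn

model = SignTranslationTransformer()
# Initialize from weights pretrained on the larger GSL corpus (hypothetical checkpoint name).
model.load_state_dict(torch.load("gsl_pretrained.pt"), strict=False)

# One plausible choice: freeze the visual projection and fine-tune only the translation layers.
for p in model.frame_proj.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=0)    # assumes id 0 is the padding token

def fine_tune_step(frame_feats, tgt_tokens):
    """One teacher-forced fine-tuning step on an ASLing batch."""
    logits = model(frame_feats, tgt_tokens[:, :-1])            # predict the next token at each position
    loss = criterion(logits.reshape(-1, logits.size(-1)),      # (batch * len, vocab)
                     tgt_tokens[:, 1:].reshape(-1))            # shifted targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()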
  2. The role of a sign interpreting agent is to bridge the communication gap between the hearing and Deaf or Hard of Hearing communities by translating both from sign language to text and from text to sign language. Until now, much of the AI work in automated sign language processing has focused primarily on sign language to text translation, which puts the advantage mainly on the side of hearing individuals. In this work, we describe advances in sign language processing based on transformer networks. Specifically, we introduce SignNet II, a sign language processing architecture and a promising step toward facilitating two-way sign language communication. It comprises sign-to-text and text-to-sign networks jointly trained using a dual learning mechanism. Furthermore, by exploiting the notion of sign similarity, a metric embedding learning process is introduced to enhance text-to-sign translation performance. Using a bank of multi-feature transformers, we analyzed several input feature representations and discovered that keypoint-based pose features consistently performed well, irrespective of the quality of the input videos. We demonstrated that the two jointly trained networks outperformed their singly trained counterparts, showing noteworthy improvements in BLEU-1 through BLEU-4 scores when tested on the largest available German Sign Language (GSL) benchmark dataset.
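A schematic sketch of the joint (dual) training idea follows, reusing the SignTranslationTransformer sketch from earlier; the reverse TextToPoseTransformer, the pose-reconstruction loss, the fixed frame count, and the loss weighting are illustrative assumptions and not SignNet II's actual formulation.

import torch
import torch.nn as nn

class TextToPoseTransformer(nn.Module):
    """Hypothetical reverse network for the sketch: token sequence -> per-frame keypoint features."""
    def __init__(self, vocab_size=3000, d_model=256, feat_dim=274, num_frames=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.to_pose = nn.Linear(d_model, feat_dim)
        self.num_frames = num_frames                           # the sketch assumes a fixed clip length

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))                   # (batch, tgt_len, d_model)
        pooled = h.mean(dim=1, keepdim=True)                   # crude pooling, for illustration only
        return self.to_pose(pooled.expand(-1, self.num_frames, -1))   # (batch, frames, feat_dim)

sign2text = SignTranslationTransformer()                       # pose features -> spoken-language tokens
text2sign = TextToPoseTransformer()                            # tokens -> pose features (reverse direction)
ce, mse = nn.CrossEntropyLoss(ignore_index=0), nn.MSELoss()
opt = torch.optim.Adam(list(sign2text.parameters()) + list(text2sign.parameters()), lr=1e-4)

def dual_step(frame_feats, tgt_tokens, lam=0.5):
    """One joint step: both directions share an optimizer and a weighted combined loss."""
    text_logits = sign2text(frame_feats, tgt_tokens[:, :-1])
    loss_s2t = ce(text_logits.reshape(-1, text_logits.size(-1)), tgt_tokens[:, 1:].reshape(-1))
    loss_t2s = mse(text2sign(tgt_tokens), frame_feats)          # reconstruct the keypoint sequence
    loss = loss_s2t + lam * loss_t2s                            # coupling the two networks is the "dual" part
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()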
  3. Calzolari, Nicoletta; Huang, Chu-Ren; Kim, Hansaem; Pustejovsky, James; Wanner, Leo; Choi, Key-Sun; Ryu, Pum-Mo; Chen, Hsin-Hsi; Donatelli, Lucia; Ji, Heng (Ed.)
    "Multilingual neural machine translation (MNMT) jointly trains a shared model for translation with multiple language pairs. However, traditional subword-based MNMT approaches suffer from out-of-vocabulary (OOV) issues and representation bottleneck, which often degrades translation performance on certain language pairs. While byte tokenization is used to tackle the OOV problems in neural machine translation (NMT), until now its capability has not been validated in MNMT. Additionally, existing work has not studied how byte encoding can benefit endangered language translation to our knowledge. We propose a byte-based multilingual neural machine translation system (BMNMT) to alleviate the representation bottleneck and improve translation performance in endangered languages. Furthermore, we design a random byte mapping method with an ensemble prediction to enhance our model robustness. Experimental results show that our BMNMT consistently and significantly outperforms subword/word-based baselines on twelve language pairs up to +18.5 BLEU points, an 840{\%} relative improvement.", 
  4.
    RF-sensing-based human activity and hand gesture recognition (HGR) methods have gained enormous popularity with the development of small-package, high-frequency radar systems and powerful machine learning tools. However, most HGR experiments in the literature have been conducted on individual gestures in isolation from preceding and subsequent motions. This paper considers the problem of American Sign Language (ASL) recognition in the context of daily living, which involves sequential classification of a continuous stream of signing mixed with daily activities. In particular, this paper investigates the efficacy of different RF input representations and fusion techniques for ASL and trigger-gesture recognition tasks in a daily living scenario, which can potentially be used for sign-language-sensitive human-computer interfaces (HCI). The proposed approach first detects and segments periods of motion, followed by feature-level fusion of the range-Doppler map, micro-Doppler spectrogram, and envelope for classification with a bidirectional long short-term memory (BiLSTM) recurrent neural network. Results show 93.3% accuracy in identifying 6 activities and 4 ASL signs, as well as a trigger sign detection rate of 0.93.
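The fusion-then-BiLSTM classification stage described above might look roughly like the following sketch, where the three RF representations are concatenated per time step and a bidirectional LSTM classifies the segmented motion period; the per-frame feature sizes, hidden width, and class count are illustrative assumptions rather than the paper's settings.

import torch
import torch.nn as nn

class FusionBiLSTM(nn.Module):
    """Sketch: feature-level fusion of range-Doppler, micro-Doppler, and envelope features, then a BiLSTM."""
    def __init__(self, rd_dim=64, md_dim=64, env_dim=1, hidden=128, num_classes=10):
        super().__init__()
        fused_dim = rd_dim + md_dim + env_dim                   # fusion here is simple concatenation
        self.bilstm = nn.LSTM(fused_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)    # 2x hidden: forward + backward states

    def forward(self, rd, md, env):
        # each input: (batch, time_steps, per-frame feature dim) for one segmented motion period
        x = torch.cat([rd, md, env], dim=-1)
        out, _ = self.bilstm(x)
        return self.classifier(out[:, -1])                      # classify from the last time step

# Example: one 50-step motion segment; 10 classes could cover 6 daily activities + 4 ASL signs.
model = FusionBiLSTM()
logits = model(torch.randn(1, 50, 64), torch.randn(1, 50, 64), torch.randn(1, 50, 1))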
  5. We are releasing a dataset containing videos of both fluent and non-fluent signers using American Sign Language (ASL), collected with a Kinect v2 sensor. The dataset was collected as part of a project to develop and evaluate computer vision algorithms that support new technologies for automatic detection of ASL fluency attributes. A total of 45 fluent and non-fluent participants were asked to perform signing homework assignments similar to those used in introductory or intermediate-level ASL courses. The data is annotated to identify several aspects of signing, including grammatical features and non-manual markers. Sign language recognition is currently very data-driven, and this dataset can support the design of recognition technologies, especially technologies that can benefit ASL learners. The dataset may also interest ASL education researchers who want to contrast fluent and non-fluent signing.