skip to main content

Title: Deep Learning Methods for Sign Language Translation
Many sign languages are bona fide natural languages with grammatical rules and lexicons hence can benefit from machine translation methods. Similarly, since sign language is a visual-spatial language, it can also benefit from computer vision methods for encoding it. With the advent of deep learning methods in recent years, significant advances have been made in natural language processing (specifically neural machine translation) and in computer vision methods (specifically image and video captioning). Researchers have therefore begun expanding these learning methods to sign language understanding. Sign language interpretation is especially challenging, because it involves a continuous visual-spatial modality where meaning is often derived based on context. The focus of this article, therefore, is to examine various deep learning–based methods for encoding sign language as inputs, and to analyze the efficacy of several machine translation methods, over three different sign language datasets. The goal is to determine which combinations are sufficiently robust for sign language translation without any gloss-based information. To understand the role of the different input features, we perform ablation studies over the model architectures (input features + neural translation models) for improved continuous sign language translation. These input features include body and finger joints, facial points, as well as vector representations/embeddings from convolutional neural networks. The machine translation models explored include several baseline sequence-to-sequence approaches, more complex and challenging networks using attention, reinforcement learning, and the transformer model. We implement the translation methods over multiple sign languages—German (GSL), American (ASL), and Chinese sign languages (CSL). From our analysis, the transformer model combined with input embeddings from ResNet50 or pose-based landmark features outperformed all the other sequence-to-sequence models by achieving higher BLEU2-BLEU4 scores when applied to the controlled and constrained GSL benchmark dataset. These combinations also showed significant promise on the other less controlled ASL and CSL datasets.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
ACM Transactions on Accessible Computing
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. While a significant amount of work has been done on the commonly used, tightly -constrained weather-based, German sign language (GSL) dataset, little has been done for continuous sign language translation (SLT) in more realistic settings, including American sign language (ASL) translation. Also, while CNN - based features have been consistently shown to work well on the GSL dataset, it is not clear whether such features will work as well in more realistic settings when there are more heterogeneous signers in non-uniform backgrounds. To this end, in this work, we introduce a new, realistic phrase-level ASL dataset (ASLing), and explore the role of different types of visual features (CNN embeddings, human body keypoints, and optical flow vectors) in translating it to spoken American English. We propose a novel Transformer-based, visual feature learning method for ASL translation. We demonstrate the explainability efficacy of our proposed learning methods by visualizing activation weights under various input conditions and discover that the body keypoints are consistently the most reliable set of input features. Using our model, we successfully transfer-learn from the larger GSL dataset to ASLing, resulting in significant BLEU score improvements. In summary, this work goes a long way in bringing together the AI resources required for automated ASL translation in unconstrained environments. 
    more » « less
  2. Raynal, Ann M. ; Ranney, Kenneth I. (Ed.)
    Most research in technologies for the Deaf community have focused on translation using either video or wearable devices. Sensor-augmented gloves have been reported to yield higher gesture recognition rates than camera-based systems; however, they cannot capture information expressed through head and body movement. Gloves are also intrusive and inhibit users in their pursuit of normal daily life, while cameras can raise concerns over privacy and are ineffective in the dark. In contrast, RF sensors are non-contact, non-invasive and do not reveal private information even if hacked. Although RF sensors are unable to measure facial expressions or hand shapes, which would be required for complete translation, this paper aims to exploit near real-time ASL recognition using RF sensors for the design of smart Deaf spaces. In this way, we hope to enable the Deaf community to benefit from advances in technologies that could generate tangible improvements in their quality of life. More specifically, this paper investigates near real-time implementation of machine learning and deep learning architectures for the purpose of sequential ASL signing recognition. We utilize a 60 GHz RF sensor which transmits a frequency modulation continuous wave (FMWC waveform). RF sensors can acquire a unique source of information that is inaccessible to optical or wearable devices: namely, a visual representation of the kinematic patterns of motion via the micro-Doppler signature. Micro-Doppler refers to frequency modulations that appear about the central Doppler shift, which are caused by rotational or vibrational motions that deviate from principle translational motion. In prior work, we showed that fractal complexity computed from RF data could be used to discriminate signing from daily activities and that RF data could reveal linguistic properties, such as coarticulation. We have also shown that machine learning can be used to discriminate with 99% accuracy the signing of native Deaf ASL users from that of copysigning (or imitation signing) by hearing individuals. Therefore, imitation signing data is not effective for directly training deep models. But, adversarial learning can be used to transform imitation signing to resemble native signing, or, alternatively, physics-aware generative models can be used to synthesize ASL micro-Doppler signatures for training deep neural networks. With such approaches, we have achieved over 90% recognition accuracy of 20 ASL signs. In natural environments, however, near real-time implementations of classification algorithms are required, as well as an ability to process data streams in a continuous and sequential fashion. In this work, we focus on extensions of our prior work towards this aim, and compare the efficacy of various approaches for embedding deep neural networks (DNNs) on platforms such as a Raspberry Pi or Jetson board. We examine methods for optimizing the size and computational complexity of DNNs for embedded micro-Doppler analysis, methods for network compression, and their resulting sequential ASL recognition performance. 
    more » « less
  3. The role of a sign interpreting agent is to bridge the communication gap between the hearing-only and Deaf or Hard of Hearing communities by translating both from sign language to text and from text to sign language. Until now, much of the AI work in automated sign language processing has focused primarily on sign language to text translation, which puts the advantage mainly on the side of hearing individuals. In this work, we describe advances in sign language processing based on transformer networks. Specifically, we introduce SignNet II, a sign language processing architecture, a promising step towards facilitating two-way sign language communication. It is comprised of sign-to-text and text-to-sign networks jointly trained using a dual learning mechanism. Furthermore, by exploiting the notion of sign similarity, a metric embedding learning process is introduced to enhance the text-to-sign translation performance. Using a bank of multi-feature transformers, we analyzed several input feature representations and discovered that keypoint-based pose features consistently performed well, irrespective of the quality of the input videos. We demonstrated that the two jointly trained networks outperformed their singly-trained counterparts, showing noteworthy enhancements in BLEU-1 - BLEU-4 scores when tested on the largest available German Sign Language (GSL) benchmark dataset. 
    more » « less
  4. null (Ed.)
    RF sensing based human activity and hand gesture recognition (HGR) methods have gained enormous popularity with the development of small package, high frequency radar systems and powerful machine learning tools. However, most HGR experiments in the literature have been conducted on individual gestures and in isolation from preceding and subsequent motions. This paper considers the problem of American sign language (ASL) recognition in the context of daily living, which involves sequential classification of a continuous stream of signing mixed with daily activities. In particular, this paper investigates the efficacy of different RF input representations and fusion techniques for ASL and trigger gesture recognition tasks in a daily living scenario, which can be potentially used for sign language sensitive human-computer interfaces (HCI). The proposed approach involves first detecting and segmenting periods of motion, followed by feature level fusion of the range-Doppler map, micro-Doppler spectrogram, and envelope for classification with a bi-directional long short-term memory (BiL-STM) recurrent neural network. Results show 93.3% accuracy in identification of 6 activities and 4 ASL signs, as well as a trigger sign detection rate of 0.93. 
    more » « less
  5. Objectively differentiating patient mental states based on electrical activity, as opposed to overt behavior, is a fundamental neuroscience problem with medical applications, such as identifying patients in locked-in state vs. coma. Electroencephalography (EEG), which detects millisecond-level changes in brain activity across a range of frequencies, allows for assessment of external stimulus processing by the brain in a non-invasive manner. We applied machine learning methods to 26-channel EEG data of 24 fluent Deaf signers watching videos of sign language sentences (comprehension condition), and the same videos reversed in time (non-comprehension condition), to objectively separate vision-based high-level cognition states. While spectrotemporal parameters of the stimuli were identical in comprehension vs. non-comprehension conditions, the neural responses of participants varied based on their ability to linguistically decode visual data. We aimed to determine which subset of parameters (specific scalp regions or frequency ranges) would be necessary and sufficient for high classification accuracy of comprehension state. Optical flow, characterizing distribution of velocities of objects in an image, was calculated for each pixel of stimulus videos using MATLAB Vision toolbox. Coherence between optical flow in the stimulus and EEG neural response (per video, per participant) was then computed using canonical component analysis with NoiseTools toolbox. Peak correlations were extracted for each frequency for each electrode, participant, and video. A set of standard ML algorithms were applied to the entire dataset (26 channels, frequencies from .2 Hz to 12.4 Hz, binned in 1 Hz increments), with consistent out-of-sample 100% accuracy for frequencies in .2-1 Hz range for all regions, and above 80% accuracy for frequencies < 4 Hz. Sparse Optimal Scoring (SOS) was then applied to the EEG data to reduce the dimensionality of the features and improve model interpretability. SOS with elastic-net penalty resulted in out-of-sample classification accuracy of 98.89%. The sparsity pattern in the model indicated that frequencies between 0.2–4 Hz were primarily used in the classification, suggesting that underlying data may be group sparse. Further, SOS with group lasso penalty was applied to regional subsets of electrodes (anterior, posterior, left, right). All trials achieved greater than 97% out-of-sample classification accuracy. The sparsity patterns from the trials using 1 Hz bins over individual regions consistently indicated frequencies between 0.2–1 Hz were primarily used in the classification, with anterior and left regions performing the best with 98.89% and 99.17% classification accuracy, respectively. While the sparsity pattern may not be the unique optimal model for a given trial, the high classification accuracy indicates that these models have accurately identified common neural responses to visual linguistic stimuli. Cortical tracking of spectro-temporal change in the visual signal of sign language appears to rely on lower frequencies proportional to the N400/P600 time-domain evoked response potentials, indicating that visual language comprehension is grounded in predictive processing mechanisms. 
    more » « less