skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Improving Sign Recognition with Phonology
We use insights from research on American Sign Language (ASL) phonology to train models for isolated sign language recognition (ISLR), a step towards automatic sign language understanding. Our key insight is to explicitly recognize the role of phonology in sign production to achieve more accurate ISLR than existing work which does not consider sign language phonology. We train ISLR models that take in pose estimations of a signer producing a single sign to predict not only the sign but additionally its phonological characteristics, such as the handshape. These auxiliary predictions lead to a nearly 9% absolute gain in sign recognition accuracy on the WLASL benchmark, with consistent improvements in ISLR regardless of the underlying prediction model architecture. This work has the potential to accelerate linguistic research in the domain of signed languages and reduce communication barriers between deaf and hearing people.  more » « less
Award ID(s):
1918556
PAR ID:
10481697
Author(s) / Creator(s):
; ;
Editor(s):
Vlachos, Andreas; Augenstein, Isabelle
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Journal Name:
Association for Computational Linguistics
Page Range / eLocation ID:
2732 to 2737
Subject(s) / Keyword(s):
American Sign Language, ASL, computational linguistics, isolated sign recognition, machine learning
Format(s):
Medium: X
Location:
Dubrovnik, Croatia
Sponsoring Org:
National Science Foundation
More Like this
  1. Research shows that insufficient language access in early childhood significantly affects language processing. While the majority of this work focuses on syntax, phonology also appears to be affected, though it is unclear exactly how. Here we investigated phonological production across age of acquisition of American Sign Language (ASL). Participants were deaf adult signers who first learned ASL at ages ranging from birth to 14 years and they performed both lexical decisions and repetitions of ASL signs and pseudosigns. Because phonological production has been understudied across age of acquisition, we were particularly interested in production accuracy for the sublexical phonological parameters of handshape, movement, and location. Lexical decision responses were slower and more accurate for impossible pseudosigns compared with possible pseudosigns, indicating participants were sensitive to ASL phonological structure regardless of age of acquisition. Despite this, age of acquisition affected repetition accuracy. Handshape errors were highest for those with earlier ages of acquisition, but movement errors were highest for those with later ages of acquisition, though this effect of age of acquisition was only seen for real ASL signs and not pseudosigns. The parameter error pattern for pseudosigns was not affected by age of acquisition. These results indicate that later age of acquisition does not inhibit the ability to produce ASL phonology but ultimately alters the processing of the phonological parameters when meaning and phonology are integrated. 
    more » « less
  2. Several attacks have been proposed against autonomous vehicles and their subsystems that are powered by machine learning (ML). Road sign recognition models are especially heavily tested under various adversarial ML attack settings, and they have proven to be vulnerable. Despite the increasing research on adversarial ML attacks against road sign recognition models, there is little to no focus on defending against these attacks. In this paper, we propose the first defense method specifically designed for autonomous vehicles to detect adversarial ML attacks targeting road sign recognition models, which is called ViLAS (Vision-Language Model for Adversarial Traffic Sign Detection). The proposed defense method is based on a custom, fast, lightweight, and salable vision-language model (VLM) and is compatible with any existing traffic sign recognition system. Thanks to the orthogonal information coming from the class label text data through the language model, ViLAS leverages image context in addition to visual data for highly effective attack detection performance. In our extensive experiments, we show that our method consistently detects various attacks against different target models with high true positive rates while satisfying very low false positive rates. When tested against four state-of-the-art attacks targeting four popular action recognition models, our proposed detector achieves an average AUC of 0.94. This result achieves a 25.3% improvement over a state-of-the-art defense method proposed for generic image attack detection, which attains an average AUC of 0.75. We also show that our custom VLM is more suitable for an autonomous vehicle compared to the popular off-the-shelf VLM and CLIP in terms of speed (4.4 vs. 9.3 milliseconds), space complexity (0.36 vs. 1.6 GB), and performance (0.94 vs. 0.43 average AUC). 
    more » « less
  3. We present a new approach for isolated sign recognition, which combines a spatial-temporal Graph Convolution Network (GCN) architecture for modeling human skeleton keypoints with late fusion of both the forward and backward video streams, and we explore the use of curriculum learning. We employ a type of curriculum learning that dynamically estimates, during training, the order of difficulty of each input video for sign recognition; this involves learning a new family of data parameters that are dynamically updated during training. The research makes use of a large combined video dataset for American Sign Language (ASL), including data from both the American Sign Language Lexicon Video Dataset (ASLLVD) and the Word-Level American Sign Language (WLASL) dataset, with modified gloss labeling of the latter—to ensure 1-1 correspondence between gloss labels and distinct sign productions, as well as consistency in gloss labeling across the two datasets. This is the first time that these two datasets have been used in combination for isolated sign recognition research. We also compare the sign recognition performance on several different subsets of the combined dataset, varying in, e.g., the minimum number of samples per sign (and therefore also in the total number of sign classes and video examples). 
    more » « less
  4. We present a new approach for isolated sign recognition, which combines a spatial-temporal Graph Convolution Network (GCN) architecture for modeling human skeleton keypoints with late fusion of both the forward and backward video streams, and we explore the use of curriculum learning. We employ a type of curriculum learning that dynamically estimates, during training, the order of difficulty of each input video for sign recognition; this involves learning a new family of data parameters that are dynamically updated during training. The research makes use of a large combined video dataset for American Sign Language (ASL), including data from both the American Sign Language Lexicon Video Dataset (ASLLVD) and the Word-Level American Sign Language (WLASL) dataset, with modified gloss labeling of the latter—to ensure 1-1 correspondence between gloss labels and distinct sign productions, as well as consistency in gloss labeling across the two datasets. This is the first time that these two datasets have been used in combination for isolated sign recognition research. We also compare the sign recognition performance on several different subsets of the combined dataset, varying in, e.g., the minimum number of samples per sign (and therefore also in the total number of sign classes and video examples). 
    more » « less
  5. Like speech, signs are composed of discrete, recombinable features called phonemes. Prior work shows that models which can recognize phonemes are better at sign recognition, motivating deeper exploration into strategies for modeling sign language phonemes. In this work, we learn graph convolution networks to recognize the sixteen phoneme “types” found in ASL-LEX2.0. Specifically, we explore how learning strategies like multi-task and curriculum learning can leverage mutually useful information between phoneme types to facilitate the remodeling of sign language phonemes. Results on the Sem-Lex Benchmark show that curriculum learning yields an average accuracy of 87% across all phoneme types, outperforming fine-tuning and multi-task strategies for most phonemetypes. 
    more » « less