People learning American Sign Language (ASL) and practicing their comprehension skills will often encounter complex ASL videos that may contain unfamiliar signs. Existing dictionary tools require users to isolate a single unknown sign before initiating a search by selecting linguistic properties or performing the sign in front of a webcam. This process is challenging: extracting and reproducing unfamiliar signs is difficult, the video-watching experience is disrupted, and learners must rely on external dictionaries. We explore a technology that allows users to select and view dictionary results for one or more unfamiliar signs while watching a video. We interviewed 14 ASL learners to understand their challenges in understanding ASL videos, their strategies for dealing with unfamiliar vocabulary, and their expectations for an in situ dictionary system. We then conducted an in-depth analysis with eight learners to examine their interactions with a Wizard-of-Oz prototype during a video comprehension task. Finally, we conducted a comparative study with six additional ASL learners to evaluate the speed, accuracy, and workload benefits of an embedded dictionary-search feature within a video player. Our tool outperformed a baseline, an existing online dictionary, across all three metrics. The integration of a search tool and span selection offered advantages for video comprehension. Our findings have implications for designers, computer vision researchers, and sign language educators.
ASL Citizen: A Community-Sourced Dataset for Advancing Isolated Sign Language Recognition
Sign languages are used as a primary language by approximately 70 million D/deaf people worldwide. However, most communication technologies operate in spoken and written languages, creating inequities in access. To help tackle this problem, we release ASL Citizen, the first crowdsourced Isolated Sign Language Recognition (ISLR) dataset, collected with consent and containing 83,399 videos for 2,731 distinct signs filmed by 52 signers in a variety of environments. We propose that this dataset be used for sign language dictionary retrieval for American Sign Language (ASL), where a user demonstrates a sign to their webcam to retrieve matching signs from a dictionary. Through our generalizable baselines, we show that training supervised machine learning classifiers with our dataset achieves competitive performance on metrics relevant for dictionary retrieval, with 63% accuracy and a recall-at-10 of 91%, evaluated entirely on videos of users who are not present in the training or validation sets.
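The dictionary-retrieval metrics quoted above (accuracy and recall-at-10) are standard top-k statistics over a classifier's scores for the 2,731-sign vocabulary. The sketch below is a minimal illustration of how such metrics could be computed; it is not the authors' evaluation code, and all function and variable names are hypothetical.

```python
import numpy as np

def retrieval_metrics(scores, labels, k=10):
    """Top-1 accuracy and recall-at-k for dictionary retrieval.

    scores: (num_videos, num_signs) array of classifier scores,
            one row per query video, one column per dictionary sign.
    labels: (num_videos,) array holding the index of the correct sign.
    """
    # Rank dictionary signs from highest to lowest score for each query.
    ranked = np.argsort(-scores, axis=1)
    top1 = ranked[:, 0] == labels
    # Recall-at-k: the correct sign appears anywhere in the top k results.
    topk = (ranked[:, :k] == labels[:, None]).any(axis=1)
    return top1.mean(), topk.mean()

# Toy usage with random scores over a 2,731-sign vocabulary.
rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 2731))
labels = rng.integers(0, 2731, size=100)
accuracy, recall_at_10 = retrieval_metrics(scores, labels, k=10)
print(f"accuracy={accuracy:.2f}, recall@10={recall_at_10:.2f}")
```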
- Award ID(s): 2234787
- PAR ID: 10535436
- Publisher / Repository: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks
- Date Published:
- Format(s): Medium: X
- Location: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks
- Sponsoring Org: National Science Foundation
More Like this
- In this paper, we propose a machine learning-based multi-stream framework to recognize American Sign Language (ASL) manual signs and nonmanual gestures (face and head movements) in real time from RGB-D videos. Our approach is based on 3D Convolutional Neural Networks (3D CNNs), fusing multi-modal features including hand gestures, facial expressions, and body poses from multiple channels (RGB, Depth, Motion, and Skeleton joints). To learn the overall temporal dynamics in a video, a proxy video is generated by selecting a subset of frames for each video, which are then used to train the proposed 3D CNN model (a toy sketch of this frame-selection step appears after this list). We collected a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera. Each video consists of 100 ASL manual signs, along with the RGB channel, depth maps, skeleton joints, face features, and HD face. The dataset is fully annotated for each semantic region (i.e., the time duration of each sign that the human signer performs). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL sign glosses in our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on the large-scale ChaLearn IsoGD dataset, achieving state-of-the-art results.
- Sign words are the building blocks of any sign language. In this work, we present wSignGen, a word-conditioned 3D American Sign Language (ASL) generation model dedicated to synthesizing realistic and grammatically accurate motion sequences for sign words. Our approach leverages a transformer-based diffusion model trained on a curated dataset of 3D motion meshes from word-level ASL videos. By integrating CLIP, wSignGen offers two advantages: image-based generation, which is particularly useful for children who are learning sign language but cannot yet read, and the ability to generalize to unseen synonyms (a sketch of the shared CLIP embedding space behind both advantages appears after this list). Experiments demonstrate that wSignGen significantly outperforms the baseline model on the sign-word generation task. Moreover, human evaluation shows that wSignGen can generate high-quality, grammatically correct ASL signs that are effectively conveyed through 3D avatars.
- Picture-naming tasks provide critical data for theories of lexical representation and retrieval and have been performed successfully in sign languages. However, the specific influences of lexical or phonological factors and stimulus properties on sign retrieval are poorly understood. To examine lexical retrieval in American Sign Language (ASL), we conducted a timed picture-naming study using 524 pictures (272 objects and 251 actions). We also compared ASL naming with previous data for spoken English for a subset of 425 pictures. Deaf ASL signers named object pictures faster and more consistently than action pictures, as previously reported for English speakers. Lexical frequency, iconicity, better name agreement, and lower phonological complexity each facilitated naming reaction times (RTs). RTs were also faster for pictures named with shorter signs (measured by average response duration). Target name agreement was higher for pictures with more iconic and shorter ASL names. The visual complexity of pictures slowed RTs and decreased target name agreement (a toy regression illustrating this kind of item-level analysis appears after this list). RTs and target name agreement were correlated for ASL and English, but agreement was lower for ASL, possibly due to the English bias of the pictures. RTs were faster for ASL, which we attributed to a smaller lexicon. Overall, the results suggest that models of lexical retrieval developed for spoken languages can be adopted for signed languages, with the exception that iconicity should be included as a factor. The open-source picture-naming data set for ASL serves as an important, first-of-its-kind resource for researchers, educators, or clinicians for a variety of research, instructional, or assessment purposes.
- Efthimiou, Eleni; Fotinea, Stavroula-Evita; Hanke, Thomas; Hochgesang, Julie A.; Mesch, Johanna; Schulder, Marc (Eds.) We propose a multimodal network using skeletons and handshapes as input to recognize individual signs and detect their boundaries in American Sign Language (ASL) videos. Our method integrates a spatio-temporal Graph Convolutional Network (GCN) architecture to estimate human skeleton keypoints; it uses a late-fusion approach for both forward and backward processing of video streams (a minimal sketch of score-level late fusion appears after this list). Our core method is designed to extract and analyze features from ASL videos to enhance the accuracy and efficiency of recognizing individual signs. A gating module based on per-channel multi-layer convolutions is employed to evaluate significant frames for recognition of isolated signs. Additionally, an auxiliary multimodal branch network, integrated with a transformer, is designed to estimate the linguistic start and end frames of an isolated sign within a video clip. We evaluated the performance of our approach on multiple datasets that include isolated, citation-form signs and signs pre-segmented from continuous signing based on linguistic annotations of the start and end points of signs within sentences. We achieved very promising results when both types of sign videos were combined for training, with overall sign recognition accuracy of 80.8% Top-1 and 95.2% Top-5 for citation-form signs, and 80.4% Top-1 and 93.0% Top-5 for signs pre-segmented from continuous signing.
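For the multi-stream 3D CNN paper above, the proxy video is described as a fixed subset of frames chosen to capture a clip's overall temporal dynamics. Below is a minimal sketch of one plausible frame-selection strategy, uniform sampling; the paper may select frames differently, and all names are hypothetical.

```python
import numpy as np

def make_proxy_video(frames, num_proxy_frames=32):
    """Pick a fixed-length subset of frames to summarize a variable-length clip.

    frames: array of shape (T, H, W, C) for one stream (e.g., RGB or depth).
    Returns an array of shape (num_proxy_frames, H, W, C).
    """
    T = frames.shape[0]
    # Evenly spaced indices across the clip preserve its overall temporal dynamics.
    idx = np.linspace(0, T - 1, num_proxy_frames).round().astype(int)
    return frames[idx]
```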
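For wSignGen, the abstract credits CLIP integration with enabling both image-based prompts and generalization to unseen synonyms. The sketch below only illustrates the shared text/image embedding space that makes this plausible, using the public openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library; it is not the authors' conditioning pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor  # assumed dependency

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

picture = Image.new("RGB", (224, 224))  # placeholder for a real picture of the concept
inputs = processor(text=["book"], images=picture, return_tensors="pt", padding=True)

with torch.no_grad():
    # A written word and a picture land in the same embedding space, so either
    # can serve as the conditioning vector, and synonyms land near one another.
    text_vec = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_vec = model.get_image_features(pixel_values=inputs["pixel_values"])

print(text_vec.shape, image_vec.shape)  # both (1, 512) for this checkpoint
```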
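The picture-naming study relates naming reaction times (RTs) to lexical and stimulus properties. The toy regression below, run on simulated data, illustrates that style of item-level analysis; it is not the authors' dataset or modeling pipeline.

```python
import numpy as np

# Simulated item-level data: one row per picture (not the study's data).
rng = np.random.default_rng(1)
n = 200
freq, iconicity, phon_complexity, vis_complexity = rng.normal(size=(4, n))
# RTs are simulated so that frequency and iconicity speed naming while
# phonological and visual complexity slow it, mirroring the reported directions.
rt = (900 - 40 * freq - 25 * iconicity
      + 30 * phon_complexity + 20 * vis_complexity
      + rng.normal(0, 50, n))

# Ordinary least squares fit of RT on the four predictors.
X = np.column_stack([np.ones(n), freq, iconicity, phon_complexity, vis_complexity])
coef, *_ = np.linalg.lstsq(X, rt, rcond=None)
print(dict(zip(["intercept", "freq", "iconicity", "phon", "visual"], coef.round(1))))
```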
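The last item describes a late-fusion approach over forward and backward processing of the video stream. A minimal sketch of score-level late fusion under that reading is shown below; the actual system is a spatio-temporal GCN with gating and a transformer branch, and score_fn is a hypothetical stand-in for it.

```python
import numpy as np

def late_fusion(score_fn, frames, weights=(0.5, 0.5)):
    """Average per-class scores from forward and time-reversed passes over a clip.

    score_fn: callable mapping a (T, ...) frame or skeleton sequence to per-class scores.
    """
    forward_scores = score_fn(frames)
    backward_scores = score_fn(frames[::-1])  # same model applied to the reversed stream
    return weights[0] * forward_scores + weights[1] * backward_scores

# The recognized sign would be the argmax of the fused scores:
# predicted_sign = late_fusion(model_scores, clip).argmax()
```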