This content will become publicly available on November 27, 2024

Title: DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization
Since American Sign Language (ASL) has no standard written form, Deaf signers frequently share videos in order to communicate in their native language. However, since both hands and face convey critical linguistic information in signed languages, sign language videos cannot preserve signer privacy. While signers have expressed interest, for a variety of applications, in sign language video anonymization that would effectively preserve linguistic content, attempts to develop such technology have had limited success, given the complexity of hand movements and facial expressions. Existing approaches rely predominantly on precise pose estimations of the signer in video footage and often require sign language video datasets for training. These requirements prevent them from processing videos 'in the wild,' in part because of the limited diversity present in current sign language video datasets. To address these limitations, our research introduces DiffSLVA, a novel methodology that utilizes pre-trained large-scale diffusion models for zero-shot text-guided sign language video anonymization. We incorporate ControlNet, which leverages low-level image features such as HED (Holistically-Nested Edge Detection) edges, to circumvent the need for pose estimation. Additionally, we develop a specialized module dedicated to capturing facial expressions, which are critical for conveying essential linguistic information in signed languages. We then combine the above methods to achieve anonymization that better preserves the essential linguistic content of the original signer. This innovative methodology makes possible, for the first time, sign language video anonymization that could be used for real-world applications, which would offer significant benefits to the Deaf and Hard-of-Hearing communities. We demonstrate the effectiveness of our approach with a series of signer anonymization experiments.
Award ID(s):
2235405 2212302
NSF-PAR ID:
10475964
Publisher / Repository:
arXiv.org
Journal Name:
arXiv
ISSN:
2311.16060
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Deaf signers who wish to communicate in their native language frequently share videos on the Web. However, videos cannot preserve privacy—as is often desirable for discussion of sensitive topics—since both hands and face convey critical linguistic information and therefore cannot be obscured without degrading communication. Deaf signers have expressed interest in video anonymization that would preserve linguistic content. However, attempts to develop such technology have thus far shown limited success. We are developing a new method for such anonymization, with input from ASL signers. We modify a motion-based image animation model to generate high-resolution videos with the signer identity changed, but with preservation of linguistically significant motions and facial expressions. An asymmetric encoder-decoder structured image generator is used to generate the high-resolution target frame from the low-resolution source frame based on the optical flow and confidence map. We explicitly guide the model to attain clear generation of hands and face by using bounding boxes to improve the loss computation. FID and KID scores are used for evaluation of the realism of the generated frames. This technology shows great potential for practical applications to benefit deaf signers. 
  3. The use of virtual humans (i.e., avatars) holds the potential for interactive, automated interaction in domains such as remote communication, customer service, or public announcements. For signed language users, signing avatars could potentially provide accessible content by sharing information in the signer's preferred or native language. As the development of signing avatars has gained traction in recent years, researchers have come up with many different methods of creating signing avatars. The resulting avatars vary widely in their appearance, the naturalness of their movements, and facial expressions—all of which may potentially impact users' acceptance of the avatars. We designed a study to test the effects of these intrinsic properties of different signing avatars while also examining the extent to which people's own language experiences change their responses to signing avatars. We created video stimuli showing individual signs produced by (1) a live human signer (Human), (2) an avatar made using computer-synthesized animation (CS Avatar), and (3) an avatar made using high-fidelity motion capture (Mocap avatar). We surveyed 191 American Sign Language users, including Deaf ( N = 83), Hard-of-Hearing ( N = 34), and Hearing ( N = 67) groups. Participants rated the three signers on multiple dimensions, which were then combined to form ratings of Attitudes, Impressions, Comprehension, and Naturalness. Analyses demonstrated that the Mocap avatar was rated significantly more positively than the CS avatar on all primary variables. Correlations revealed that signers who acquire sign language later in life are more accepting of and likely to have positive impressions of signing avatars. Finally, those who learned ASL earlier were more likely to give lower, more negative ratings to the CS avatar, but we did not see this association for the Mocap avatar or the Human signer. 
Together, these findings suggest that movement quality and appearance significantly impact users' ratings of signing avatars and show that signed language users with earlier age of ASL acquisition are the most sensitive to movement quality issues seen in computer-generated avatars. We suggest that future efforts to develop signing avatars consider retaining the fluid movement qualities integral to signed languages. 
  4. Without a commonly accepted writing system for American Sign Language (ASL), Deaf or Hard of Hearing (DHH) ASL signers who wish to express opinions or ask questions online must post a video of their signing, if they prefer not to use written English, a language in which they may feel less proficient. Since the face conveys essential linguistic meaning, the face cannot simply be removed from the video in order to preserve anonymity. Thus, DHH ASL signers cannot easily discuss sensitive, personal, or controversial topics in their primary language, limiting engagement in online debate or inquiries about health or legal issues. We explored several recent attempts to address this problem through development of “face swap” technologies to automatically disguise the face in videos while preserving essential facial expressions and natural human appearance. We presented several prototypes to DHH ASL signers (N=16) and examined their interests in and requirements for such technology. After viewing transformed videos of other signers and of themselves, participants evaluated the understandability, naturalness of appearance, and degree of anonymity protection of these technologies. Our study revealed users’ perception of key trade-offs among these three dimensions, factors that contribute to each, and their views on transformation options enabled by this technology, for use in various contexts. Our findings guide future designers of this technology and inform selection of applications and design features. 