Title: Electromyogram‐Based Lip‐Reading via Unobtrusive Dry Electrodes and Machine Learning Methods
Abstract: Lip‐reading provides an effective speech communication interface for people with voice disorders and for intuitive human–machine interaction. Existing systems are generally challenged by bulkiness, obtrusiveness, and poor robustness against environmental interference. The lack of a truly natural and unobtrusive system for converting lip movements to speech precludes the continuous use and wide‐scale deployment of such devices. Here, the design of a hardware–software architecture to capture, analyze, and interpret lip movements associated with either normal or silent speech is presented. The system can recognize both distinct and similar visemes and remains robust in noisy or dark environments. Self‐adhesive, skin‐conformable, and semi‐transparent dry electrodes are developed to track high‐fidelity speech‐relevant electromyogram signals without impeding daily activities. The resulting skin‐like sensors form seamless contact with the curvilinear, dynamic surfaces of the skin, which is crucial for a high signal‐to‐noise ratio and minimal interference. Machine learning algorithms are employed to decode the electromyogram signals and convert them to spoken words. Finally, applications of the developed lip‐reading system in augmented reality and medical service are demonstrated, illustrating its great potential for immersive interaction and healthcare.
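The article itself does not include code; as a rough illustration of the kind of decoding pipeline the abstract describes (band-pass filtering of the surface EMG, time-domain feature extraction, and a supervised classifier mapping features to visemes or words), here is a minimal sketch in Python. The sampling rate, window sizes, feature set, and choice of classifier are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of an EMG-based viseme/word classifier, NOT the authors'
# implementation. Window length, band-pass edges, features, and the choice
# of classifier are all assumptions for illustration.
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FS = 1000  # assumed sampling rate (Hz)

def bandpass(emg, lo=20.0, hi=450.0, fs=FS, order=4):
    """Band-pass filter raw EMG to the typical surface-EMG band."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, emg, axis=-1)

def features(window):
    """Classic time-domain EMG features for one (channels, samples) window."""
    mav = np.mean(np.abs(window), axis=-1)                        # mean absolute value
    rms = np.sqrt(np.mean(window ** 2, axis=-1))                  # root mean square
    wl = np.sum(np.abs(np.diff(window, axis=-1)), axis=-1)        # waveform length
    zc = np.sum(np.diff(np.sign(window), axis=-1) != 0, axis=-1)  # zero crossings
    return np.concatenate([mav, rms, wl, zc])

def segment(emg, win=256, step=64):
    """Slide a window over filtered EMG and stack per-window feature vectors."""
    return np.array([features(emg[:, s:s + win])
                     for s in range(0, emg.shape[1] - win + 1, step)])

# X_raw: (trials, channels, samples) of recorded EMG; y: word/viseme labels.
# X = np.array([segment(bandpass(trial)).mean(axis=0) for trial in X_raw])
# clf = RandomForestClassifier(n_estimators=200)
# print(cross_val_score(clf, X, y, cv=5).mean())
```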
Award ID(s):
2129673
PAR ID:
10409521
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Small
Volume:
19
Issue:
17
ISSN:
1613-6810
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Silent speech interfaces have been pursued to restore spoken communication for individuals with voice disorders and to facilitate intuitive communication when acoustic-based speech communication is unreliable, inappropriate, or undesired. However, current methodologies for silent speech face several challenges, including bulkiness, obtrusiveness, low accuracy, limited portability, and susceptibility to interference. In this work, we present a wireless, unobtrusive, and robust silent speech interface for tracking and decoding speech-relevant movements of the temporomandibular joint. Our solution employs a single soft magnetic skin placed behind the ear for wireless and socially acceptable silent speech recognition. The developed system alleviates several concerns associated with existing interfaces based on face-worn sensors, including a large number of sensors, highly visible interfaces on the face, and obtrusive interconnections between sensors and data acquisition components. With machine learning-based signal processing techniques, good speech recognition accuracy is achieved (93.2% accuracy for phonemes, and 87.3% for a list of words from the same viseme groups). Moreover, the reported silent speech interface demonstrates robustness against noise from both ambient environments and users' daily motions. Finally, its potential in assistive technology and human–machine interactions is illustrated through two demonstrations: silent-speech-enabled smartphone assistants and silent-speech-enabled drone control.
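As a conceptual sketch of template-based silent speech decoding from a single wearable channel, the snippet below matches a new motion trace against enrolled reference traces with dynamic time warping (DTW). The paper's actual features and classifier are not specified here; this whole approach, including the `classify` helper, is an assumed illustration.

```python
# Conceptual sketch of decoding silent speech from a single magnetic-skin
# channel with DTW nearest-neighbour matching. This is an assumed,
# illustrative approach, not the paper's documented pipeline.
import numpy as np

def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D traces."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def classify(trace, templates):
    """Return the label of the enrolled template closest to `trace` under DTW.

    templates: list of (label, reference_trace) pairs recorded during
    an enrolment session.
    """
    return min(templates, key=lambda t: dtw_distance(trace, t[1]))[0]

# templates = [("left", trace0), ("right", trace1), ...]  # enrolment data
# print(classify(new_trace, templates))
```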
  2. Human speech perception is generally optimal in quiet environments; however, it becomes more difficult and error prone in the presence of noise, such as other humans speaking nearby or ambient noise. In such situations, human speech perception is improved by speech reading, i.e., watching the movements of a speaker's mouth and face, either consciously as done by people with hearing loss or subconsciously by other humans. While previous work focused largely on speech perception of two-dimensional videos of faces, there is a gap in the research field focusing on facial features as seen in head-mounted displays, including the impacts of display resolution, and the effectiveness of visually enhancing a virtual human face on speech perception in the presence of noise. In this paper, we present a comparative user study (N = 21) in which we investigated an audio-only condition compared to two levels of head-mounted display resolution (1832×1920 or 916×960 pixels per eye) and two levels of the native or visually enhanced appearance of a virtual human, the latter consisting of an up-scaled facial representation and simulated lipstick (lip coloring) added to increase contrast. To understand effects on speech perception in noise, we measured participants' speech reception thresholds (SRTs) for each audio-visual stimulus condition. These thresholds indicate the decibel levels of the speech signal that are necessary for a listener to receive the speech correctly 50% of the time. First, we show that the display resolution significantly affected participants' ability to perceive the speech signal in noise, which has practical implications for the field, especially in social virtual environments. Second, we show that our visual enhancement method was able to compensate for limited display resolution and was generally preferred by participants. Specifically, our participants indicated that they benefited from the head scaling more than the added facial contrast from the simulated lipstick. We discuss relationships, implications, and guidelines for applications that aim to leverage such enhancements.
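The speech reception threshold described above is commonly estimated with an adaptive staircase: the speech level is lowered after a correct response and raised after an incorrect one, so the track converges on the 50%-correct point. Below is a minimal sketch of such a 1-up/1-down procedure; the step size, trial count, starting level, and the `present_trial` callback are assumptions, not the study's protocol.

```python
# Illustrative sketch of a 1-up/1-down adaptive staircase for estimating a
# speech reception threshold (SRT): the procedure converges on the signal
# level at which responses are correct 50% of the time. Step sizes, trial
# counts, and the starting level are assumptions, not the study's protocol.
import numpy as np

def run_staircase(present_trial, start_db=0.0, step_db=2.0, n_trials=30):
    """Adaptively track the 50%-correct speech level.

    present_trial(level_db) -> True if the sentence was repeated correctly.
    Returns the mean level over the last reversals as the SRT estimate.
    """
    level, reversals, last_correct = start_db, [], None
    for _ in range(n_trials):
        correct = present_trial(level)
        if last_correct is not None and correct != last_correct:
            reversals.append(level)                 # direction changed: a reversal
        last_correct = correct
        level += -step_db if correct else step_db   # harder if right, easier if wrong
    return float(np.mean(reversals[-6:])) if reversals else level

# srt = run_staircase(lambda db: play_sentence_and_score(db))  # hypothetical callback
```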
  3. A loss of individuated finger movement affects critical aspects of daily activities. There is a need to develop neural-machine interface techniques that can continuously decode single finger movements. In this preliminary study, we evaluated a novel decoding method that used finger-specific motoneuron firing frequency to estimate joint kinematics and fingertip forces. High-density electromyogram (EMG) signals were obtained while the index or middle finger produced either dynamic flexion movements or isometric flexion forces. A source separation method was used to extract motor unit (MU) firing activities from a single trial. A separate validation trial was used to retain only the MUs associated with a particular finger. The finger-specific MU firing activities were then used to estimate individual finger joint angles and isometric forces in a third trial using a regression method. Our results showed that the MU-firing-based approach led to smaller prediction errors for both joint angles and forces compared with the conventional EMG-amplitude-based method. The outcomes can help develop intuitive neural-machine interface techniques that allow continuous single-finger-level control of robotic hands. In addition, because the previously obtained MU separation information can be applied directly to new data, online extraction of MU firing activities for real-time neural-machine interactions is possible.
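A minimal sketch of the regression idea in this abstract: smooth the motor-unit spike trains into firing-rate curves, then regress joint angle (or fingertip force) on them, training on one trial and predicting a held-out trial. The decomposition step that extracts MU spikes from high-density EMG is assumed to have been done already; the sampling rate, smoothing width, and ridge regressor are illustrative choices, not the authors' exact method.

```python
# Minimal sketch: smooth motor-unit spike trains into firing-rate curves and
# regress joint angle (or force) on them. Parameters are illustrative, and the
# MU decomposition step is assumed to have been done upstream.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.linear_model import Ridge

FS = 2048  # assumed EMG sampling rate (Hz)

def firing_rates(spike_trains, sigma_ms=100.0, fs=FS):
    """spike_trains: (n_units, n_samples) binary arrays -> smoothed rates (Hz)."""
    sigma = sigma_ms / 1000.0 * fs
    return gaussian_filter1d(spike_trains.astype(float), sigma, axis=1) * fs

# rates = firing_rates(mu_spikes)                      # (n_units, n_samples)
# model = Ridge(alpha=1.0).fit(rates.T, joint_angle)   # train on one trial
# angle_hat = model.predict(firing_rates(mu_spikes_test).T)  # held-out trial
# rmse = np.sqrt(np.mean((angle_hat - joint_angle_test) ** 2))
```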
  4. We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
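A conceptual sketch of the visual-conditioning idea described above: a small adapter projects per-frame facial features into the language model's embedding space, and a cross-attention fusion layer merges the two streams. Dimensions, module names, and the fusion strategy are assumptions; VoiceCraft-Dub's actual architecture may differ.

```python
# Conceptual sketch of a visual adapter plus audio-visual fusion layer for a
# codec language model. All shapes and module names are assumptions, not the
# VoiceCraft-Dub implementation.
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    """Project per-frame facial features into the NCLM token-embedding space."""
    def __init__(self, face_dim=512, token_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(face_dim, token_dim), nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, face_feats):             # (batch, frames, face_dim)
        return self.proj(face_feats)           # (batch, frames, token_dim)

class AudioVisualFusion(nn.Module):
    """Cross-attention from speech-token states to the adapted visual features."""
    def __init__(self, token_dim=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(token_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, token_states, visual):   # both (batch, seq, token_dim)
        fused, _ = self.attn(token_states, visual, visual)
        return self.norm(token_states + fused)  # residual connection

# tokens = torch.randn(2, 100, 1024)           # NCLM hidden states (toy shapes)
# faces = torch.randn(2, 50, 512)              # per-frame facial features
# fused = AudioVisualFusion()(tokens, VisualAdapter()(faces))
```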