Title: Decoding silent speech commands from articulatory movements through soft magnetic skin and machine learning
Silent speech interfaces have been pursued to restore spoken communication for individuals with voice disorders and to enable intuitive communication when acoustic speech is unreliable, inappropriate, or undesired. However, current silent speech methodologies face several challenges, including bulkiness, obtrusiveness, low accuracy, limited portability, and susceptibility to interference. In this work, we present a wireless, unobtrusive, and robust silent speech interface for tracking and decoding speech-relevant movements of the temporomandibular joint. Our solution employs a single soft magnetic skin placed behind the ear for wireless and socially acceptable silent speech recognition. The developed system alleviates several concerns associated with existing interfaces based on face-worn sensors, including a large number of sensors, highly visible interfaces on the face, and obtrusive interconnections between sensors and data acquisition components. With machine learning-based signal processing, good speech recognition accuracy is achieved (93.2% for phonemes and 87.3% for a list of words drawn from the same viseme groups). Moreover, the reported silent speech interface demonstrates robustness against noise from both ambient environments and users' daily motions. Finally, its potential in assistive technology and human-machine interaction is illustrated through two demonstrations: silent-speech-enabled smartphone assistants and silent-speech-enabled drone control.
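As a rough illustration of the final decoding step of such a pipeline, the sketch below trains a generic phoneme classifier on windowed magnetometer signals. The feature set, classifier choice, and data shapes are illustrative assumptions, not the authors' actual signal-processing method.

```python
# Minimal sketch of a phoneme classifier on magnetic-skin signals.
# All names, shapes, and the choice of classifier are illustrative
# assumptions; the paper does not specify this exact pipeline.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def extract_features(window):
    """Summary statistics per magnetometer axis for one utterance window."""
    return np.concatenate([
        window.mean(axis=0),
        window.std(axis=0),
        window.max(axis=0) - window.min(axis=0),
    ])

def train_phoneme_classifier(windows, labels):
    """windows: list of (n_samples, 3) magnetometer segments, one per utterance.
    labels: phoneme label for each segment."""
    X = np.stack([extract_features(w) for w in windows])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, stratify=labels, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    clf.fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))
    return clf
```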
Award ID(s):
2238363 2129673
PAR ID:
10491496
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
The Royal Society of Chemistry
Date Published:
Journal Name:
Materials Horizons
Volume:
10
Issue:
12
ISSN:
2051-6347
Page Range / eLocation ID:
5607 to 5620
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Abstract: Silent speech interfaces offer an alternative and efficient communication modality for individuals with voice disorders and for situations where vocalized speech is compromised by noisy environments. Despite recent progress in developing silent speech interfaces, these systems face several challenges that prevent their wide acceptance, such as bulkiness, obtrusiveness, and immobility. Herein, the material optimization, structural design, deep learning algorithm, and system integration of mechanically and visually unobtrusive silent speech interfaces are presented, realizing both speaker identification and speech content identification. Conformal, transparent, and self-adhesive electromyography electrode arrays are designed to capture speech-relevant muscle activities. Temporal convolutional networks are employed for recognizing speakers and converting sensing signals into spoken content. The resulting silent speech interfaces achieve 97.5% speaker classification accuracy and 91.5% keyword classification accuracy using four electrodes. The speech interface is further integrated with an optical hand-tracking system and a robotic manipulator for human-robot collaboration in both assembly and disassembly processes. The integrated system controls the robot manipulator through silent speech and facilitates the hand-over process through hand motion trajectory detection. The developed framework enables natural robot control in noisy environments and lays the groundwork for collaborative human-robot tasks involving multiple human operators.
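For reference, a minimal temporal convolutional network for classifying keywords from four-channel EMG windows might look like the sketch below. The layer widths, dilations, and number of classes are illustrative assumptions, not the architecture reported in the paper.

```python
# Minimal sketch of a TCN-style classifier for four-channel EMG windows.
# Layer sizes, kernel widths, and the number of keyword classes are
# illustrative assumptions, not the authors' architecture.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        # dilated 1-D convolution; padding keeps the sequence length unchanged
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))

class KeywordTCN(nn.Module):
    def __init__(self, n_channels=4, n_classes=10):
        super().__init__()
        self.blocks = nn.Sequential(
            TCNBlock(n_channels, 32, dilation=1),
            TCNBlock(32, 32, dilation=2),
            TCNBlock(32, 32, dilation=4),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):              # x: (batch, 4, time)
        h = self.blocks(x)             # (batch, 32, time)
        h = h.mean(dim=-1)             # global average pooling over time
        return self.head(h)            # keyword logits

# Example: logits = KeywordTCN()(torch.randn(8, 4, 1000))
```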
  2. Abstract: Lip-reading provides an effective speech communication interface for people with voice disorders and for intuitive human-machine interactions. Existing systems are generally challenged by bulkiness, obtrusiveness, and poor robustness against environmental interference. The lack of a truly natural and unobtrusive system for converting lip movements to speech precludes the continuous use and wide-scale deployment of such devices. Here, the design of a hardware-software architecture to capture, analyze, and interpret lip movements associated with either normal or silent speech is presented. The system can distinguish both distinct and visually similar visemes and is robust in noisy or dark environments. Self-adhesive, skin-conformable, and semi-transparent dry electrodes are developed to track high-fidelity speech-relevant electromyogram signals without impeding daily activities. The resulting skin-like sensors form seamless contact with the curvilinear and dynamic surfaces of the skin, which is crucial for a high signal-to-noise ratio and minimal interference. Machine learning algorithms are employed to decode the electromyogram signals and convert them into spoken words. Finally, applications of the developed lip-reading system in augmented reality and medical services are demonstrated, illustrating its great potential in immersive interaction and healthcare.
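A minimal sketch of the kind of EMG preprocessing such a system relies on, band-pass filtering plus per-channel normalization, is shown below. The sampling rate and pass band are typical surface-EMG values assumed for illustration, not figures taken from the paper.

```python
# Minimal sketch of EMG preprocessing before viseme classification.
# The 20-450 Hz band and the 1 kHz sampling rate are illustrative
# assumptions, not values from the paper.
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_emg(emg, fs=1000.0, band=(20.0, 450.0)):
    """emg: (n_samples, n_channels) raw electromyogram recording."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, emg, axis=0)
    # z-score each channel so downstream models see comparable scales
    return (filtered - filtered.mean(axis=0)) / (filtered.std(axis=0) + 1e-8)
```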
  3. Abstract: Human-machine interfaces that can track head motion will advance physical rehabilitation, improve augmented reality/virtual reality systems, and aid the study of human behavior. This paper presents a head position monitoring and classification system using thin, flexible strain-sensing threads placed on the neck of an individual. A wireless circuit module consisting of impedance readout circuitry and a Bluetooth module records and transmits strain information to a computer. A data processing algorithm for motion recognition provides near real-time quantification of head position. Incoming data is filtered, normalized, and divided into segments. A set of features is extracted from each segment and used as input to nine classifiers, including Support Vector Machine, Naive Bayes, and KNN, for position prediction. A testing accuracy of around 92% was achieved for a set of nine head orientations. The results indicate that this human-machine interface platform is accurate, flexible, easy to use, and cost-effective.
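The described processing chain (filter, segment, extract features, compare classifiers) can be sketched roughly as follows. The window length, feature set, and hyperparameters are illustrative assumptions, not the values used in the paper.

```python
# Minimal sketch of the described head-position pipeline: segment strain
# signals into windows, extract simple features, and compare a few of the
# listed classifiers. Window length, features, and hyperparameters are
# illustrative assumptions.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def segment(signal, labels, win=100):
    """Split a (n_samples, n_threads) strain recording into fixed windows."""
    n = len(signal) // win
    X = [signal[i * win:(i + 1) * win] for i in range(n)]
    y = [labels[i * win] for i in range(n)]  # label at the window start
    return X, np.array(y)

def features(window):
    return np.concatenate([window.mean(axis=0), window.std(axis=0),
                           window.max(axis=0), window.min(axis=0)])

def evaluate(signal, labels):
    Xw, y = segment(signal, labels)
    X = np.stack([features(w) for w in Xw])
    for name, clf in [("SVM", SVC()), ("Naive Bayes", GaussianNB()),
                      ("KNN", KNeighborsClassifier(n_neighbors=5))]:
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f}")
```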
  4. We introduce a deep learning model for speech denoising, a long-standing challenge in audio analysis arising in numerous applications. Our approach is based on a key observation about human speech: there is often a short pause between each sentence or word. In a recorded speech signal, those pauses introduce a series of time periods during which only noise is present. We leverage these incidental silent intervals to learn a model for automatic speech denoising given only mono-channel audio. Detected silent intervals over time expose not just pure noise but its time-varying features, allowing the model to learn noise dynamics and suppress it from the speech signal. Experiments on multiple datasets confirm the pivotal role of silent interval detection for speech denoising, and our method outperforms several state-of-the-art denoising methods, including those that accept only audio input (like ours) and those that denoise based on audiovisual input (and hence require more information). We also show that our method enjoys excellent generalization properties, such as denoising spoken languages not seen during training. 
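The core observation, that noise statistics can be estimated from speech-free frames, can be illustrated with a simple energy-threshold detector. The paper learns its silent-interval detector, so the hand-set rule below is only an assumed stand-in for the idea.

```python
# Minimal sketch of the key idea: detect silent intervals in a noisy
# mono recording by thresholding short-time energy, so noise statistics
# can be estimated from speech-free frames. The frame length and
# threshold rule are illustrative assumptions, not the paper's detector.
import numpy as np

def silent_intervals(audio, fs=16000, frame_ms=25, hop_ms=10, ratio=0.5):
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    energies = np.array([
        np.sum(audio[i:i + frame] ** 2)
        for i in range(0, len(audio) - frame, hop)
    ])
    threshold = ratio * np.median(energies)
    # frames below threshold are treated as noise-only ("silent") intervals
    return energies < threshold
```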
  5. This paper presents iSpyU, a system that demonstrates the feasibility of recognizing natural speech played on a phone during conference calls (Skype, Zoom, etc.) using a fusion of motion sensors such as the accelerometer and gyroscope. While microphones require user permission to be accessible to an app developer, motion sensors are zero-permission sensors and are thus accessible without alerting the user. This allows a malicious app to potentially eavesdrop on sensitive speech content played by the user's phone. In designing the attack, iSpyU tackles a number of technical challenges, including: (i) the low sampling rate of motion sensors (500 Hz, compared with 44 kHz for a microphone), and (ii) the lack of large-scale training datasets for training Automatic Speech Recognition (ASR) models on motion-sensor data. iSpyU systematically addresses these challenges through a combination of synthetic training data generation, ASR modeling, and domain adaptation. Extensive measurement studies on modern smartphones show a word-level accuracy of 53.3-59.9% over dictionaries of 2,000-10,000 words and a character-level accuracy of 70.0-74.8%. We believe such levels of accuracy pose a significant threat from a privacy perspective.
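One way to picture the low-sampling-rate challenge is to build ASR-style spectrogram features directly from a 500 Hz accelerometer stream, as in the assumed sketch below. The axis choice, window sizes, and normalization are illustrative, not iSpyU's actual front end.

```python
# Minimal sketch of turning a low-rate (~500 Hz) accelerometer stream into
# spectrogram features that an ASR-style model could consume. Window sizes
# and normalization are illustrative assumptions.
import numpy as np
from scipy.signal import stft

def accel_spectrogram(accel_z, fs=500, nperseg=64, noverlap=48):
    """accel_z: one accelerometer axis sampled at ~500 Hz."""
    _, _, Z = stft(accel_z, fs=fs, nperseg=nperseg, noverlap=noverlap)
    log_mag = np.log1p(np.abs(Z))          # (freq_bins, time_frames)
    # per-recording normalization before feeding an ASR-style model
    return (log_mag - log_mag.mean()) / (log_mag.std() + 1e-8)
```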