Silent speech interfaces have been pursued to restore spoken communication for individuals with voice disorders and to facilitate intuitive communication when acoustic speech is unreliable, inappropriate, or undesired. However, current silent speech methodologies face several challenges, including bulkiness, obtrusiveness, low accuracy, limited portability, and susceptibility to interference. In this work, we present a wireless, unobtrusive, and robust silent speech interface for tracking and decoding speech-relevant movements of the temporomandibular joint. Our solution employs a single soft magnetic skin placed behind the ear for wireless and socially acceptable silent speech recognition. The developed system alleviates several concerns associated with existing interfaces based on face-worn sensors, including the large number of sensors, highly visible interfaces on the face, and obtrusive interconnections between sensors and data acquisition components. With machine learning-based signal processing techniques, good speech recognition accuracy is achieved (93.2% for phonemes and 87.3% for a list of words drawn from the same viseme groups). Moreover, the reported silent speech interface demonstrates robustness against noise from both the ambient environment and users' daily motions. Finally, its potential in assistive technology and human–machine interaction is illustrated through two demonstrations: silent-speech-enabled smartphone assistants and silent-speech-enabled drone control.
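As a rough illustration of the kind of machine-learning classification this abstract alludes to, the sketch below trains a generic scikit-learn classifier on windowed sensor features. The array names, window size, class count, and classifier choice are assumptions made here for illustration; the paper's actual signal-processing pipeline is not described in this record.

```python
# Minimal sketch, not the paper's pipeline: classify phoneme utterances from
# windowed magnetic-skin signals with a generic scikit-learn classifier.
# Data shapes and labels below are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3 * 200))   # 600 utterances, 200 samples x 3 sensor axes
y = rng.integers(0, 10, size=600)     # 10 phoneme classes (placeholder labels)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.3f}")
```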
Decoding Silent Speech Cues From Muscular Biopotential Signals for Efficient Human‐Robot Collaborations
Abstract Silent speech interfaces offer an alternative and efficient communication modality for individuals with voice disorders and for situations in which vocalized speech is compromised by noisy environments. Despite recent progress in developing silent speech interfaces, these systems face several challenges that prevent their wide acceptance, such as bulkiness, obtrusiveness, and immobility. Herein, the material optimization, structural design, deep learning algorithm, and system integration of mechanically and visually unobtrusive silent speech interfaces are presented that can realize both speaker identification and speech content identification. Conformal, transparent, and self-adhesive electromyography electrode arrays are designed for capturing speech-relevant muscle activities. Temporal convolutional networks are employed for recognizing speakers and converting sensing signals into spoken content. The resulting silent speech interfaces achieve 97.5% speaker classification accuracy and 91.5% keyword classification accuracy using four electrodes. The speech interface is further integrated with an optical hand-tracking system and a robotic manipulator for human–robot collaboration in both assembly and disassembly processes. The integrated system controls the robot manipulator through silent speech and facilitates the hand-over process through hand motion trajectory detection. The developed framework enables natural robot control in noisy environments and lays the groundwork for collaborative human–robot tasks involving multiple human operators.
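For readers unfamiliar with temporal convolutional networks, the following is a minimal PyTorch sketch of a dilated-convolution classifier operating on four-channel EMG windows. Layer widths, window length, and the number of keyword classes are placeholders, not the configuration reported in the paper.

```python
# Minimal sketch of a temporal convolutional network (TCN) classifier of the
# kind the abstract describes, assuming 4-channel EMG input windows.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, ch_in, ch_out, dilation):
        super().__init__()
        pad = dilation * 2  # pad symmetrically, trim back to input length below
        self.net = nn.Sequential(
            nn.Conv1d(ch_in, ch_out, kernel_size=3, dilation=dilation, padding=pad),
            nn.ReLU(),
            nn.Conv1d(ch_out, ch_out, kernel_size=3, dilation=dilation, padding=pad),
            nn.ReLU(),
        )
        self.skip = nn.Conv1d(ch_in, ch_out, kernel_size=1)

    def forward(self, x):
        out = self.net(x)[..., : x.shape[-1]]   # keep sequence length fixed
        return out + self.skip(x)               # residual connection

class KeywordTCN(nn.Module):
    def __init__(self, n_channels=4, n_classes=10):
        super().__init__()
        self.blocks = nn.Sequential(
            TCNBlock(n_channels, 32, dilation=1),
            TCNBlock(32, 32, dilation=2),
            TCNBlock(32, 32, dilation=4),        # growing dilation widens context
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                        # x: (batch, channels, time)
        h = self.blocks(x).mean(dim=-1)          # global average pooling over time
        return self.head(h)

model = KeywordTCN()
logits = model(torch.randn(8, 4, 512))           # 8 windows, 4 electrodes, 512 samples
print(logits.shape)                              # torch.Size([8, 10])
```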
- PAR ID: 10573668
- Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name: Advanced Materials Technologies
- Volume: 10
- Issue: 4
- ISSN: 2365-709X
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Despite their technical advancements, commercially available telerobots are limited in social interaction capabilities for both pilot and local users, specifically in nonverbal communication. Our group hypothesizes that the introduction of expressive gesturing and tangible interaction capabilities (e.g., handshakes, fist bumps) will enhance telerobotic interactions and increase social connection between users. To investigate the affordances for social connection that gestures and tangible interactions provide in telerobot-mediated interactions, we designed and integrated a lightweight manipulator terminating in an anthropomorphic end effector onto a commercially available telerobot (Anybots QB 2.0). Through virtual reality tracking of the pilot user's arm and hand, expressive gestures and social contact interactions are recreated via the manipulator, enabling a pilot user and a local user to engage in a tangible exchange. To assess the usability and effectiveness of the gesturing system, we present evaluations from both the local and pilot user perspectives. First, we present a validation study to assess usability of the control system by the pilot user. Our results demonstrate that pilot user interactions can be replicated with a greater than 80% pass rate and a mean ease-of-use rating of 7.08 ± 1.32 (out of 10) with brief training. Second, we present a user study to assess the social impacts of (1) using the telerobot without the manipulator, from both the pilot user and local user perspectives, and (2) using the control system and telerobotic manipulator, from both the pilot user and local user perspectives. Results demonstrate that the robot with the manipulator elicited a more positive social experience than the robot without the arm for local users, but no significant difference between conditions for pilot users. Future work will focus on improving the pilot user experience to support social contact interactions.
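As a loose illustration of how a tracked pilot hand pose might be turned into a manipulator command, the sketch below scales and clamps a wrist position into a robot workspace. All frames, gains, and limits here are invented placeholders; the study's actual control system is not specified in this record.

```python
# Minimal sketch, assuming a VR-tracked wrist position expressed in the
# robot-base frame; scale factor and workspace limits are hypothetical.
import numpy as np

WORKSPACE_MIN = np.array([-0.30, -0.30, 0.10])   # metres, placeholder limits
WORKSPACE_MAX = np.array([0.30, 0.30, 0.60])
SCALE = 0.8                                      # pilot motion -> robot motion gain

def hand_to_end_effector(hand_pos_m, robot_origin_m):
    """Map a tracked hand position to a clamped end-effector target."""
    target = robot_origin_m + SCALE * (hand_pos_m - robot_origin_m)
    return np.clip(target, WORKSPACE_MIN, WORKSPACE_MAX)

# Example: tracked wrist at (0.25, 0.10, 0.70) m relative to the robot base
print(hand_to_end_effector(np.array([0.25, 0.10, 0.70]), np.zeros(3)))
```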
-
An important dimension of classroom group dynamics and collaboration is how much each person contributes to the discussion. With the goal of distinguishing teachers' speech from children's speech and measuring how much each student speaks, we have investigated how automatic speaker diarization can be built to handle real-world classroom group discussions. We examined key design considerations such as the level of granularity of speaker assignment, speech enhancement techniques, voice activity detection, and embedding assignment methods to find an effective configuration. The best speaker diarization system we found was based on the ECAPA-TDNN speaker embedding model and used Whisper automatic speech recognition to identify speech segments. The diarization error rate (DER) on challenging, noisy, spontaneous classroom data was around 34%, and the correlation between estimated and human-annotated amounts of speech per student reached 0.62. The accuracy of distinguishing teachers' speech from children's speech was 69.17%. We evaluated the system for potential accuracy bias across people of different skin tones and genders and found that accuracy did not show statistically significant differences across either dimension. Thus, the presented diarization system has the potential to benefit educational research and to provide teachers and students with useful feedback to better understand their classroom dynamics.
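A minimal sketch of a diarization pipeline in the same spirit is shown below, assuming the openai-whisper, speechbrain, and scikit-learn packages and a known number of talkers. Segment handling, the clustering choice, and the file name are illustrative assumptions rather than the configuration evaluated in the study.

```python
# Sketch: Whisper segments speech, ECAPA-TDNN embeds each segment,
# and agglomerative clustering assigns speaker labels.
import numpy as np
import torch
import whisper
from sklearn.cluster import AgglomerativeClustering
from speechbrain.pretrained import EncoderClassifier

SR = 16000
audio = whisper.load_audio("classroom.wav")          # hypothetical recording
asr = whisper.load_model("base")
segments = asr.transcribe(audio)["segments"]         # speech regions + text

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

embeddings = []
for seg in segments:
    chunk = audio[int(seg["start"] * SR): int(seg["end"] * SR)]
    with torch.no_grad():
        emb = encoder.encode_batch(torch.from_numpy(chunk).unsqueeze(0))
    embeddings.append(emb.squeeze().numpy())
embeddings = np.stack(embeddings)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Assume the number of talkers (e.g., teacher + 3 students) is known.
labels = AgglomerativeClustering(n_clusters=4).fit_predict(embeddings)
for seg, spk in zip(segments, labels):
    print(f"{seg['start']:6.1f}-{seg['end']:6.1f}s  speaker {spk}: {seg['text']}")
```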
-
Abstract Modern automatic speech recognition (ASR) systems are capable of impressive performance recognizing clean speech but struggle in noisy, multi-talker environments, commonly referred to as the "cocktail party problem." In contrast, many human listeners can solve this problem, suggesting the existence of a solution in the brain. Here we present a novel approach that uses a brain-inspired sound segregation algorithm (BOSSA) as a preprocessing step for a state-of-the-art ASR system (Whisper). We evaluated BOSSA's impact on ASR accuracy in a spatialized multi-talker scene with one target speaker and two competing maskers, varying the difficulty of the task by changing the target-to-masker ratio. We found that the median word error rate improved by up to 54% when the target-to-masker ratio was low. Our results indicate that brain-inspired algorithms have the potential to considerably enhance ASR accuracy in challenging multi-talker scenarios without the need for retraining or fine-tuning existing state-of-the-art ASR systems.
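The evaluation idea can be sketched as follows: transcribe the mixture with and without a segregation front end and compare word error rates. In the sketch below, `segregate_target()` is a placeholder for a BOSSA-like preprocessor (not its real interface), and the reference text and file name are invented for illustration; the Whisper usage follows the openai-whisper package.

```python
# Sketch: run ASR on mixture audio with and without a source-segregation
# front end, then compare word error rates.
import numpy as np
import whisper

def wer(ref: str, hyp: str) -> float:
    """Word error rate via edit distance over whitespace-separated tokens."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[-1, -1] / max(len(r), 1)

def segregate_target(mixture: np.ndarray) -> np.ndarray:
    return mixture  # placeholder for a spatial sound-segregation algorithm

model = whisper.load_model("base")
mixture = whisper.load_audio("multitalker_scene.wav")   # hypothetical file
reference = "the quick brown fox jumps over the lazy dog"  # hypothetical transcript

hyp_raw = model.transcribe(mixture)["text"]
hyp_enh = model.transcribe(segregate_target(mixture))["text"]
print("WER without front end:", wer(reference, hyp_raw.lower()))
print("WER with front end:   ", wer(reference, hyp_enh.lower()))
```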
-
Abstract As artificial intelligence and industrial automation develop, human–robot collaboration (HRC) with advanced interaction capabilities has become an increasingly significant area of research. In this paper, we design and develop a real-time, multi-modal HRC system using speech and gestures. A set of 16 dynamic gestures is designed for communication from a human to an industrial robot. A data set of dynamic gestures is designed and constructed, and it will be shared with the community. A convolutional neural network is developed to recognize the dynamic gestures in real time using motion history images and deep learning methods. An improved open-source speech recognizer is used for real-time speech recognition of the human worker. An integration strategy is proposed to combine the gesture and speech recognition results, and a software interface is designed for system visualization. A multi-threading architecture is constructed to operate multiple tasks simultaneously, including gesture and speech data collection and recognition, data integration, robot control, and software interface operation. The various methods and algorithms are integrated to develop the HRC system, and a platform is constructed to demonstrate the system performance. The experimental results validate the feasibility and effectiveness of the proposed algorithms and the HRC system.
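As an illustration of the motion-history-image representation mentioned above, the sketch below accumulates frame differences into a decaying motion map with NumPy. The threshold, decay duration, and synthetic frames are assumptions for illustration, not the paper's settings.

```python
# Sketch: build a motion history image (MHI) from a grayscale frame sequence,
# the kind of input a gesture-recognition CNN could consume.
import numpy as np

def motion_history_image(frames, thresh=30, duration=20):
    """frames: sequence of equally sized uint8 grayscale images."""
    frames = [f.astype(np.float32) for f in frames]
    mhi = np.zeros_like(frames[0])
    for t in range(1, len(frames)):
        motion = np.abs(frames[t] - frames[t - 1]) > thresh        # moving pixels
        mhi = np.where(motion, duration, np.maximum(mhi - 1, 0))   # set or decay
    return (mhi / duration * 255).astype(np.uint8)                 # scale for a CNN

# Example with synthetic frames: a bright block sweeping left to right.
seq = []
for t in range(20):
    img = np.zeros((64, 64), dtype=np.uint8)
    img[20:40, 2 * t: 2 * t + 10] = 255
    seq.append(img)
print(motion_history_image(seq).max())   # most recent motion -> brightest pixels
```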