
Title: MuteIt: Jaw Motion Based Unvoiced Command Recognition Using Earable
In this paper, we present MuteIt, an ear-worn system for recognizing unvoiced human commands. MuteIt presents an intuitive alternative to voice-based interactions that can be unreliable in noisy environments, disruptive to those around us, and compromise our privacy. We propose a twin-IMU setup to track the user's jaw motion and cancel motion artifacts caused by head and body movements. MuteIt processes jaw motion during word articulation to break each word signal into its constituent syllables, and further each syllable into phonemes (vowels, visemes, and plosives). Recognizing unvoiced commands by only tracking jaw motion is challenging. As a secondary articulator, jaw motion is not distinctive enough for unvoiced speech recognition. MuteIt combines IMU data with the anatomy of jaw movement as well as principles from linguistics, to model the task of word recognition as an estimation problem. Rather than employing machine learning to train a word classifier, we reconstruct each word as a sequence of phonemes using a bi-directional particle filter, enabling the system to be easily scaled to a large set of words. We validate MuteIt with 20 subjects with diverse speech accents to recognize 100 common command words. MuteIt achieves a mean word recognition accuracy of 94.8% in noise-free conditions. When compared with common voice assistants, MuteIt outperforms them in noisy acoustic environments, achieving higher than 90% recognition accuracy. Even in the presence of motion artifacts, such as head movement, walking, and riding in a moving vehicle, MuteIt achieves mean word recognition accuracy of 91% over all scenarios.
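The idea of using a second IMU to cancel head and body motion can be illustrated with a simple differential subtraction. This is only a minimal sketch under strong assumptions, not the method used in MuteIt: it assumes the two sensors are time-synchronized, share the same axis orientation, and that head and body motion appears identically in both streams. The function name `cancel_common_motion` and the sample values are hypothetical.

```python
def cancel_common_motion(jaw_imu, ref_imu):
    """Subtract a reference IMU stream from a jaw-side IMU stream.

    Each stream is a list of (x, y, z) samples. Assumption (not from the
    paper): both streams are synchronized and axis-aligned, so motion
    common to both sensors cancels under per-axis subtraction.
    """
    if len(jaw_imu) != len(ref_imu):
        raise ValueError("streams must have the same number of samples")
    return [
        tuple(j - r for j, r in zip(jaw_sample, ref_sample))
        for jaw_sample, ref_sample in zip(jaw_imu, ref_imu)
    ]


# Toy data: head motion is seen by both sensors, jaw motion by only one.
head_motion = [(0.0, 0.0, 1.0), (0.5, 0.0, 1.0)]
jaw_motion = [(0.0, 0.0, 0.0), (0.0, 0.25, 0.0)]
jaw_imu = [
    tuple(h + j for h, j in zip(hs, js))
    for hs, js in zip(head_motion, jaw_motion)
]

residual = cancel_common_motion(jaw_imu, head_motion)
```

In this toy case `residual` contains only the jaw component. A real system would likely need calibration and filtering, because the two sensors do not experience exactly the same motion.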
Award ID(s):
2110193 2132112
Journal Name:
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies
Page Range / eLocation ID:
1 to 26
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Silent speech interfaces have been pursued to restore spoken communication for individuals with voice disorders and to facilitate intuitive communications when acoustic-based speech communication is unreliable, inappropriate, or undesired. However, the current methodology for silent speech faces several challenges, including bulkiness, obtrusiveness, low accuracy, limited portability, and susceptibility to interferences. In this work, we present a wireless, unobtrusive, and robust silent speech interface for tracking and decoding speech-relevant movements of the temporomandibular joint. Our solution employs a single soft magnetic skin placed behind the ear for wireless and socially acceptable silent speech recognition. The developed system alleviates several concerns associated with existing interfaces based on face-worn sensors, including a large number of sensors, highly visible interfaces on the face, and obtrusive interconnections between sensors and data acquisition components. With machine learning-based signal processing techniques, good speech recognition accuracy is achieved (93.2% accuracy for phonemes, and 87.3% for a list of words from the same viseme groups). Moreover, the reported silent speech interface demonstrates robustness against noises from both ambient environments and users’ daily motions. Finally, its potential in assistive technology and human–machine interactions is illustrated through two demonstrations – silent speech enabled smartphone assistants and silent speech enabled drone control. 
  2. In this paper, we present Jawthenticate, an earable system that authenticates a user using audible or inaudible speech without using a microphone. This system can overcome the shortcomings of traditional voice-based authentication systems like unreliability in noisy conditions and spoofing using microphone-based replay attacks. Jawthenticate derives distinctive speech-related features from the jaw motion and associated facial vibrations. This combination of features makes Jawthenticate resilient to vocal imitations as well as camera-based spoofing. We use these features to train a two-class SVM classifier for each user. Our system is invariant to the content and language of speech. In a study conducted with 41 subjects, who speak different native languages, Jawthenticate achieves a Balanced Accuracy (BAC) of 97.07%, True Positive Rate (TPR) of 97.75%, and True Negative Rate (TNR) of 96.4% with just 3 seconds of speech data.
  3. Muresan, Smaranda ; Nakov, Preslav ; Villavicencio, Aline (Ed.)
    Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definition of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of phoneme inventory with raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms. 
  4. Deployed social robots are increasingly relying on wakeword-based interaction, where interactions are human-initiated by a wakeword like “Hey Jibo”. While wakewords help to increase speech recognition accuracy and ensure privacy, there is concern that wakeword-driven interaction could encourage impolite behavior because wakeword-driven speech is typically phrased as commands. To address these concerns, companies have sought to use wakeword design to encourage interactant politeness, through wakewords like “⟨Name⟩, please”. But while this solution is intended to encourage people to use more “polite words”, researchers have found that these wakeword designs actually decrease interactant politeness in text-based communication, and that other wakeword designs could better encourage politeness by priming users to use Indirect Speech Acts. Yet there has been no previous research to directly compare these wakeword designs in in-person, voice-based human-robot interaction experiments, and previous in-person HRI studies could not effectively study carryover of wakeword-driven politeness and impoliteness into human-human interactions. In this work, we conceptually reproduced these previous studies (n=69) to assess how the wakewords “Hey ⟨Name⟩”, “Excuse me ⟨Name⟩”, and “⟨Name⟩, please” impact robot-directed and human-directed politeness. Our results demonstrate the ways that different types of linguistic priming interact in nuanced ways to induce different types of robot-directed and human-directed politeness.
  5. Human speech perception involves transforming a continuous acoustic signal into discrete linguistically meaningful units (phonemes) while simultaneously causing a listener to activate words that are similar to the spoken utterance and to each other. The Neighborhood Activation Model posits that phonological neighbors (two forms [words] that differ by one phoneme) compete significantly for recognition as a spoken word is heard. This definition of phonological similarity can be extended to an entire corpus of forms to produce a phonological neighbor network (PNN). We study PNNs for five languages: English, Spanish, French, Dutch, and German. Consistent with previous work, we find that the PNNs share a consistent set of topological features. Using an approach that generates random lexicons with increasing levels of phonological realism, we show that even random forms with minimal relationship to any real language, combined with only the empirical distribution of language-specific phonological form lengths, are sufficient to produce the topological properties observed in the real language PNNs. The resulting pseudo-PNNs are insensitive to the level of linguistic realism in the random lexicons but quite sensitive to the shape of the form length distribution. We therefore conclude that “universal” features seen across multiple languages are really string universals, not language universals, and arise primarily due to limitations in the kinds of networks generated by the one-step neighbor definition. Taken together, our results indicate that caution is warranted when linking the dynamics of human spoken word recognition to the topological properties of PNNs, and that the investigation of alternative similarity metrics for phonological forms should be a priority.
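The one-step neighbor definition in the last abstract above (two forms that differ by one phoneme) can be sketched as a small graph construction. This is an illustrative sketch only: it assumes "differ by one phoneme" means a single substitution, insertion, or deletion, and the toy lexicon, phoneme symbols, and function names (`is_neighbor`, `build_pnn`) are hypothetical rather than taken from that work.

```python
from itertools import combinations


def is_neighbor(a, b):
    """True if phoneme tuples a and b differ by exactly one phoneme.

    Assumption: one substitution, insertion, or deletion counts as
    "differing by one phoneme".
    """
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: dropping one phoneme from the longer form
    # must yield the shorter form (insertion or deletion).
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def build_pnn(lexicon):
    """Build an undirected neighbor graph from {word: phoneme tuple}."""
    graph = {word: set() for word in lexicon}
    for w1, w2 in combinations(lexicon, 2):
        if is_neighbor(lexicon[w1], lexicon[w2]):
            graph[w1].add(w2)
            graph[w2].add(w1)
    return graph


# Hypothetical toy lexicon with made-up phoneme labels.
toy_lexicon = {
    "cat": ("k", "ae", "t"),
    "bat": ("b", "ae", "t"),
    "at": ("ae", "t"),
    "dog": ("d", "ao", "g"),
}
pnn = build_pnn(toy_lexicon)
```

Here "cat", "bat", and "at" form a connected triangle while "dog" is isolated. The pairwise comparison is quadratic in lexicon size, which is adequate for a sketch but would need indexing for a full corpus.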