Title: Look at Me When I Talk to You: A Video Dataset to Enable Voice Assistants to Recognize Errors
People interacting with voice assistants are often frustrated by voice assistants' frequent errors and inability to respond to backchannel cues. We introduce an open-source video dataset of 21 participants' interactions with a voice assistant, and explore the possibility of using this dataset to enable automatic error recognition to inform self-repair. The dataset includes clipped and labeled videos of participants' faces during free-form interactions with the voice assistant from the smart speaker's perspective. To validate our dataset, we emulated a machine learning classifier by asking crowdsourced workers to recognize voice assistant errors from watching soundless video clips of participants' reactions. We found trends suggesting it is possible to determine the voice assistant's performance from a participant's facial reaction alone. This work posits elicited datasets of interactive responses as a key step towards improving error recognition for repair for voice assistants in a wide variety of applications.
Award ID(s):
1700832
NSF-PAR ID:
10249789
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
arXiv.org
ISSN:
2331-8422
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Voice assistants are becoming increasingly pervasive due to the convenience and automation they provide through the voice interface. However, such convenience often comes with unforeseen security and privacy risks. For example, encrypted traffic from voice assistants can leak sensitive information about their users' habits and lifestyles. In this paper, we present a taxonomy of fingerprinting voice commands on the most popular voice assistant platforms (Google, Alexa, and Siri). We also provide a deeper understanding of the feasibility of fingerprinting third-party applications and streaming services over the voice interface. Our analysis not only improves the state-of-the-art technique but also studies a more realistic setup for fingerprinting voice activities over encrypted traffic. Our proposed technique considers a passive network eavesdropper observing encrypted traffic from various devices within a home and, therefore, first detects the invocation/activation of voice assistants followed by what specific voice command is issued. Using an end-to-end system design, we show that it is possible to detect when a voice assistant is activated with 99% accuracy and then utilize the subsequent traffic pattern to infer more fine-grained user activities with around 77-80% accuracy.
  2. Intelligent voice assistants, and the third-party apps (aka “skills” or “actions”) that power them, are increasing in popularity and beginning to experiment with the ability to continuously listen to users. This paper studies how privacy concerns related to such always-listening voice assistants might affect consumer behavior and whether certain privacy mitigations would render them more acceptable. To explore these questions with more realistic user choices, we built an interactive app store that allowed users to install apps for a hypothetical always-listening voice assistant. In a study with 214 participants, we asked users to browse the app store and install apps for different voice assistants that offered varying levels of privacy protections. We found that users were generally more willing to install continuously-listening apps when there were greater privacy protections, but this effect was not universally present. The majority did not review any permissions in detail, but still expressed a preference for stronger privacy protections. Our results suggest that privacy factors into user choice, but many people choose to skip this information.
  3. This paper presents the design and implementation of Scribe, a comprehensive voice processing and handwriting interface for voice assistants. Distinct from prior works, Scribe is a precise tracking interface that can co-exist with the voice interface on low sampling rate voice assistants. Scribe can be used for 3D free-form drawing, writing, and motion tracking for gaming. Taking handwriting as a specific application, it can also capture natural strokes and the individualized style of writing while occupying only a single frequency. The core technique includes an accurate acoustic ranging method called Cross Frequency Continuous Wave (CFCW) sonar, enabling voice assistants to use ultrasound as a ranging signal while using the regular microphone system of voice assistants as a receiver. We also design a new optimization algorithm that only requires a single frequency for time difference of arrival. The Scribe prototype achieves 73 μm of median error for 1D ranging and 1.4 mm of median error in 3D tracking of an acoustic beacon using the microphone array used in voice assistants. Our implementation of an in-air handwriting interface achieves 94.1% accuracy with automatic handwriting-to-text software, similar to writing on paper (96.6%). At the same time, the error rate of voice-based user authentication only increases from 6.26% to 8.28%.
  4. Intelligent voice assistants may soon become proactive, offering suggestions without being directly invoked. Such behavior increases privacy risks, since proactive operation requires continuous monitoring of conversations. To mitigate this problem, our study proposes and evaluates one potential privacy control, in which the assistant requests permission for the information it wishes to use immediately after hearing it. To find out how people would react to runtime permission requests, we recruited 23 pairs of participants to hold conversations while receiving ambient suggestions from a proactive assistant, which we simulated in real time using the Wizard of Oz technique. The interactive sessions featured different modes and designs of runtime permission requests and were followed by in-depth interviews about people's preferences and concerns. Most participants were excited about the devices despite their continuous listening, but wanted control over the assistant's actions and their own data. They generally prioritized an interruption-free experience above more fine-grained control over what the device would hear. 
  5. Speaker recognition as a biometric modality is on the rise in the consumer marketplace for banking, online services, and personal assistant services with a potential for wider application areas. Most current applications involve adults. One of the biggest challenges in speaker recognition for children is the change in the voice properties as a child ages. This work proposes a baseline longitudinal dataset from the same 30 children in the age group of 4 to 14 years over a time frame of 2.5 years and evaluates speaker recognition performance in children with the available speaker recognition technology.