Abstract: This study compares how English-speaking adults and children from the United States adapt their speech when talking to a real person versus a smart speaker (Amazon Alexa) in a psycholinguistic experiment. Overall, participants produced more effortful speech when talking to the device (longer duration and higher pitch). These differences also varied by age: children produced even higher pitch in device-directed speech, suggesting a stronger expectation of being misunderstood by the system. In support of this, after a staged recognition error by the device, children increased their pitch even more. Furthermore, adults and children displayed the same degree of variation in their responses to whether “Alexa seems like a real person or not”, further indicating that children’s conceptualization of the system’s competence, rather than increased anthropomorphism, shaped their register adjustments. This work speaks to models of the mechanisms underlying speech production and to human–computer interaction frameworks, providing support for routinized theories of spoken interaction with technology.
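Pitch differences of the kind reported here are typically measured by running a pitch tracker over each utterance. As a rough illustration only (not the authors' pipeline, which would normally rely on a dedicated tool such as Praat), a minimal autocorrelation-based f0 estimate on a synthetic tone can be sketched as:

```python
import numpy as np

def estimate_f0(signal, sr, fmin=75.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) via autocorrelation.

    Simplified stand-in for real pitch tracking; hypothetical
    helper, not the study's actual measurement pipeline.
    """
    sig = signal - signal.mean()
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lag_min = int(sr / fmax)  # shortest plausible pitch period
    lag_max = int(sr / fmin)  # longest plausible pitch period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)   # synthetic 220 Hz "voice"
print(round(estimate_f0(tone, sr), 1))  # close to 220 Hz
```

On real recordings one would window the signal and track f0 frame by frame; the single-window version above only conveys the core idea.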
Analyzing the relationship between productivity and human communication in an organizational setting
Though it is often taken as a truism that communication contributes to organizational productivity, there are surprisingly few empirical studies documenting a relationship between observable interaction and productivity. This is because comprehensive, direct observation of communication in organizational settings is notoriously difficult. In this paper, we report a method for extracting network and speech characteristics data from audio recordings of participants talking with each other in real time. We use this method to analyze communication and productivity data from seventy-nine employees working within a software engineering organization who had their speech recorded during working hours for a period of approximately 3 years. From the speech data, we infer when any two individuals are talking to each other and use this information to construct a communication graph for the organization for each week. We use the spectral and temporal characteristics of the produced speech and the structure of the resultant communication graphs to predict the productivity of the group, as measured by the number of lines of code produced. The results indicate that the most important speech and network features for predicting productivity include those that measure the number of unique people interacting within the organization, the frequency of interactions, and the topology of the communication network.
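The weekly graph construction and the features found most predictive (unique people, interaction frequency, network topology) can be illustrated with a small sketch; the dyad records below are hypothetical stand-ins for the conversation events inferred from the audio:

```python
from collections import defaultdict

# Hypothetical records: (week, speaker_a, speaker_b), one per
# inferred conversation event. Illustrative data only.
dyads = [
    (1, "ana", "ben"), (1, "ana", "cal"), (1, "ben", "cal"),
    (2, "ana", "ben"), (2, "ana", "ben"), (2, "dee", "cal"),
]

def weekly_features(records):
    """Per-week graph features of the kind fed to the
    productivity model: unique people, interaction count,
    and edge density (a simple topology measure)."""
    by_week = defaultdict(list)
    for week, a, b in records:
        by_week[week].append(frozenset((a, b)))
    feats = {}
    for week, pairs in by_week.items():
        people = set().union(*pairs)
        n = len(people)
        edges = set(pairs)  # unique dyads this week
        density = len(edges) / (n * (n - 1) / 2) if n > 1 else 0.0
        feats[week] = {
            "unique_people": n,
            "interactions": len(pairs),  # total talk events
            "density": density,          # realized / possible dyads
        }
    return feats

print(weekly_features(dyads)[1])
```

The study's actual feature set also includes spectral and temporal speech characteristics; this sketch covers only the network side.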
- Award ID(s): 1632707
- PAR ID: 10279608
- Editor(s): Ruban, Nersisson
- Date Published:
- Journal Name: PLOS ONE
- Volume: 16
- Issue: 7
- ISSN: 1932-6203
- Page Range / eLocation ID: e0250301
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Young children’s friendships fuel essential developmental outcomes (e.g., social-emotional competence) and are thought to provide even greater benefits to children with or at risk for disabilities. Teacher and parent reports and sociometric measures are commonly used to measure friendships, and ecobehavioral assessment has been used to capture their features on a momentary basis. In this proof-of-concept study, we use Ubisense, the Language ENvironment Analysis (LENA) recorder, and advanced speech processing algorithms to capture features of friendship: child–peer speech and proximity within activity areas. We collected 12,332 1-second speech and location data points. Our preliminary results indicate that the focal child at risk for a disability and each playmate spent time vocalizing near one another across 4 activity areas. Additionally, compared to the Blocks activity area, the children had significantly lower odds of talking while in proximity during Manipulatives and Science. This suggests that the activity areas children occupy may affect their engagement with peers and, in turn, the friendships they develop. The proposed approach is a groundbreaking advance in understanding and supporting children’s friendships.
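The "lower odds of talking while in proximity" comparison treats Blocks as the reference category, as in a logistic model with activity area as a predictor. With illustrative counts (hypothetical, not the study's data), the odds ratios can be computed directly:

```python
# Hypothetical counts of 1-second proximity intervals per
# activity area, split by whether the children were talking.
counts = {
    "Blocks":        {"talk": 120, "silent": 180},
    "Manipulatives": {"talk": 40,  "silent": 260},
    "Science":       {"talk": 30,  "silent": 270},
}

def odds(area):
    """Odds of talking while in proximity in a given area."""
    c = counts[area]
    return c["talk"] / c["silent"]

# Odds ratios against the Blocks reference category; values
# below 1 mean lower odds of talking than in Blocks.
for area in ("Manipulatives", "Science"):
    print(area, round(odds(area) / odds("Blocks"), 3))
# Manipulatives 0.231
# Science 0.167
```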
- Voice recognition has become an integral part of our lives, commonly used in call centers and as part of virtual assistants. However, voice recognition is increasingly applied to industrial uses as well. Each of these use cases has unique characteristics that may impact the effectiveness of voice recognition, which could in turn affect industrial productivity, performance, or even safety. One of the most prominent characteristics is the unique background noise that dominates each industry, driven by the machinery present and the layout of the workspace. Another is the type of communication found in these settings: daily communication often involves longer sentences uttered under relatively quiet conditions, whereas communication in industrial settings is often short and conducted in loud conditions. In this study, we demonstrated the importance of taking these two elements into account by comparing the performance of two voice recognition algorithms under several background noise conditions: a regular Convolutional Neural Network (CNN)-based voice recognition algorithm and an Automatic Speech Recognition (ASR)-based model with a denoising module. Our results indicate a significant performance drop between the typical background noise (white noise) and the other background noises. In addition, our custom ASR model with the denoising module outperformed the CNN-based model, with an overall performance increase of 14–35% across all background noises. Both results indicate that specialized voice recognition algorithms need to be developed for these environments before they can be reliably deployed as control mechanisms.
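Evaluations of this kind mix clean speech with each background noise at controlled signal-to-noise ratios. A minimal sketch of that mixing step, assuming numpy and synthetic signals (not the study's recordings):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB.
    A standard way to build noisy test conditions; a sketch,
    not the authors' exact evaluation setup."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)   # the "white noise" condition
mixed = mix_at_snr(speech, noise, snr_db=5.0)

# Check the realized SNR of the mixture
realized = 10 * np.log10(np.mean(speech**2) / np.mean((mixed - speech)**2))
print(round(realized, 1))  # → 5.0
```

Industrial conditions would substitute recorded machinery noise for the white-noise array; the scaling logic is unchanged.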
- Although state-of-the-art smart speakers can hear a user’s speech, unlike a human assistant these devices cannot figure out users’ verbal references based on their head location and orientation. Soundr presents a novel interaction technique that leverages the built-in microphone array found in most smart speakers to infer the user’s spatial location and head orientation using only their voice. With that extra information, Soundr can figure out users’ references to objects, people, and locations based on the speaker’s gaze, and can also provide relative directions. To provide training data for our neural network, we collected 751 minutes of data (50x that of the best prior work) from human speakers, leveraging a virtual reality headset to provide accurate head-tracking ground truth. Our results achieve an average positional error of 0.31 m and an orientation angle accuracy of 34.3° for each voice command. A user study evaluating user preferences for controlling IoT appliances by talking at them found this new approach to be fast and easy to use.
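Soundr learns location and orientation with a neural network, but the underlying signal is inter-microphone timing. For intuition only, a classical two-microphone direction-of-arrival baseline (not Soundr's method) computes the angle from the time-difference of arrival:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at roughly 20 °C

def doa_from_tdoa(delta_t, mic_spacing):
    """Direction of arrival (radians from broadside) for a
    two-microphone pair, from the time-difference of arrival
    delta_t (s) and the spacing between mics (m).
    Classical far-field baseline; Soundr itself infers position
    and head orientation with a learned model."""
    s = SPEED_OF_SOUND * delta_t / mic_spacing
    return math.asin(max(-1.0, min(1.0, s)))  # clamp for noise

# Example: 0.1 ms delay across microphones 7 cm apart
angle = math.degrees(doa_from_tdoa(1e-4, 0.07))
print(round(angle, 1))
```

A single pair yields only a bearing; resolving full 2D position and head orientation, as Soundr does, requires the whole array plus learned cues.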
- This paper investigates users’ speech rate adjustments during conversations with an Amazon Alexa socialbot in response to situational (in-lab vs. at-home) and communicative (ASR comprehension errors) factors. We conducted user interaction studies and measured speech rate at each turn in the conversation and in baseline productions (collected prior to the interaction). Overall, we find that users slow their speech rate when talking to the bot, relative to their pre-interaction productions, consistent with hyperarticulation. Speakers use an even slower speech rate in the in-lab setting (relative to at-home). We also see evidence for turn-level entrainment: the user follows the directionality of Alexa’s changes in rate in the immediately preceding turn. Yet we do not see differences in hyperarticulation or entrainment in response to ASR errors, or on the basis of user ratings of the interaction. Overall, this work has implications for human–computer interaction and for theories of linguistic adaptation and entrainment.
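The turn-level entrainment measure (whether the user's rate change follows the direction of Alexa's change on the preceding turn) can be sketched with hypothetical per-turn rates; the numbers are illustrative, not the paper's data:

```python
# Hypothetical per-turn speech rates (syllables/second). Real
# measurements would come from aligned transcripts, not shown.
bot_rates  = [4.0, 3.6, 3.9, 3.5]
user_rates = [4.8, 4.4, 4.6, 4.3]
user_baseline = 5.2  # pre-interaction baseline productions

# Every turn is slower than baseline: the hyperarticulation
# pattern the paper reports.
assert all(r < user_baseline for r in user_rates)

def entrainment_agreement(bot, user):
    """Fraction of turns where the user's rate change matches
    the direction of the bot's change on the preceding turn.
    A zero change counts as 'not faster' on either side."""
    hits = 0
    for i in range(1, len(user)):
        bot_dir = bot[i] - bot[i - 1]
        user_dir = user[i] - user[i - 1]
        hits += (bot_dir > 0) == (user_dir > 0)
    return hits / (len(user) - 1)

print(entrainment_agreement(bot_rates, user_rates))  # → 1.0
```

In these toy numbers the user tracks every one of the bot's rate changes; real data would yield a fraction between 0 and 1 to be tested statistically.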