Title: A Live Speech-Driven Avatar-Mediated Three-Party Telepresence System: Design and Evaluation
Abstract: In this article, we present a live speech-driven, avatar-mediated, three-party telepresence system through which three distant users, embodied as avatars in a shared 3D virtual world, can hold a natural three-party conversation without any tracking devices. Based on live speech input from the three users, the system generates the corresponding conversational motions of all the avatars in real time, including head motion, eye motion, lip movement, torso motion, and hand gestures. All motions are generated automatically on each user's side from the live speech input, and a cloud server transmits and synchronizes motion and speech among the users. We conducted a formal user study to evaluate the usability and effectiveness of the system by comparing it with a well-known online virtual world, Second Life, and a widely used online teleconferencing system, Skype. The user study results indicate that our system provides a measurably better telepresence experience than the two widely used alternatives.
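For intuition, here is a minimal, illustrative sketch (not the authors' implementation) of the client-side pipeline the abstract describes: each user's machine turns live speech into avatar motion locally and exchanges the result through a cloud relay. All function and class names below are hypothetical.

```python
# Minimal sketch (hypothetical, not the published system): each client converts
# live speech into avatar motion locally and sends it to a cloud relay, which
# the abstract describes as synchronizing motion and speech among the users.
import json
import socket
import time


def extract_speech_features(audio_frame: bytes) -> dict:
    """Placeholder acoustic feature extraction (e.g., frame energy)."""
    return {"energy": sum(audio_frame) / max(len(audio_frame), 1)}


def generate_conversational_motion(features: dict) -> dict:
    """Placeholder speech-to-motion mapping for one avatar."""
    talking = features["energy"] > 10.0
    return {
        "lip": "open" if talking else "closed",
        "head": "nod" if talking else "idle",
        "gesture": "beat" if talking else "rest",
    }


def client_loop(user_id: str, relay_host: str, relay_port: int, mic) -> None:
    """Send locally generated motion to the cloud relay (mic is a hypothetical audio source)."""
    sock = socket.create_connection((relay_host, relay_port))
    while True:
        audio = mic.read_frame()                              # live speech input
        motion = generate_conversational_motion(extract_speech_features(audio))
        packet = {"user": user_id, "t": time.time(), "motion": motion}
        sock.sendall((json.dumps(packet) + "\n").encode())    # upload to relay
        # A full client would also read the relay socket here and drive the two
        # remote avatars from the received, timestamp-aligned packets.
```

A real system would additionally stream the speech audio alongside the motion packets and align playback by timestamp, matching the synchronization role the abstract assigns to the cloud server.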
Award ID(s):
2005430 1524782
PAR ID:
10359118
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
PRESENCE: Virtual and Augmented Reality
Volume:
29
ISSN:
1531-3263
Page Range / eLocation ID:
113 to 139
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Virtual Reality (VR) telepresence platforms are being challenged to support live performances, sporting events, and conferences with thousands of users across seamless virtual worlds. Current systems have struggled to meet these demands, which has led to high-profile performance events with groups of users isolated in parallel sessions. The core difference between scaling VR environments and classic 2D video content delivery is the dynamic, spatially dependent peer-to-peer communication: users have many pairwise interactions that grow and shrink as they explore spaces. In this paper, we discuss the challenges of VR scaling and present an architecture that supports hundreds of users with spatial audio and video in a single virtual environment. We leverage the property of spatial locality with two key optimizations: (1) a Quality of Service (QoS) scheme that prioritizes audio and video traffic based on users' locality, and (2) a resource manager that allocates client connections across multiple servers based on user proximity within the virtual world. Through real-world deployments and extensive evaluations in real and simulated environments, we demonstrate the scalability of our platform while showing improved QoS compared with existing approaches.
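As a rough illustration of the spatial-locality optimization described in item 1 (not the paper's actual scheme), the sketch below ranks peers by distance in the virtual world and assigns each one an audio/video bitrate tier from a fixed bandwidth budget; the tier thresholds and budget values are arbitrary assumptions.

```python
# Illustrative only: prioritize each peer's audio/video streams by distance in
# the virtual world, the "spatial locality" idea in item 1. Numbers are made up.
import math
from dataclasses import dataclass


@dataclass
class Peer:
    peer_id: str
    position: tuple  # (x, y, z) in the virtual world


def distance(a: Peer, b: Peer) -> float:
    return math.dist(a.position, b.position)


def prioritize_streams(me: Peer, peers: list, budget_kbps: int) -> dict:
    """Give more of the bandwidth budget to nearer peers; mute the farthest ones."""
    ranked = sorted(peers, key=lambda p: distance(me, p))
    allocation = {}
    remaining = budget_kbps
    for rank, peer in enumerate(ranked):
        # Hypothetical tiering: nearest peers get audio + video, mid-range peers
        # get audio only, and the remainder are dropped once the budget runs out.
        if rank < 4 and remaining >= 600:
            allocation[peer.peer_id] = {"video_kbps": 500, "audio_kbps": 100}
            remaining -= 600
        elif remaining >= 50:
            allocation[peer.peer_id] = {"video_kbps": 0, "audio_kbps": 50}
            remaining -= 50
        else:
            allocation[peer.peer_id] = {"video_kbps": 0, "audio_kbps": 0}
    return allocation
```

The paper's second optimization, assigning client connections to servers by proximity, could reuse the same distance ranking at the resource-manager level.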
  2. The use of virtual humans (i.e., avatars) holds the potential for automated, interactive communication in domains such as remote communication, customer service, or public announcements. For signed language users, signing avatars could potentially provide accessible content by sharing information in the signer's preferred or native language. As the development of signing avatars has gained traction in recent years, researchers have devised many different methods of creating them. The resulting avatars vary widely in their appearance, the naturalness of their movements, and their facial expressions, all of which may affect users' acceptance of the avatars. We designed a study to test the effects of these intrinsic properties of different signing avatars while also examining the extent to which people's own language experiences change their responses to signing avatars. We created video stimuli showing individual signs produced by (1) a live human signer (Human), (2) an avatar made using computer-synthesized animation (CS Avatar), and (3) an avatar made using high-fidelity motion capture (Mocap Avatar). We surveyed 191 American Sign Language users, including Deaf (N = 83), Hard-of-Hearing (N = 34), and Hearing (N = 67) groups. Participants rated the three signers on multiple dimensions, which were then combined to form ratings of Attitudes, Impressions, Comprehension, and Naturalness. Analyses demonstrated that the Mocap Avatar was rated significantly more positively than the CS Avatar on all primary variables. Correlations revealed that signers who acquire sign language later in life are more accepting of, and more likely to have positive impressions of, signing avatars. Finally, those who learned ASL earlier were more likely to give lower, more negative ratings to the CS Avatar, but we did not see this association for the Mocap Avatar or the Human signer. Together, these findings suggest that movement quality and appearance significantly impact users' ratings of signing avatars and show that signed language users with an earlier age of ASL acquisition are the most sensitive to the movement quality issues seen in computer-generated avatars. We suggest that future efforts to develop signing avatars consider retaining the fluid movement qualities integral to signed languages.
  3. In this paper, we propose a novel conditional generative adversarial network (cGAN) architecture, called S2M-Net, to holistically synthesize realistic three-party conversational animations from acoustic speech input together with speaker marking (i.e., the speaking time of each interlocutor). Specifically, based on a pre-collected three-party conversational motion dataset, we design and train the S2M-Net for three-party conversational animation synthesis. In this architecture, the generator contains an LSTM encoder that encodes a sequence of acoustic speech features into a latent vector, which is fed into a transform unit that maps the latent vector into a gesture kinematics space. The output of this transform unit is then fed into an LSTM decoder to generate the corresponding three-party conversational gesture kinematics. Meanwhile, a discriminator checks whether an input sequence of three-party conversational gesture kinematics is real or fake. To evaluate our method, besides quantitative and qualitative evaluations, we also conducted paired-comparison user studies against the state of the art.
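A rough PyTorch sketch of the generator/discriminator structure described in item 3 appears below; the layer sizes, the internals of the transform unit, and how speaker marking is folded into the acoustic features are assumptions, not the published S2M-Net configuration.

```python
# Sketch of the described structure: LSTM encoder -> latent vector -> transform
# unit -> LSTM decoder (generator), plus an LSTM discriminator. Dimensions are
# placeholders; speech_dim is assumed to include the speaker-marking channels.
import torch
import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, speech_dim=32, motion_dim=3 * 45, hidden=256, latent=128):
        super().__init__()
        self.encoder = nn.LSTM(speech_dim, hidden, batch_first=True)  # speech -> hidden states
        self.to_latent = nn.Linear(hidden, latent)                    # last state -> latent vector
        self.transform = nn.Sequential(                               # latent -> gesture space
            nn.Linear(latent, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)      # unrolled over time
        self.out = nn.Linear(hidden, motion_dim)                      # three-party gesture kinematics

    def forward(self, speech):                          # speech: (batch, T, speech_dim)
        _, (h, _) = self.encoder(speech)
        z = self.transform(self.to_latent(h[-1]))       # (batch, hidden)
        z_seq = z.unsqueeze(1).repeat(1, speech.size(1), 1)
        dec, _ = self.decoder(z_seq)
        return self.out(dec)                            # (batch, T, motion_dim)


class Discriminator(nn.Module):
    def __init__(self, motion_dim=3 * 45, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(motion_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)                # real-vs-fake score

    def forward(self, motion):                          # motion: (batch, T, motion_dim)
        _, (h, _) = self.rnn(motion)
        return torch.sigmoid(self.head(h[-1]))
```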
  4. We present here a new system, in which signing avatars (computer-animated virtual humans built from motion capture recordings) teach introductory American Sign Language (ASL) in an immersive virtual environment. The system is called Signing Avatars & Immersive Learning (SAIL). The significant contributions of this work are 1) the use of signing avatars, built from state-of-the-art motion capture recordings of a native signer; 2) the integration with LEAP gesture tracking hardware, allowing the user to see his or her own movements within the virtual environment; 3) the development of appropriate introductory ASL vocabulary, delivered in semi-interactive lessons; and 4) the 3D environment in which a user accesses the system. 
  5. Shared control systems can make complex robot teleoperation tasks easier for users. These systems predict the user’s goal, determine the motion required for the robot to reach that goal, and combine that motion with the user’s input. Goal prediction is generally based on the user’s control input (e.g., the joystick signal). In this paper, we show that this prediction method is especially effective when users follow standard noisily optimal behavior models. In tasks with input constraints like modal control, however, this effectiveness no longer holds, so additional sources for goal prediction can improve assistance. We implement a novel shared control system that combines natural eye gaze with joystick input to predict people’s goals online, and we evaluate our system in a real-world, COVID-safe user study. We find that modal control reduces the efficiency of assistance according to our model, and when gaze provides a prediction earlier in the task, the system’s performance improves. However, gaze on its own is unreliable and assistance using only gaze performs poorly. We conclude that control input and natural gaze serve different and complementary roles in goal prediction, and using them together leads to improved assistance. 
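The sketch below illustrates, in simplified form and not as the paper's implementation, how joystick input and natural gaze can jointly drive goal prediction as in item 5: a belief over candidate goals is updated with a noisily-optimal control likelihood and a gaze-distance likelihood, and the user's command is blended with motion toward the most likely goal. The specific likelihood models and the blending weight are assumptions.

```python
# Illustrative shared-control sketch: Bayesian goal inference from joystick
# direction and gaze location, followed by command blending. Modeling choices
# (exponential alignment likelihood, Gaussian gaze likelihood) are assumptions.
import math

import numpy as np


def update_belief(belief, robot_pos, joystick, gaze_point, goals, beta=5.0, sigma=0.3):
    """One Bayesian update: goals the joystick points toward and the gaze lands near gain mass."""
    new_belief = np.array(belief, dtype=float)
    for i, goal in enumerate(goals):
        to_goal = goal - robot_pos
        to_goal = to_goal / (np.linalg.norm(to_goal) + 1e-9)
        # Noisily-optimal control model: likelihood grows with input/goal alignment.
        align = float(np.dot(joystick, to_goal) / (np.linalg.norm(joystick) + 1e-9))
        control_like = math.exp(beta * align)
        # Gaze model: likelihood falls off with distance between gaze point and goal.
        gaze_like = math.exp(-np.linalg.norm(gaze_point - goal) ** 2 / (2 * sigma ** 2))
        new_belief[i] *= control_like * gaze_like
    return new_belief / new_belief.sum()


def blend_command(joystick, robot_pos, goals, belief, alpha=0.6):
    """Shared control: mix the user's command with autonomous motion toward the likely goal."""
    likely_goal = goals[int(np.argmax(belief))]
    auto = likely_goal - robot_pos
    auto = auto / (np.linalg.norm(auto) + 1e-9)
    return alpha * auto + (1 - alpha) * joystick
```

The paper's point that gaze is useful early but unreliable alone would correspond here to weighting the gaze likelihood more heavily at the start of a task and relying on the control likelihood as motion progresses.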