Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Social signal processing algorithms have become increasingly better at solving well-defined prediction and estimation problems in audiovisual recordings of group discussion. However, much human behavior and communication is less structured and more subtle. In this paper, we address the problem of generic question answering from diverse audiovisual recordings of human interaction. The goal is to select the correct free-text answer to a free-text question about human interaction in a video. We propose an RNN-based model with two novel ideas: a temporal attention module that highlights key words and phrases in the question and candidate answers, and a consistency measurement module that scores the similarity between the multimodal data, the question, and the candidate answers. This small set of consistency scores forms the input to the final question-answering stage, resulting in a lightweight model. We demonstrate that our model achieves state of the art accuracy on the Social-IQ dataset containing hundreds of videos and question/answer pairs.more » « less
-
Transcripts of natural, multi-person meetings differ significantly from documents like news articles, which can make Natural Language Generation models generate unfocused summaries. We develop an abstractive meeting summarizer from both videos and audios of meeting recordings. Specifically, we propose a multi-modal hierarchical attention mechanism across three levels: topic segment, utterance and word. To narrow down the focus into topically-relevant segments, we jointly model topic segmentation and summarization. In addition to traditional textual features, we introduce new multi-modal features derived from visual focus of attention, based on the assumption that an utterance is more important if its speaker receives more attention. Experiments show that our model significantly outperforms the state-of-the-art with both BLEU and ROUGE measures.more » « less
-
Studying group dynamics requires fine-grained spatial and temporal understanding of human behavior. Social psychologists studying human interaction patterns in face-to-face group meetings often find themselves struggling with huge volumes of data that require many hours of tedious manual coding. There are only a few publicly available multi-modal datasets of face-to-face group meetings that enable the development of automated methods to study verbal and non-verbal human behavior. In this paper, we present a new, publicly available multi-modal dataset for group dynamics study that differs from previous datasets in its use of ceiling-mounted, unobtrusive depth sensors. These can be used for fine-grained analysis of head and body pose and gestures, without any concerns about participants' privacy or inhibited behavior. The dataset is complemented by synchronized and time-stamped meeting transcripts that allow analysis of spoken content. The dataset comprises 22 group meetings in which participants perform a standard collaborative group task designed to measure leadership and productivity. Participants' post-task questionnaires, including demographic information, are also provided as part of the dataset. We show the utility of the dataset in analyzing perceived leadership, contribution, and performance, by presenting results of multi-modal analysis using our sensor-fusion algorithms designed to automatically understand audio-visual interactions.more » « less
-
Group discussions are usually aimed at sharing opinions, reaching consensus and making good decisions based on group knowledge. During a discussion, participants might adjust their own opinions as well as tune their attitudes towards others’ opinions, based on the unfolding interactions. In this paper, we demonstrate a framework to visualize such dynamics; at each instant of a conversation, the participants’ opinions and potential influence on their counterparts is easily visualized. We use multi-party meeting opinion mining based on bipartite graphs to extract opinions and calculate mutual influential factors, using the Lunar Survival Task as a study case.more » « less
-
Group meetings can suffer from serious problems that undermine performance, including bias, "groupthink", fear of speaking, and unfocused discussion. To better understand these issues, propose interventions, and thus improve team performance, we need to study human dynamics in group meetings. However, this process currently heavily depends on manual coding and video cameras. Manual coding is tedious, inaccurate, and subjective, while active video cameras can affect the natural behavior of meeting participants. Here, we present a smart meeting room that combines microphones and unobtrusive ceiling-mounted Time-of-Flight (ToF) sensors to understand group dynamics in team meetings. We automatically process the multimodal sensor outputs with signal, image, and natural language processing algorithms to estimate participant head pose, visual focus of attention (VFOA), non-verbal speech patterns, and discussion content. We derive metrics from these automatic estimates and correlate them with user-reported rankings of emergent group leaders and major contributors to produce accurate predictors. We validate our algorithms and report results on a new dataset of lunar survival tasks of 36 individuals across 10 groups collected in the multimodal-sensor-enabled smart room.more » « less
-
Spherical microphone arrays have attained considerable interest in recent years for their ability to decompose three-dimensional soundfields. This paper details real-time capabilities of a source-tracking system composed of a beamforming array and multiple lavalier microphones. Using the lavalier microphones for source identification, a particle filter can be implemented to allow independent tracking of the orientation of multiple sources simultaneously. This source identification and tracking mechanism is utilized in an immersive lab space. In conjunction with networked audiovisual equipment, the system can generate a real-time virtual representation of sound sources for a more dynamic telematic experience.more » « less
-
Spherical microphone arrays have attained considerable interest in recent years for their ability to decompose three-dimensional soundfields. This paper details real-time capabilities of a source-tracking system composed of a beamforming array and multiple lavalier microphones. Using the lavalier microphones for source identification, a particle filter can be implemented to allow independent tracking of the orientation of multiple sources simultaneously. This source identification and tracking mechanism is utilized in an immersive lab space. In conjunction with networked audiovisual equipment, the system can generate a real-time virtual representation of sound sources for a more dynamic telematic experience.more » « less