The increased prevalence of online meetings has significantly enhanced the practicality of a model that can automatically generate the summary of a given meeting. This paper introduces a novel and effective approach to automate the generation of meeting summaries. Current approaches to this problem generate general and basic summaries, considering the meeting simply as a long dialogue. However, our novel algorithms can generate abstractive meeting summaries that are driven by the action items contained in the meeting transcript. This is done by recursively generating summaries and employing our action-item extraction algorithm for each section of the meeting in parallel. All of these sectional summaries are then combined and summarized together to create a coherent and action-item-driven summary. In addition, this paper introduces three novel methods for dividing up long transcripts into topic-based sections to improve the time efficiency of our algorithm, as well as to resolve the issue of large language models (LLMs) forgetting long-term dependencies. Our pipeline achieved a BERTScore of 64.98 across the AMI corpus, which is an approximately 4.98% increase from the current state-of-the-art result produced by a fine-tuned BART (Bidirectional and Auto-Regressive Transformers) model.
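As a rough illustration, the divide-and-conquer pipeline described in the abstract above can be sketched as follows. The `summarize` and `extract_action_items` functions here are toy placeholders, not the paper's methods (the paper uses LLM-based summarization and a dedicated extraction algorithm), and the commitment-cue list is an assumption made for this example:

```python
# Hedged sketch of a sectional, action-item-driven summarization pipeline.
# Both helper functions are illustrative stand-ins for the paper's models.
from concurrent.futures import ThreadPoolExecutor

def summarize(text):
    # Placeholder: return the first sentence; a real pipeline calls an LLM.
    return text.split(".")[0].strip() + "."

def extract_action_items(text):
    # Placeholder heuristic: keep sentences containing commitment cues.
    cues = ("will ", "need to ", "should ")
    return [s.strip() + "." for s in text.split(".")
            if any(c in s.lower() for c in cues)]

def summarize_meeting(sections):
    # Summarize each topic-based section, and mine its action items,
    # in parallel; then merge the partial results into one summary.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(summarize, sections))
        actions = list(pool.map(extract_action_items, sections))
    overall = summarize(" ".join(partials))
    flat = [item for sec in actions for item in sec]
    return overall + (" Action items: " + " ".join(flat) if flat else "")
```

Each section is processed independently, which is what allows the parallelism (and keeps each LLM call well under the context-length limit in the real pipeline).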
Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization
Transcripts of natural, multi-person meetings differ significantly from documents like news articles, which can lead Natural Language Generation models to produce unfocused summaries. We develop an abstractive meeting summarizer from both the video and audio of meeting recordings. Specifically, we propose a multi-modal hierarchical attention mechanism across three levels: topic segment, utterance, and word. To narrow the focus down to topically relevant segments, we jointly model topic segmentation and summarization. In addition to traditional textual features, we introduce new multi-modal features derived from visual focus of attention, based on the assumption that an utterance is more important if its speaker receives more attention. Experiments show that our model significantly outperforms the state-of-the-art on both BLEU and ROUGE measures.
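For contrast with the learned, jointly modeled segmentation described above, a purely lexical baseline for topic segmentation can be sketched as below. This is a TextTiling-style heuristic, not the paper's multi-modal segmenter; the window size and Jaccard threshold are arbitrary assumptions for illustration:

```python
# Minimal text-only topic segmentation: place a boundary wherever the
# lexical overlap between adjacent utterance windows drops below a
# threshold. A stand-in for learned segmentation, not the paper's model.

def _words(utterances):
    # Bag of lowercased word types in a window of utterances.
    return {w.lower().strip(".,") for u in utterances for w in u.split()}

def segment_topics(utterances, window=2, threshold=0.2):
    """Split a list of utterances into topically coherent segments."""
    boundaries = []
    for i in range(window, len(utterances) - window + 1):
        left = _words(utterances[i - window:i])
        right = _words(utterances[i:i + window])
        overlap = len(left & right) / max(len(left | right), 1)  # Jaccard
        if overlap < threshold:
            boundaries.append(i)
    segments, start = [], 0
    for b in boundaries:
        if b > start:  # skip boundaries that would create empty segments
            segments.append(utterances[start:b])
            start = b
    segments.append(utterances[start:])
    return segments
```

A learned segmenter such as the one in the paper can exploit discourse and visual cues that this word-overlap heuristic cannot, but the heuristic makes the underlying idea (low cohesion across a boundary) concrete.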
- Award ID(s):
- 1631674
- PAR ID:
- 10107386
- Date Published:
- Journal Name:
- 57th Conference of the Association for Computational Linguistics
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Group meetings can suffer from serious problems that undermine performance, including bias, "groupthink", fear of speaking, and unfocused discussion. To better understand these issues, propose interventions, and thus improve team performance, we need to study human dynamics in group meetings. However, this process currently heavily depends on manual coding and video cameras. Manual coding is tedious, inaccurate, and subjective, while active video cameras can affect the natural behavior of meeting participants. Here, we present a smart meeting room that combines microphones and unobtrusive ceiling-mounted Time-of-Flight (ToF) sensors to understand group dynamics in team meetings. We automatically process the multimodal sensor outputs with signal, image, and natural language processing algorithms to estimate participant head pose, visual focus of attention (VFOA), non-verbal speech patterns, and discussion content. We derive metrics from these automatic estimates and correlate them with user-reported rankings of emergent group leaders and major contributors to produce accurate predictors. We validate our algorithms and report results on a new dataset of lunar survival tasks of 36 individuals across 10 groups collected in the multimodal-sensor-enabled smart room.
-
Manual examination of chest x-rays is a time-consuming process that involves significant effort by expert radiologists. Recent work attempts to alleviate this problem by developing learning-based automated chest x-ray analysis systems that map images to multi-label diagnoses using deep neural networks. These methods are often treated as black boxes, or they output attention maps but don't explain why the attended areas are important. Given data consisting of a frontal-view x-ray, a set of natural language findings, and one or more diagnostic impressions, we propose a deep neural network model that during training simultaneously 1) constructs a topic model which clusters key terms from the findings into meaningful groups, 2) predicts the presence of each topic for a given input image based on learned visual features, and 3) uses an image's predicted topic encoding as features to predict one or more diagnoses. Since the net learns the topic model jointly with the classifier, it gives us a powerful tool for understanding which semantic concepts the net might be exploiting when making diagnoses, and since we constrain the net to predict topics based on expert-annotated reports, the net automatically encodes some higher-level expert knowledge about how to make diagnoses.
-
We compare two approaches for training statistical parametric voices that make use of acoustic and prosodic features at the utterance level with the aim of improving naturalness of the resultant voices -- subset adaptation, and adding new acoustic and prosodic features at the frontend. We have found that the approach of labeling high, middle, or low values for a given feature at the frontend and then choosing which setting to use at synthesis time can produce voices rated as significantly more natural than a baseline voice that uses only the standard contextual frontend features, for both HMM-based and neural network-based synthesis.

