Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the pre-trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication), because the pre-trained models lack the components needed to accept the two additional modalities of vision and acoustics. In this paper, we propose an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representations of BERT and XLNet, conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
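To make the shift mechanism concrete, here is a minimal PyTorch sketch of a MAG-style gate. It is illustrative only: the feature dimensions, gating functions, and norm-based scaling rule shown here are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultimodalAdaptationGate(nn.Module):
    """Sketch of a MAG-style shift: the language hidden state is displaced by
    a vector conditioned on token-aligned visual and acoustic features."""

    def __init__(self, text_dim=768, visual_dim=47, acoustic_dim=74, beta=1.0, eps=1e-6):
        super().__init__()
        # gates decide how much each nonverbal modality contributes per token
        self.gate_v = nn.Linear(text_dim + visual_dim, text_dim)
        self.gate_a = nn.Linear(text_dim + acoustic_dim, text_dim)
        # projections map nonverbal features into the text embedding space
        self.proj_v = nn.Linear(visual_dim, text_dim)
        self.proj_a = nn.Linear(acoustic_dim, text_dim)
        self.layer_norm = nn.LayerNorm(text_dim)
        self.beta = beta  # assumed scaling hyperparameter
        self.eps = eps

    def forward(self, h, visual, acoustic):
        # h: (batch, seq_len, text_dim); visual/acoustic are aligned per token
        g_v = torch.relu(self.gate_v(torch.cat([h, visual], dim=-1)))
        g_a = torch.relu(self.gate_a(torch.cat([h, acoustic], dim=-1)))
        shift = g_v * self.proj_v(visual) + g_a * self.proj_a(acoustic)
        # cap the shift so it never dominates the language representation
        alpha = torch.clamp(
            h.norm(dim=-1, keepdim=True)
            / (shift.norm(dim=-1, keepdim=True) + self.eps) * self.beta,
            max=1.0,
        )
        return self.layer_norm(h + alpha * shift)
```

In this sketch the gated, scaled shift is added to the hidden states of one Transformer layer during fine-tuning, leaving the rest of BERT or XLNet unchanged.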
Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors
Humans convey their intentions through the use of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations.
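The shifting idea can be sketched as follows, assuming each spoken word is paired with a short sequence of visual and acoustic frames. The module and parameter names here (`NonverbalShift`, `hidden`, `alpha`) are hypothetical stand-ins for RAVEN's actual architecture, not a reproduction of it.

```python
import torch
import torch.nn as nn

class NonverbalShift(nn.Module):
    """Sketch of a RAVEN-style shift: nonverbal frames spanning one word are
    summarized by recurrent encoders, then gate a displacement of the word vector."""

    def __init__(self, word_dim=300, visual_dim=47, acoustic_dim=74, hidden=64):
        super().__init__()
        # nonverbal subword encoders: one frame sequence per spoken word
        self.visual_rnn = nn.LSTM(visual_dim, hidden, batch_first=True)
        self.acoustic_rnn = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        # attention gates conditioned on the word embedding and each summary
        self.gate_v = nn.Linear(word_dim + hidden, 1)
        self.gate_a = nn.Linear(word_dim + hidden, 1)
        self.to_word_space = nn.Linear(2 * hidden, word_dim)

    def forward(self, word_emb, visual_frames, acoustic_frames, alpha=0.5):
        # word_emb: (batch, word_dim); *_frames: (batch, n_frames, feat_dim)
        _, (h_v, _) = self.visual_rnn(visual_frames)
        _, (h_a, _) = self.acoustic_rnn(acoustic_frames)
        h_v, h_a = h_v[-1], h_a[-1]  # final hidden states as summaries
        w_v = torch.sigmoid(self.gate_v(torch.cat([word_emb, h_v], dim=-1)))
        w_a = torch.sigmoid(self.gate_a(torch.cat([word_emb, h_a], dim=-1)))
        shift = self.to_word_space(torch.cat([w_v * h_v, w_a * h_a], dim=-1))
        # the shifted embedding carries the word's nonverbal context downstream
        return word_emb + alpha * shift
```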
- PAR ID: 10126148
- Date Published:
- Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
- Volume: 33
- ISSN: 2159-5399
- Page Range / eLocation ID: 7216 to 7223
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Over the past 35 years, it has been established that mental representations of language include fine-grained acoustic details stored in episodic memory. The empirical foundations of this fact were established through a series of word recognition experiments showing that participants were better at remembering words repeated by the same talker than words repeated by a different talker (talker-specificity effect). This effect has been widely replicated, but exclusively with isolated, generally monosyllabic, words as the object of study. Whether fine-grained acoustic detail plays a role in the encoding and retrieval of larger structures, such as spoken sentences, has important implications for theories of language understanding in natural communicative contexts. In this study, we extended traditional recognition memory methods to use full spoken sentences rather than individual words as stimuli. Additionally, we manipulated attention at the time of encoding in order to probe the automaticity of fine-grained acoustic encoding. Participants were more accurate for sentences repeated by the same talker than by a different talker. They were also faster and more accurate in the Full Attention than in the Divided Attention condition. The specificity effect was more pronounced for the Divided Attention than the Full Attention group. These findings provide evidence for specificity at the sentence level. They also highlight the implicit, automatic encoding of fine-grained acoustic detail and point to a central role for cognitive resource allocation in shaping memory-based language representations.
Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora. Since PLMs capture word semantics in different contexts, the quality of word representations highly depends on word frequency, which usually follows a heavy-tailed distribution in the pre-training corpus. Therefore, the embeddings of rare words on the tail are usually poorly optimized. In this work, we focus on enhancing language model pre-training by leveraging definitions of rare words in dictionaries (e.g., Wiktionary). To incorporate a rare word definition as part of the input, we fetch its definition from the dictionary and append it to the end of the input text sequence. In addition to training with the masked language modeling objective, we propose two novel self-supervised pre-training tasks on word- and sentence-level alignment between the input text sequence and rare word definitions to enhance language model representations with the dictionary. We evaluate the proposed Dict-BERT model on the language understanding benchmark GLUE and eight specialized domain benchmark datasets. Extensive experiments demonstrate that Dict-BERT can significantly improve the understanding of rare words and boost model performance on various NLP downstream tasks.
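A minimal sketch of this input-augmentation step is shown below. The helper `collect_rare_definitions`, the frequency threshold, and the toy dictionary are hypothetical stand-ins for the paper's actual preprocessing pipeline; the definitions are simply encoded as a second segment so they end up appended after the original sequence.

```python
from transformers import AutoTokenizer

def collect_rare_definitions(text, dictionary, vocab_counts, rare_threshold=10):
    """Gather dictionary definitions for the rare words appearing in `text`."""
    rare = [w for w in text.lower().split()
            if vocab_counts.get(w, 0) < rare_threshold and w in dictionary]
    return " ; ".join(f"{w} : {dictionary[w]}" for w in rare)

# Hypothetical frequency table and mini-dictionary, for illustration only.
vocab_counts = {"the": 100000, "reviewer": 500, "perspicacious": 2}
dictionary = {"perspicacious": "having keen insight or good judgment"}

text = "The reviewer was perspicacious"
definitions = collect_rare_definitions(text, dictionary, vocab_counts)

# Encoding the definitions as a second segment appends them (after a [SEP])
# to the end of the original input sequence.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer(text, definitions or None, return_tensors="pt")
```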
Background/Objectives: The Implicit Prosody Hypothesis (IPH) posits that individuals generate internal prosodic representations during silent reading, mirroring those produced in spoken language. While converging behavioral evidence supports the IPH, the underlying neurocognitive mechanisms remain largely unknown. Therefore, this study investigated the neurophysiological markers of sensitivity to speech rhythm cues during silent word reading. Methods: EEGs were recorded while participants silently read four-word sequences, each composed of either trochaic words (stressed on the first syllable) or iambic words (stressed on the second syllable). Each sequence was followed by a target word that was either metrically congruent or incongruent with the preceding rhythmic pattern. To investigate the effects of metrical expectancy and lexical stress type, we examined single-trial event-related potentials (ERPs) and time–frequency representations (TFRs) time-locked to target words. Results: The results showed significant differences based on the stress pattern expectancy and type. Specifically, words that carried unexpected stress elicited larger ERP negativities between 240 and 628 ms after the word onset. Furthermore, different frequency bands were sensitive to distinct aspects of the rhythmic structure in language. Alpha activity tracked the rhythmic expectations, and theta and beta activities were sensitive to both the expected rhythms and specific locations of the stressed syllables. Conclusions: The findings clarify neurocognitive mechanisms of phonological and lexical mental representations during silent reading using a conservative data-driven approach. Similarity with neural response patterns previously reported for spoken language contexts suggests shared neural networks for implicit and explicit speech rhythm processing, further supporting the IPH and emphasizing the centrality of prosody in reading.
Computational models of verbal analogy and relational similarity judgments can employ different types of vector representations of word meanings (embeddings) generated by machine-learning algorithms. An important question is whether human-like relational processing depends on explicit representations of relations (i.e., representations separable from those of the concepts being related), or whether implicit relation representations suffice. Earlier machine-learning models produced static embeddings for individual words, identical across all contexts. However, more recent Large Language Models (LLMs), which use transformer architectures applied to much larger training corpora, are able to produce contextualized embeddings that have the potential to capture implicit knowledge of semantic relations. Here we compare multiple models based on different types of embeddings to human data concerning judgments of relational similarity and solutions of verbal analogy problems. For two datasets, a model that learns explicit representations of relations, Bayesian Analogy with Relational Transformations (BART), captured human performance more successfully than either a model using static embeddings (Word2vec) or models using contextualized embeddings created by LLMs (BERT, RoBERTa, and GPT-2). These findings support the proposal that human thinking depends on representations that separate relations from the concepts they relate.
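For concreteness, the sketch below shows what an implicit relation representation looks like: the relation between two words is approximated by their embedding difference, and relational similarity is the cosine between two such differences. The random vectors are placeholders for real Word2vec or LLM embeddings, so the printed value is not meaningful beyond illustrating the computation.

```python
import numpy as np

def relational_similarity(emb, pair1, pair2):
    """Cosine similarity between two implicit relation vectors, each formed
    as the difference between the embeddings of a word pair."""
    r1 = emb[pair1[1]] - emb[pair1[0]]
    r2 = emb[pair2[1]] - emb[pair2[0]]
    return float(np.dot(r1, r2) / (np.linalg.norm(r1) * np.linalg.norm(r2)))

# Toy embeddings standing in for trained word vectors (illustrative only).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["puppy", "dog", "kitten", "cat"]}
print(relational_similarity(emb, ("puppy", "dog"), ("kitten", "cat")))
```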