Multimodal fusion addresses the problem of analyzing spoken words in the multimodal context, including visual expressions and prosodic cues. Even when multimodal models lead to performance improvements, it is often unclear whether bimodal and trimodal interactions are learned or whether modalities are processed independently of each other. We propose Multimodal Residual Optimization (MRO) to separate unimodal, bimodal, and trimodal interactions in a multimodal model. This improves interpretability as the multimodal interaction can be quantified. Inspired by Occam’s razor, the main intuition of MRO is that (simpler) unimodal contributions should be learned before learning (more complex) bimodal and trimodal interactions. For example, bimodal predictions should learn to correct the mistakes (residuals) of unimodal predictions, thereby letting the bimodal predictions focus on the remaining bimodal interactions. Empirically, we observe that MRO successfully separates unimodal, bimodal, and trimodal interactions while not degrading predictive performance. We complement our empirical results with a human perception study and observe that MRO learns multimodal interactions that align with human judgments.
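To make the staged-residual intuition concrete, here is a minimal sketch in PyTorch; the module names, feature dimensions, and head architecture are illustrative assumptions rather than the authors' MRO implementation. Unimodal heads produce a first prediction, bimodal heads account for the error that remains after the unimodal stage, and the trimodal head for whatever error remains after that, so each stage's output can be read off as that order of interaction.

```python
import torch
import torch.nn as nn

def make_head(in_dim, out_dim=1):
    """A small prediction head; one is built per modality subset."""
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

def staged_predictions(uni_heads, bi_heads, tri_head, feats):
    """Return unimodal, unimodal+bimodal, and full trimodal predictions.

    feats maps modality names ('t', 'a', 'v' for text, audio, vision) to
    feature tensors. Each higher-order head sees concatenated features; if it
    is trained only on the residual error left by the previous stage, its
    output can be read as that order's interaction contribution.
    """
    uni = sum(uni_heads[m](feats[m]) for m in ('t', 'a', 'v'))
    bi = sum(bi_heads[a + b](torch.cat([feats[a], feats[b]], dim=-1))
             for a, b in (('t', 'a'), ('t', 'v'), ('a', 'v')))
    tri = tri_head(torch.cat([feats['t'], feats['a'], feats['v']], dim=-1))
    return uni, uni + bi, uni + bi + tri
```

A training loop would fit the unimodal heads first and then train each higher-order stage on the error that remains (for instance, with stop-gradients between stages), which is one way to operationalize the simple-before-complex ordering described above.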
The Role of Unimodal Feedback Pathways in Gender Perception During Activation of Voice and Face Areas
Cross-modal effects provide a model framework for investigating hierarchical inter-areal processing, particularly under conditions where unimodal cortical areas receive contextual feedback from other modalities. Here, using complementary behavioral and brain imaging techniques, we investigated the functional networks participating in face and voice processing during gender perception, a high-level feature of voice and face perception. Within the framework of a signal detection decision model, maximum likelihood conjoint measurement (MLCM) was used to estimate the contributions of the face and voice to gender comparisons between pairs of audio-visual stimuli in which the face and voice were independently modulated. Top-down contributions were varied by instructing participants to make judgments based on the gender of either the face, the voice, or both modalities (N = 12 for each task). Estimated face and voice contributions to the judgments of the stimulus pairs were not independent; both contributed to all tasks, but their respective weights varied over a 40-fold range due to top-down influences. Models that best described the modal contributions required the inclusion of two different top-down interactions: (i) an interaction that depended on gender congruence across modalities (i.e., the difference in gender between the face and voice for each stimulus); and (ii) an interaction that depended on the gender magnitude within each modality. The significance of these interactions was task dependent. Specifically, the gender congruence interaction was significant for the face and voice tasks, while the gender magnitude interaction was significant for the face and stimulus tasks. Subsequently, we used the same stimuli and related tasks in a functional magnetic resonance imaging (fMRI) paradigm (N = 12) to explore the neural correlates of these perceptual processes, analyzed with Dynamic Causal Modeling (DCM) and Bayesian Model Selection. Results revealed changes in effective connectivity between the unimodal Fusiform Face Area (FFA) and Temporal Voice Area (TVA) in a fashion that paralleled the face and voice behavioral interactions observed in the psychophysical data. These findings highlight the role of multiple parallel unimodal feedback pathways in perception.
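As a rough illustration of the additive decision model at the core of MLCM (a sketch under assumed variable names, omitting the congruence and magnitude interaction terms that the full analysis adds): each stimulus is assigned an internal value equal to the sum of a face scale value and a voice scale value, and the probability of judging the first member of a pair as, say, more masculine is the probability that the noisy difference of these sums exceeds zero.

```python
import numpy as np
from scipy.stats import norm

def p_first_more_masculine(psi_face, psi_voice, pair, sigma=1.0):
    """Additive MLCM choice probability for one audio-visual stimulus pair.

    psi_face, psi_voice: arrays of perceptual scale values for the face and
        voice morph levels (the quantities estimated by maximum likelihood).
    pair: ((i, j), (k, l)), the (face level, voice level) of each stimulus.
    """
    (i, j), (k, l) = pair
    delta = (psi_face[i] + psi_voice[j]) - (psi_face[k] + psi_voice[l])
    return norm.cdf(delta / sigma)  # Gaussian decision noise on the difference
```

Fitting the scale values by maximum likelihood over all pairwise judgments (in practice, as a binomial model with a probit link) yields the face and voice contributions whose relative weights are reported above; the two top-down interactions enter as additional terms in the decision variable.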
- Award ID(s): 1724297
- PAR ID: 10327420
- Date Published:
- Journal Name: Frontiers in Systems Neuroscience
- Volume: 15
- ISSN: 1662-5137
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Comparing representations of complex stimuli in neural network layers to human brain representations or behavioral judgments can guide model development. However, even qualitatively distinct neural network models often predict similar representational geometries of typical stimulus sets. We propose a Bayesian experimental design approach to synthesizing stimulus sets for adjudicating among representational models efficiently. We apply our method to discriminate among candidate neural network models of behavioral face dissimilarity judgments. Our results indicate that a neural network trained to invert a 3D-face-model graphics renderer is more human-aligned than the same architecture trained on identification, classification, or autoencoding. Our proposed stimulus synthesis objective is generally applicable to designing experiments to be analyzed by representational similarity analysis for model comparison.
- Neurotypical (NT) individuals and individuals with autism spectrum disorder (ASD) make different judgments of social traits from others' faces; they also exhibit different social emotional responses in social interactions. A common hypothesis is that the differences in face perception in ASD compared with NT are related to distinct social behaviors. To test this hypothesis, we combined a face trait judgment task with a novel interpersonal transgression task that induces and measures social emotions and behaviors. ASD and neurotypical participants viewed a large set of naturalistic facial stimuli while judging them on a comprehensive set of social traits (e.g., warm, charismatic, critical). They also completed an interpersonal transgression task where their responsibility in causing an unpleasant outcome to a social partner was manipulated. The purpose of the latter task was to measure participants' emotional (e.g., guilt) and behavioral (e.g., compensation) responses to interpersonal transgression. We found that, compared with neurotypical participants, ASD participants' self-reported guilt and compensation tendency were less sensitive to our responsibility manipulation. Importantly, ASD participants and neurotypical participants showed distinct associations between self-reported guilt and judgments of criticalness from others' faces. These findings reveal a novel link between perception of social traits and social emotional responses in ASD.
- Face perception is a fundamental aspect of human social interaction, yet most research on this topic has focused on single modalities and specific aspects of face perception. Here, we present a comprehensive multimodal dataset for examining facial emotion perception and judgment. This dataset includes EEG data from 97 unique neurotypical participants across 8 experiments, fMRI data from 19 neurotypical participants, single-neuron data from 16 neurosurgical patients (22 sessions), eye tracking data from 24 neurotypical participants, behavioral and eye tracking data from 18 participants with ASD and 15 matched controls, and behavioral data from 3 rare patients with focal bilateral amygdala lesions. Notably, participants from all modalities performed the same task. Overall, this multimodal dataset provides a comprehensive exploration of facial emotion perception, emphasizing the importance of integrating multiple modalities to gain a holistic understanding of this complex cognitive process. This dataset serves as a key missing link between human neuroimaging and neurophysiology literature, and facilitates the study of neuropsychiatric populations.
- Stationarity perception refers to the ability to accurately perceive the surrounding visual environment as world-fixed during self-motion. Perception of stationarity depends on mechanisms that evaluate the congruence between retinal/oculomotor signals and head movement signals. In a series of psychophysical experiments, we systematically varied the congruence between retinal/oculomotor and head movement signals to find the range of visual gains that is compatible with perception of a stationary environment. On each trial, human subjects wearing a head-mounted display execute a yaw head movement and report whether the visual gain was perceived to be too slow or too fast. A psychometric fit to the data across trials reveals the visual gain most compatible with stationarity (a measure of accuracy) and the sensitivity to visual gain manipulation (a measure of precision). Across experiments, we varied 1) the spatial frequency of the visual stimulus, 2) the retinal location of the visual stimulus (central vs. peripheral), and 3) fixation behavior (scene-fixed vs. head-fixed). Stationarity perception is most precise and accurate during scene-fixed fixation. Effects of spatial frequency and retinal stimulus location become evident during head-fixed fixation, when retinal image motion is increased. Virtual reality sickness, assessed using the Simulator Sickness Questionnaire, covaries with perceptual performance. Decreased accuracy is associated with an increase in the nausea subscore, while decreased precision is associated with an increase in the oculomotor and disorientation subscores.
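For concreteness, a minimal sketch of the psychometric fit described in the last item above, with made-up example data and a cumulative-Gaussian form assumed: the fitted mean gives the visual gain most compatible with stationarity (accuracy), and the fitted spread indexes precision.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def p_too_fast(gain, pse, sigma):
    """Probability of reporting the visual gain as 'too fast'."""
    return norm.cdf((gain - pse) / sigma)

# Hypothetical data: visual gains tested and proportion of 'too fast' reports.
gains = np.array([0.6, 0.8, 1.0, 1.2, 1.4])
p_fast = np.array([0.05, 0.25, 0.55, 0.80, 0.95])

(pse, sigma), _ = curve_fit(p_too_fast, gains, p_fast, p0=[1.0, 0.2])
# pse   -> gain most compatible with a stationary scene (accuracy)
# sigma -> width of the transition; smaller means higher precision
```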