Title: An Efficient Approach to Informative Feature Extraction from Multimodal Data
One primary focus in multimodal feature extraction is to find the representations of individual modalities that are maximally correlated. As a well-known measure of dependence, the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation becomes an appealing objective because of its operational meaning and desirable properties. However, the strict whitening constraints formalized in the HGR maximal correlation limit its application. To address this problem, this paper proposes Soft-HGR, a novel framework to extract informative features from multiple data modalities. Specifically, our framework avoids the “hard” whitening constraints while preserving the same feature geometry as in the HGR maximal correlation. The objective of Soft-HGR is straightforward, involving only two inner products, which guarantees efficiency and stability in optimization. We further generalize the framework to handle more than two modalities and missing modalities. When labels are partially available, we enhance the discriminative power of the feature representations by making a semi-supervised adaptation. Empirical evaluation indicates that our approach learns more informative feature mappings and is more efficient to optimize.
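The abstract does not state the objective itself. As a rough illustration of what "two inner products" could look like, the sketch below assumes the objective has the form E[f(X)ᵀg(Y)] − ½·tr(cov(f(X))·cov(g(Y))) with zero-mean features; that form, the function name `soft_hgr_objective`, and the NumPy implementation are assumptions made here for illustration, not details taken from this record.

```python
import numpy as np


def soft_hgr_objective(f, g):
    """Empirical value of an assumed Soft-HGR-style objective.

    f, g: arrays of shape (n_samples, k) holding the features of two
    modalities for the same n_samples paired observations.
    Assumed form: E[f^T g] - 0.5 * tr(cov(f) cov(g)), features centered.
    """
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    n = f.shape[0]

    # Center each feature dimension (the assumed zero-mean constraint).
    f0 = f - f.mean(axis=0, keepdims=True)
    g0 = g - g.mean(axis=0, keepdims=True)

    # First inner product: between the two feature maps.
    correlation = np.sum(f0 * g0) / (n - 1)

    # Second inner product: between the two k x k covariance matrices.
    # For symmetric matrices, sum(A * B) equals tr(A @ B).
    cov_f = f0.T @ f0 / (n - 1)
    cov_g = g0.T @ g0 / (n - 1)
    penalty = 0.5 * np.sum(cov_f * cov_g)

    return correlation - penalty


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(256, 4))
    noise = rng.normal(size=(256, 4))
    # Features that agree should score higher than unrelated ones.
    print(soft_hgr_objective(x, x), soft_hgr_objective(x, noise))
```

In this assumed form there is no explicit whitening step: the covariance term acts as a soft penalty instead of a hard constraint, so the value can be computed from a mini-batch with two matrix products.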
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Page Range or eLocation-ID:
5281 to 5288
Sponsoring Org:
National Science Foundation
More Like this
  1. Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed from the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input and, as a result, the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence to sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn robust joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information from all modalities. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for final sentiment prediction. This ensures that our model remains robust to perturbations or missing information in the other modalities. We train our model with a coupled translation-prediction objective and it achieves new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.
  2. We show that bringing intermediate layers' representations of two augmented versions of an image closer together in self-supervised learning helps to improve the momentum contrastive (MoCo) method. To this end, in addition to the contrastive loss, we minimize the mean squared error between the intermediate layer representations or make their cross-correlation matrix closer to an identity matrix. Both loss objectives either outperform standard MoCo, or achieve similar performances on three diverse medical imaging datasets: NIH-Chest Xrays, Breast Cancer Histopathology, and Diabetic Retinopathy. The gains of the improved MoCo are especially large in a low-labeled data regime (e.g. 1% labeled data) with an average gain of 5% across three datasets. We analyze the models trained using our novel approach via feature similarity analysis and layer-wise probing. Our analysis reveals that models trained via our approach have higher feature reuse compared to a standard MoCo and learn informative features earlier in the network. Finally, by comparing the output probability distribution of models fine-tuned on small versus large labeled data, we conclude that our proposed method of pre-training leads to lower Kolmogorov-Smirnov distance, as compared to a standard MoCo. This provides additional evidence that our proposed method learns more informative features in the pre-training phase which could be leveraged in a low-labeled data regime.
  3. Background and Objectives: Prediction of decline to dementia using objective biomarkers in high-risk patients with amnestic mild cognitive impairment (aMCI) has immense utility. Our objective was to use multimodal MRI to (1) determine whether accurate and precise prediction of dementia conversion could be achieved using baseline data alone, and (2) generate a map of the brain regions implicated in longitudinal decline to dementia. Methods: Participants meeting criteria for aMCI at baseline (N = 55) were classified at follow-up as remaining stable/improved in their diagnosis (N = 41) or declined to dementia (N = 14). Baseline T1 structural MRI and resting-state fMRI (rsfMRI) were combined in a semi-supervised support vector machine (SVM) that separated stable participants from those who declined at follow-up with maximal margin. Cross-validated model performance metrics and MRI feature weights were calculated to include the strength of each brain voxel in its ability to distinguish the two groups. Results: Total model accuracy for predicting diagnostic change at follow-up was 92.7% using baseline T1 imaging alone, 83.5% using rsfMRI alone, and 94.5% when combining T1 and rsfMRI modalities. Feature weights that survived the p < 0.01 threshold for separation of the two groups revealed the strongest margin in the combined structural and functional regions underlying the medial temporal lobes in the limbic system. Discussion: An MRI-driven SVM model demonstrates accurate and precise prediction of later dementia conversion in aMCI patients. The multi-modal regions driving this prediction were the strongest in the medial temporal regions of the limbic system, consistent with literature on the progression of Alzheimer’s disease.
  4. The advent of deep learning algorithms for mobile devices and sensors has led to a dramatic expansion in the availability and number of systems trained on a wide range of machine learning tasks, creating a host of opportunities and challenges in the realm of transfer learning. Currently, most transfer learning methods require some kind of control over the systems learned, either by enforcing constraints during the source training, or through the use of a joint optimization objective between tasks that requires all data be co-located for training. However, for practical, privacy, or other reasons, in a variety of applications we may have no control over the individual source task training, nor access to source training samples. Instead we only have access to features pre-trained on such data as the output of “black-boxes.” For such scenarios, we consider the multi-source learning problem of training a classifier using an ensemble of pre-trained neural networks for a set of classes that have not been observed by any of the source networks, and for which we have very few training samples. We show that by using these distributed networks as feature extractors, we can train an effective classifier in a computationally-efficient manner using tools from (nonlinear) maximal correlation analysis. In particular, we develop a method we refer to as maximal correlation weighting (MCW) to build the required target classifier from an appropriate weighting of the feature functions from the source networks. We illustrate the effectiveness of the resulting classifier on datasets derived from the CIFAR-100, Stanford Dogs, and Tiny ImageNet datasets, and, in addition, use the methodology to characterize the relative value of different source tasks in learning a target task.
  5. Modern graph or network datasets often contain rich structure that goes beyond simple pairwise connections between nodes. This calls for complex representations that can capture, for instance, edges of different types as well as so-called “higher-order interactions” that involve more than two nodes at a time. However, we have fewer rigorous methods that can provide insight from such representations. Here, we develop a computational framework for the problem of clustering hypergraphs with categorical edge labels (or different interaction types), where clusters correspond to groups of nodes that frequently participate in the same type of interaction. Our methodology is based on a combinatorial objective function that is related to correlation clustering on graphs but enables the design of much more efficient algorithms that also seamlessly generalize to hypergraphs. When there are only two label types, our objective can be optimized in polynomial time, using an algorithm based on minimum cuts. Minimizing our objective becomes NP-hard with more than two label types, but we develop fast approximation algorithms based on linear programming relaxations that have theoretical cluster quality guarantees. We demonstrate the efficacy of our algorithms and the scope of the model through problems in edge-label community detection, clustering with temporal data, and exploratory data analysis.