This content will become publicly available on January 1, 2026

Title: Unsupervised physics-informed disentanglement of multimodal data
Abstract. We introduce physics-informed multimodal autoencoders (PIMA), a variational inference framework for discovering shared information in multimodal datasets. Individual modalities are embedded into a shared latent space and fused through a product-of-experts formulation, enabling a Gaussian mixture prior to identify shared features. Sampling from clusters allows cross-modal generative modeling, with a mixture-of-experts decoder that imposes inductive biases from prior scientific knowledge and thereby imparts structured disentanglement of the latent space. This approach enables cross-modal inference and the discovery of features in high-dimensional heterogeneous datasets. Consequently, this approach provides a means to discover fingerprints in multimodal scientific datasets and to avoid traditional bottlenecks related to high-fidelity measurement and characterization of scientific …
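As a concrete illustration of the product-of-experts fusion step described in the abstract, the sketch below fuses per-modality diagonal-Gaussian experts in a shared latent space. This is a minimal sketch, not the authors' released PIMA code; the shapes, the unit-Gaussian prior expert, and all names are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of product-of-experts fusion:
# each modality's encoder yields a diagonal Gaussian over the shared latent
# space, and the experts are fused by multiplying their densities.
import torch

def product_of_experts(mus, logvars, eps=1e-8):
    """Fuse per-modality Gaussian experts N(mu_i, var_i) into one Gaussian.

    mus, logvars: lists of tensors of shape (batch, latent_dim).
    Precisions add; the fused mean is the precision-weighted mean.
    """
    # Include a standard-normal prior expert so the fusion is defined even
    # when some modalities are missing (an assumption of this sketch).
    prior_mu = torch.zeros_like(mus[0])
    prior_logvar = torch.zeros_like(logvars[0])
    all_mu = torch.stack([prior_mu] + list(mus))          # (K+1, B, D)
    all_logvar = torch.stack([prior_logvar] + list(logvars))

    precision = 1.0 / (all_logvar.exp() + eps)            # 1 / var_i
    fused_var = 1.0 / precision.sum(dim=0)                # (B, D)
    fused_mu = fused_var * (all_mu * precision).sum(dim=0)
    return fused_mu, fused_var.log()

# Example: fuse two modalities for a batch of 4 samples in an 8-dim latent space.
mu_a, lv_a = torch.randn(4, 8), torch.zeros(4, 8)
mu_b, lv_b = torch.randn(4, 8), torch.zeros(4, 8)
mu, logvar = product_of_experts([mu_a, mu_b], [lv_a, lv_b])
z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)      # reparameterized sample
```

Because precisions add, the fused posterior tightens as more modalities contribute evidence, which is what lets a Gaussian mixture prior over the fused code pick out features shared across modalities.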
Award ID(s): 1934367
PAR ID: 10558871
Author(s) / Creator(s):
Publisher / Repository: American Institute of Mathematical Sciences
Date Published:
Journal Name: Foundations of Data Science
Volume: 7
Issue: 1
ISSN: 2639-8001
Page Range / eLocation ID: 418 to 445
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. With the increasing availability of large‐scale multimodal neuroimaging datasets, it is necessary to develop data fusion methods which can extract cross‐modal features. A general framework, multidataset independent subspace analysis (MISA), has been developed to encompass multiple blind source separation approaches and identify linked cross‐modal sources in multiple datasets. In this work, we utilized the multimodal independent vector analysis (MMIVA) model in MISA to directly identify meaningful linked features across three neuroimaging modalities—structural magnetic resonance imaging (MRI), resting state functional MRI and diffusion MRI—in two large independent datasets, one comprising control subjects and the other including patients with schizophrenia. Results show several linked subject profiles (sources) that capture age‐associated decline, schizophrenia‐related biomarkers, sex effects, and cognitive performance. For sources associated with age, both shared and modality‐specific brain‐age deltas were evaluated for association with non‐imaging variables. In addition, each set of linked sources reveals a corresponding set of cross‐modal spatial patterns that can be studied jointly. We demonstrate that the MMIVA fusion model can identify linked sources across multiple modalities, and that at least one set of linked, age‐related sources replicates across two independent and separately analyzed datasets. The same set also presented age‐adjusted group differences, with schizophrenia patients indicating lower multimodal source levels. Linked sets associated with sex and cognition are also reported for the UK Biobank dataset.
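The sketch below is a deliberately simplified stand-in, not the MISA/MMIVA implementation: it decomposes each modality separately with FastICA and then matches components whose subject-level profiles correlate across modalities, which is the intuition behind "linked" cross-modal sources (MMIVA estimates this linkage jointly). The data, component counts, and threshold are synthetic assumptions.

```python
# Simplified illustration (assumed, not MISA/MMIVA) of finding linked sources:
# decompose each modality independently, then match components whose subject
# profiles correlate across modalities.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_subjects, n_components = 200, 5

# Synthetic subject-by-feature matrices for two "modalities" sharing one profile.
shared = rng.normal(size=(n_subjects, 1))
X_smri = np.hstack([shared, rng.normal(size=(n_subjects, 49))])
X_fmri = np.hstack([shared, rng.normal(size=(n_subjects, 99))])

# Independent decomposition per modality; rows of the source matrix are
# subject-level profiles for each component.
profiles = []
for X in (X_smri, X_fmri):
    ica = FastICA(n_components=n_components, random_state=0)
    profiles.append(ica.fit_transform(X))     # (subjects, components)

# Cross-modal correlation of subject profiles; strongly correlated pairs are
# candidate "linked" sources.
corr = np.corrcoef(profiles[0].T, profiles[1].T)[:n_components, n_components:]
linked = np.argwhere(np.abs(corr) > 0.5)
print("candidate linked (sMRI component, fMRI component) pairs:", linked)
```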
  2. In multimodal machine learning, effectively addressing the missing modality scenario is crucial for improving performance in downstream tasks such as in medical contexts where data may be incomplete. Although some attempts have been made to retrieve embeddings for missing modalities, two main bottlenecks remain: (1) the need to consider both intra- and inter-modal context, and (2) the cost of embedding selection, where embeddings often lack modality-specific knowledge. To address this, the authors propose MoE-Retriever, a novel framework inspired by Sparse Mixture of Experts (SMoE). MoE-Retriever defines a supporting group for intra-modal inputs—samples that commonly lack the target modality—by selecting samples with complementary modality combinations for the target modality. This group is integrated with inter-modal inputs from different modalities of the same sample, establishing both intra- and inter-modal contexts. These inputs are processed by Multi-Head Attention to generate context-aware embeddings, which serve as inputs to the SMoE Router that automatically selects the most relevant experts (embedding candidates). Comprehensive experiments on both medical and general multimodal datasets demonstrate the robustness and generalizability of MoE-Retriever, marking a significant step forward in embedding retrieval methods for incomplete multimodal data. 
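The following is a minimal sketch of the top-k sparsely-gated Mixture-of-Experts routing that MoE-Retriever builds on, not the MoE-Retriever code itself: a learned router scores the experts for each input and only the k highest-scoring experts are evaluated and mixed. Dimensions, expert count, and k are illustrative assumptions.

```python
# Minimal sketch (assumed) of top-k sparsely-gated Mixture-of-Experts routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseMoE(nn.Module):
    def __init__(self, dim=64, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, dim)
        logits = self.router(x)                # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKSparseMoE()
context = torch.randn(10, 64)   # e.g. context-aware embeddings from attention
retrieved = moe(context)        # (10, 64): mixture of the selected experts' outputs
```

In the described framework, the context-aware embeddings produced by multi-head attention over intra- and inter-modal inputs would play the role of `context`, and the experts would correspond to embedding candidates for the missing modality.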
  3. Human state recognition is a critical topic with pervasive and important applications in human–machine systems. Multimodal fusion, which entails integrating metrics from various data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal-based models have shown promising results, they often fall short in fully leveraging sophisticated fusion strategies essential for modeling adequate cross-modal dependencies in the fusion representation. Instead, they rely on costly and inconsistent feature crafting and alignment. To address this limitation, we propose an end-to-end multimodal transformer framework for multimodal human state recognition called Husformer. Specifically, we propose using cross-modal transformers, which inspire one modality to reinforce itself through directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets [multimodal dataset for objective cognitive workload assessment on simultaneous tasks (MOCAS) and CogLoad] demonstrate that in the recognition of the human state, our Husformer outperforms both state-of-the-art multimodal baselines and the use of a single modality by a large margin, especially when dealing with raw multimodal features. We also conducted an ablation study to show the benefits of each component in Husformer. Experimental details and source code are available at https://github.com/SMARTlab-Purdue/Husformer. 
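Below is a minimal sketch of the cross-modal attention idea behind Husformer, not the released implementation at the linked repository: queries come from the target modality while keys and values come from another modality, so the target stream is reinforced by directly attending to relevant content in the other stream. Modality names, sequence lengths, and sizes are illustrative assumptions.

```python
# Minimal sketch (assumed) of a cross-modal transformer block: one modality
# attends to another, then a feed-forward layer refines the fused stream.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, target, source):
        # Queries from the target modality; keys/values from the source modality.
        attended, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + attended)      # residual: target reinforced by source
        return self.norm2(x + self.ff(x))

block = CrossModalBlock()
eeg = torch.randn(2, 50, 64)    # (batch, sequence, features) for one modality
gsr = torch.randn(2, 30, 64)    # a second modality with a different sequence length
fused = block(eeg, gsr)         # EEG stream attending to the GSR stream
```

A subsequent self-attention transformer over the fused representation, as the abstract describes, would then prioritize contextual information before the final recognition head.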
  4. Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of efficiently improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo, which incorporates Co-upcycled Top-K sparsely-gated Mixture-of-Experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with negligible additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage, with auxiliary losses to ensure a balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks within each model size group, all while training exclusively on open-sourced datasets.
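The sketch below illustrates the "co-upcycled" expert initialization described for CuMo; it is not the authors' code. Each expert in the sparse MoE block starts as a copy of the already pre-trained dense MLP, and a fresh router is added for sparse gating. Layer sizes and the expert count are assumptions.

```python
# Minimal sketch (assumed) of upcycling a pre-trained dense MLP into MoE experts.
import copy
import torch
import torch.nn as nn

dim, num_experts = 64, 4

# Stand-in for the MLP connector after its dense pre-training stage.
pretrained_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

# Upcycling: every expert begins as an exact copy of the pre-trained block,
# and a newly initialized router (gate) handles top-k sparse selection.
experts = nn.ModuleList(copy.deepcopy(pretrained_mlp) for _ in range(num_experts))
router = nn.Linear(dim, num_experts)

# Sanity check: each expert reproduces the dense MLP exactly at initialization,
# so training starts from the dense model's solution rather than from scratch.
x = torch.randn(8, dim)
assert torch.allclose(experts[0](x), pretrained_mlp(x))
```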
  5. Learning multimodal representations is a fundamentally complex research problem due to the presence of multiple heterogeneous sources of information. Although the presence of multiple modalities provides additional valuable information, there are two key challenges to address when learning from multimodal data: 1) models must learn the complex intra-modal and cross-modal interactions for prediction and 2) models must be robust to unexpected missing or noisy modalities during testing. In this paper, we propose to optimize for a joint generative-discriminative objective across multimodal data and labels. We introduce a model that factorizes representations into two sets of independent factors: multimodal discriminative and modality-specific generative factors. Multimodal discriminative factors are shared across all modalities and contain joint multimodal features required for discriminative tasks such as sentiment prediction. Modality-specific generative factors are unique for each modality and contain the information required for generating data. Experimental results show that our model is able to learn meaningful multimodal representations that achieve state-of-the-art or competitive performance on six multimodal datasets. Our model demonstrates flexible generative capabilities by conditioning on independent factors and can reconstruct missing modalities without significantly impacting performance. Lastly, we interpret our factorized representations to understand the interactions that influence multimodal learning. 
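A minimal sketch of the factorization described above follows; it is not the paper's model. Each modality is encoded into a shared multimodal discriminative factor and a modality-specific generative factor; the label is predicted from the shared factor, and each modality is reconstructed from the concatenation of both. The modality names, dimensions, and fusion-by-averaging choice are illustrative assumptions.

```python
# Minimal sketch (assumed) of shared discriminative + modality-specific
# generative factors for multimodal data.
import torch
import torch.nn as nn

class FactorizedMultimodal(nn.Module):
    def __init__(self, dims, shared_dim=32, private_dim=16):
        super().__init__()
        self.shared_enc = nn.ModuleDict({m: nn.Linear(d, shared_dim) for m, d in dims.items()})
        self.private_enc = nn.ModuleDict({m: nn.Linear(d, private_dim) for m, d in dims.items()})
        self.decoders = nn.ModuleDict(
            {m: nn.Linear(shared_dim + private_dim, d) for m, d in dims.items()})
        self.classifier = nn.Linear(shared_dim, 1)   # e.g. a sentiment score

    def forward(self, inputs):
        # Multimodal discriminative factor: average of the per-modality shared codes.
        shared = torch.stack([self.shared_enc[m](x) for m, x in inputs.items()]).mean(dim=0)
        # Modality-specific generative factors drive the per-modality reconstructions.
        recon = {m: self.decoders[m](torch.cat([shared, self.private_enc[m](x)], dim=-1))
                 for m, x in inputs.items()}
        return self.classifier(shared), recon

model = FactorizedMultimodal({"text": 300, "audio": 74})
batch = {"text": torch.randn(4, 300), "audio": torch.randn(4, 74)}
prediction, reconstructions = model(batch)
```

Because the shared factor is pooled across whichever modalities are present, a missing modality only removes its private factor and its term in the pooled average, which is why this kind of factorization degrades gracefully under missing inputs.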