- Award ID(s):
- 1943251
- PAR ID:
- 10213693
- Date Published:
- Journal Name:
IEEE Journal of Selected Topics in Signal Processing
- ISSN:
- 1932-4553
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Humans use different modalities, such as speech, text, images, and videos, to communicate their intent and goals to teammates. For robots to become better assistants, we aim to endow them with the ability to follow instructions and understand tasks specified by their human partners. Most robotic policy learning methods have focused on a single modality of task specification while ignoring the rich cross-modal information. We present MUTEX, a unified approach to policy learning from multimodal task specifications. It trains a transformer-based architecture to facilitate cross-modal reasoning, combining masked modeling and cross-modal matching objectives in a two-stage training procedure. After training, MUTEX can follow a task specification in any of the six learned modalities (video demonstrations, goal images, text goal descriptions, text instructions, speech goal descriptions, and speech instructions) or a combination of them. We systematically evaluate the benefits of MUTEX on a newly designed dataset with 100 tasks in simulation and 50 tasks in the real world, annotated with multiple instances of task specifications in different modalities, and observe improved performance over methods trained specifically for any single modality.
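The MUTEX abstract above describes a transformer that fuses task-specification tokens from several modalities into a single policy. The sketch below is an illustrative, simplified reading of that idea in PyTorch, not the authors' implementation: the modality names, feature dimensions, and the single observation token are assumptions, and the masked-modeling and cross-modal matching training objectives are omitted for brevity.

```python
# Minimal sketch of a MUTEX-style multimodal policy (assumed shapes/names, not the authors' code).
# Each task-specification modality is projected into a shared token space, fused by a
# transformer encoder together with an observation token, and decoded into an action.
import torch
import torch.nn as nn

class MultimodalPolicy(nn.Module):
    def __init__(self, spec_dims, obs_dim=64, d_model=256, action_dim=7):
        super().__init__()
        # One linear projector per specification modality (video, image, text, speech, ...).
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(dim, d_model) for name, dim in spec_dims.items()}
        )
        self.obs_proj = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, obs, specs):
        # specs: dict of modality name -> (batch, feat_dim) features; any subset may be present.
        tokens = [self.obs_proj(obs).unsqueeze(1)]
        tokens += [self.projectors[m](x).unsqueeze(1) for m, x in specs.items()]
        fused = self.encoder(torch.cat(tokens, dim=1))
        return self.action_head(fused[:, 0])  # predict the action from the observation token

# Toy forward pass with hypothetical feature dimensions.
policy = MultimodalPolicy({"text_goal": 512, "goal_image": 768})
obs = torch.randn(2, 64)
specs = {"text_goal": torch.randn(2, 512), "goal_image": torch.randn(2, 768)}
print(policy(obs, specs).shape)  # torch.Size([2, 7])
```

Because every modality is mapped into the same token space, a specification given in one modality, or in several at once, flows through the same encoder; that is the property the cross-modal training objectives are meant to exploit.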
-
Current biotechnologies can simultaneously measure multiple high-dimensional modalities (e.g., RNA, DNA accessibility, and protein) from the same cells. A combination of different analytical tasks (e.g., multi-modal integration and cross-modal analysis) is required to comprehensively understand such data, inferring how gene regulation drives biological diversity and functions. However, current analytical methods are designed to perform a single task, only providing a partial picture of the multi-modal data. Here, we present UnitedNet, an explainable multi-task deep neural network capable of integrating different tasks to analyze single-cell multi-modality data. Applied to various multi-modality datasets (e.g., Patch-seq, multiome ATAC + gene expression, and spatial transcriptomics), UnitedNet demonstrates similar or better accuracy in multi-modal integration and cross-modal prediction compared with state-of-the-art methods. Moreover, by dissecting the trained UnitedNet with the explainable machine learning algorithm, we can directly quantify the relationship between gene expression and other modalities with cell-type specificity. UnitedNet is a comprehensive end-to-end framework that could be broadly applicable to single-cell multi-modality biology. This framework has the potential to facilitate the discovery of cell-type-specific regulation kinetics across transcriptomics and other modalities.
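UnitedNet, as summarized above, couples multi-modal integration with cross-modal prediction in one network. The following PyTorch sketch illustrates that multi-task structure under assumed details (encoder and decoder sizes, a mean-fused latent, and a classification head standing in for the integration task); it is not the published architecture.

```python
# Illustrative sketch of a multi-task multi-modal model in the spirit of UnitedNet.
# Modality-specific encoders map each modality into a shared latent used for integration,
# while cross-modal decoders translate one modality's latent into another modality's values.
import torch
import torch.nn as nn

class MultiTaskMultiModal(nn.Module):
    def __init__(self, dims, latent=32, n_types=10):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {m: nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, latent))
             for m, d in dims.items()}
        )
        self.decoders = nn.ModuleDict({m: nn.Linear(latent, d) for m, d in dims.items()})
        self.classifier = nn.Linear(latent, n_types)  # integration task (e.g., cell-type labels)

    def forward(self, batch):
        latents = {m: self.encoders[m](x) for m, x in batch.items()}
        fused = torch.stack(list(latents.values())).mean(0)      # multi-modal integration
        logits = self.classifier(fused)
        cross = {f"{a}->{b}": self.decoders[b](latents[a])       # cross-modal prediction
                 for a in latents for b in latents if a != b}
        return logits, cross

# Toy usage with hypothetical RNA and protein feature counts.
model = MultiTaskMultiModal({"rna": 2000, "protein": 100})
batch = {"rna": torch.randn(8, 2000), "protein": torch.randn(8, 100)}
logits, cross = model(batch)
print(logits.shape, cross["rna->protein"].shape)  # torch.Size([8, 10]) torch.Size([8, 100])
```

Sharing the encoders between the two heads is what lets the integration and prediction tasks inform each other during training.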
-
Proteins, often represented as multi-modal data of 1D sequences and 2D/3D structures, provide a motivating example for the machine learning and computational biology communities to advance multi-modal representation learning. Protein language models over sequences and geometric deep learning over structures learn excellent single-modality representations for downstream tasks. It is thus desirable to fuse the single-modality models for better representation learning, but it remains an open question how to fuse them effectively into multi-modal representation learning with a modest computational cost yet a significant downstream performance gain. To answer the question, we propose to make use of separately pretrained single-modality models, integrate them in parallel connections, and continuously pretrain them end-to-end under the framework of multimodal contrastive learning. The technical challenge is to construct views for both intra- and inter-modality contrasts while addressing the heterogeneity of various modalities, particularly their various levels of semantic robustness. We address the challenge by using domain knowledge of protein homology to inform the design of positive views, specifically protein classifications into families (based on similarities in sequences) and superfamilies (based on similarities in structures). We also assess the use of such views compared to, together with, and composed with other positive views such as identity and cropping. Extensive experiments on enzyme classification and protein function prediction benchmarks demonstrate the potential of domain-informed view construction and combination in multi-modal contrastive learning.
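The protein abstract above hinges on contrastive learning with positive views defined by homology labels (families and superfamilies). The snippet below is a minimal, label-informed InfoNCE-style loss intended only to illustrate that idea; the family labels, temperature, and embedding dimension are placeholders, and the sequence/structure encoders that would produce the embeddings are not shown.

```python
# Sketch of a homology-informed contrastive loss: proteins sharing a family label
# are treated as positive views of one another; all other proteins in the batch are negatives.
import torch
import torch.nn.functional as F

def homology_contrastive_loss(emb, labels, temperature=0.1):
    """emb: (N, D) embeddings from sequence/structure encoders; labels: (N,) family IDs."""
    emb = F.normalize(emb, dim=1)
    sim = emb @ emb.T / temperature                      # pairwise similarities
    mask = labels.unsqueeze(0) == labels.unsqueeze(1)    # positives share a family label
    mask.fill_diagonal_(False)                           # exclude self-pairs from positives
    exp_sim = torch.exp(sim)
    exp_sim = exp_sim.masked_fill(torch.eye(len(emb), dtype=torch.bool), 0.0)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-8)
    # Average log-likelihood of positives per anchor; anchors without positives are skipped.
    pos_counts = mask.sum(dim=1)
    loss = -(log_prob * mask).sum(dim=1)[pos_counts > 0] / pos_counts[pos_counts > 0]
    return loss.mean()

# Toy usage: six proteins from two hypothetical families.
emb = torch.randn(6, 128)
labels = torch.tensor([0, 0, 0, 1, 1, 1])
print(homology_contrastive_loss(emb, labels))
```

The same loss can be applied within a modality (sequence vs. sequence) or across modalities (sequence vs. structure) simply by choosing which pair of embeddings is passed in.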
-
Mild cognitive impairment is the prodromal stage of Alzheimer’s disease. Its detection has been a critical task for establishing cohort studies and developing therapeutic interventions for Alzheimer’s. Various types of markers have been developed for detection. For example, imaging markers from neuroimaging have shown great sensitivity, but their cost is still prohibitive for large-scale screening of early dementia. Recent advances in digital biomarkers, such as language markers, have provided an accessible and affordable alternative. While imaging markers give anatomical descriptions of the brain, language markers capture the behavioral characteristics of early dementia subjects. Such differences suggest the benefits of auxiliary information from the imaging modality to improve the predictive power of unimodal predictive models based on language markers alone. However, one significant barrier to the joint analysis is that in typical cohorts, only a very limited number of subjects have both imaging and language modalities. To tackle this challenge, in this paper, we develop a novel cross-modal augmentation tool, which leverages auxiliary imaging information to improve the feature space of language markers so that a subject with only language markers can benefit from imaging information through the augmentation. Our experimental results show that the multi-modal predictive model trained with language markers and auxiliary imaging information significantly outperforms unimodal predictive models.
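The cross-modal augmentation idea above can be pictured with a small scikit-learn sketch: learn a mapping from language markers to imaging features on the few paired subjects, then append the imputed imaging features to language-only subjects before fitting the downstream classifier. All data, model choices (ridge regression, logistic regression), and dimensions here are hypothetical stand-ins, not the paper's actual method.

```python
# Hedged sketch of cross-modal feature augmentation under assumed data shapes.
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical cohort: 40 subjects with both modalities, 200 with language markers only.
lang_paired, img_paired = rng.normal(size=(40, 20)), rng.normal(size=(40, 10))
lang_only = rng.normal(size=(200, 20))
labels_only = rng.integers(0, 2, size=200)

# 1) Learn a language -> imaging mapping on the paired subset.
imputer = Ridge(alpha=1.0).fit(lang_paired, img_paired)

# 2) Augment language-only subjects with imputed imaging features.
augmented = np.hstack([lang_only, imputer.predict(lang_only)])

# 3) Train the downstream (e.g., early-dementia detection) classifier on the augmented features.
clf = LogisticRegression(max_iter=1000).fit(augmented, labels_only)
print(augmented.shape, clf.score(augmented, labels_only))
```

The value of such augmentation rests entirely on how well the paired subset supports the language-to-imaging mapping; with random data as above, it only demonstrates the plumbing.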
-
Conceptual design is the foundational stage of a design process, translating ill-defined design problems to low-fidelity design concepts and prototypes. While deep learning approaches are widely applied in later design stages for design automation, we see fewer attempts in conceptual design for three reasons: 1) the data in this stage exhibit multiple modalities: natural language, sketches, and 3D shapes, and these modalities are challenging to represent in deep learning methods; 2) it requires knowledge from a larger source of inspiration instead of focusing on a single design task; and 3) it requires translating designers’ intent and feedback, and hence needs more interaction with designers and/or users. With recent advances in deep learning of cross-modal tasks (DLCMT) and the availability of large cross-modal datasets, we see opportunities to apply these learning methods to the conceptual design of product shapes. In this paper, we review 30 recent journal articles and conference papers across computer graphics, computer vision, and engineering design fields that involve DLCMT of three modalities: natural language, sketches, and 3D shapes. Based on the review, we identify the challenges and opportunities of utilizing DLCMT in 3D shape concept generation, from which we propose a list of research questions pointing to future research directions.