skip to main content


Title: JECL: Joint Embedding and Cluster Learning for Image-Text Pairs
We propose JECL, a method for clustering image-caption pairs by training parallel encoders with regularized clustering and alignment objectives, simultaneously learning both representations and cluster assignments. These image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce, but free-text descriptions are common. JECL trains by minimizing the Kullback-Leibler divergence between the distribution of the images and text to that of a combined joint target distribution and optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to JECL to prevent trivial solutions. Experiments show that JECL outperforms both single-view and multi-view methods on large benchmark image-caption datasets, and is remarkably robust to missing captions and varying data sizes.  more » « less
Award ID(s):
1934405 1740996
PAR ID:
10293147
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2020 25th International Conference on Pattern Recognition (ICPR)
Page Range / eLocation ID:
8344 to 8351
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Training a deep learning model with a large annotated dataset is still a dominant paradigm in automatic whole slide images (WSIs) processing for digital pathology. However, obtaining manual annotations is a labor-intensive task, and an error-prone to inter and intra-observer variability. In this study, we offer an online deep learning-based clustering workflow for annotating and analysis of different types of tissues from histopathology images. Inspired by learning and optimal transport theory, our proposed model consists of two stages. In the first stage, our model learns tissue-specific discriminative representations by contrasting the features in the latent space at two levels, the instance- and the clusterlevel. This is done by maximizing the similarities of the projections of positive pairs (views of the same image) while minimizing those of negative ones (views of the rest of the images). In the second stage, our framework extends the standard cross-entropy minimization to an optimal transport problem and solves it using the Sinkhorn-Knopp algorithm to produce the cluster assignments. Moreover, our proposed method enforces consistency between the produced assignments obtained from views of the same image. Our framework was evaluated on three common histopathological datasets: NCT-CRC, LC2500, and Kather STAD. Experiments show that our proposed framework can identify different tissues in annotation-free conditions with competitive results. It achieved an accuracy of 0.9364 in human lung patched WSIs and 0.8464 in images of human colorectal tissues outperforming state of the arts contrastive-based methods. 
    more » « less
  2. The information bottleneck (IB) approach to clustering takes a joint distribution [Formula: see text] and maps the data [Formula: see text] to cluster labels [Formula: see text], which retain maximal information about [Formula: see text] (Tishby, Pereira, & Bialek, 1999 ). This objective results in an algorithm that clusters data points based on the similarity of their conditional distributions [Formula: see text]. This is in contrast to classic geometric clustering algorithms such as [Formula: see text]-means and gaussian mixture models (GMMs), which take a set of observed data points [Formula: see text] and cluster them based on their geometric (typically Euclidean) distance from one another. Here, we show how to use the deterministic information bottleneck (DIB) (Strouse & Schwab, 2017 ), a variant of IB, to perform geometric clustering by choosing cluster labels that preserve information about data point location on a smoothed data set. We also introduce a novel intuitive method to choose the number of clusters via kinks in the information curve. We apply this approach to a variety of simple clustering problems, showing that DIB with our model selection procedure recovers the generative cluster labels. We also show that, in particular limits of our model parameters, clustering with DIB and IB is equivalent to [Formula: see text]-means and EM fitting of a GMM with hard and soft assignments, respectively. Thus, clustering with (D)IB generalizes and provides an information-theoretic perspective on these classic algorithms. 
    more » « less
  3. Generative artificial intelligence has made significant strides, producing text indistinguishable from human prose and remarkably photorealistic images. Automatically measuring how close the generated data distribution is to the target distribution is central to diagnosing existing models and developing better ones. We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images. These scores are statistical summaries of divergence frontiers capturing two types of errors in generative modeling. We explore three approaches to statistically estimate these scores: vector quantization, non-parametric estimation, and classifier-based estimation. We provide statistical bounds for the vector quantization approach. Empirically, we find that the proposed scores paired with a range of f -divergences and statistical estimation methods can quantify the gaps between the distributions of human-written text and those of modern neural language models by correlating with human judgments and identifying known properties of the generated texts. We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics. In conclusion, we present practical recommendations for using MAUVE effectively with language and image modalities. 
    more » « less
  4. null (Ed.)
    Recent work on explainable clustering allows describing clusters when the features are interpretable. However, much modern machine learning focuses on complex data such as images, text, and graphs where deep learning is used but the raw features of data are not interpretable. This paper explores a novel setting for performing clustering on complex data while simultaneously generating explanations using interpretable tags. We propose deep descriptive clustering that performs sub-symbolic representation learning on complex data while generating explanations based on symbolic data. We form good clusters by maximizing the mutual information between empirical distribution on the inputs and the induced clustering labels for clustering objectives. We generate explanations by solving an integer linear programming that generates concise and orthogonal descriptions for each cluster. Finally, we allow the explanation to inform better clustering by proposing a novel pairwise loss with self-generated constraints to maximize the clustering and explanation module's consistency. Experimental results on public data demonstrate that our model outperforms competitive baselines in clustering performance while offering high-quality cluster-level explanations. 
    more » « less
  5. People are able to describe images using thousands of languages, but languages share only one visual world. The aim of this work is to use the learned intermediate visual representations from a deep convolutional neural network to transfer information across languages for which paired data is not available in any form. Our work proposes using backpropagation-based decoding coupled with transformer-based multilingual-multimodal language models in order to obtain translations between any languages used during training. We particularly show the capabilities of this approach in the translation of German-Japanese and Japanese-German sentence pairs, given a training data of images freely associated with text in English, German, and Japanese but for which no single image contains annotations in both Japanese and German. Moreover, we demonstrate that our approach is also generally useful in the multilingual image captioning task when sentences in a second language are available at test time. The results of our method also compare favorably in the Multi30k dataset against recently proposed methods that are also aiming to leverage images as an intermediate source of translations. 
    more » « less