Title: Contrastive Learning with Data Misalignment: Feature Purity, Training Dynamics and Theoretical Generalization Guarantees
Contrastive learning is a powerful framework for learning discriminative representations from image-text pairs. Despite its success, its theoretical foundations remain underexplored, especially when the image-text pairs exhibit misalignment. This paper provides the first theoretical analysis of contrastive learning under data misalignment, proving through an analysis of the training dynamics how ground-truth modality-paired features are amplified while spurious features are suppressed. Specifically, we study two nonlinear encoders trained jointly with a contrastive loss and demonstrate that noisy (or misaligned) data pairs result in mixed representations and degrade the model's generalization ability. In contrast, recaptioning and filtering improve data alignment, which in turn purifies the features learned by neurons and subsequently enhances generalization. Our analysis identifies feature purity as a key factor in the success of contrastive learning and offers insights into how data quality and training procedures affect representation learning and downstream generalization. The theoretical insights are supported by experiments on standard benchmarks.
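As a concrete illustration of the setting described in the abstract, the following is a minimal sketch of two small nonlinear encoders trained jointly with a symmetric, InfoNCE-style contrastive loss on paired image-text features. It is an assumed PyTorch formulation for illustration only; the encoder architecture, loss form, and names (Encoder, contrastive_loss) are not taken from the paper.

```python
# Minimal illustrative sketch (assumed details, not the paper's exact model):
# two nonlinear encoders trained jointly with a symmetric contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Small nonlinear encoder mapping one modality into a shared space."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the similarity matrix: matched (aligned)
    pairs sit on the diagonal; misaligned pairs push probability mass onto
    off-diagonal entries and mix features in the learned representations."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Joint training step on a stand-in batch (dimensions chosen arbitrarily).
img_enc, txt_enc = Encoder(512, 256, 128), Encoder(300, 256, 128)
opt = torch.optim.Adam(list(img_enc.parameters()) + list(txt_enc.parameters()), lr=1e-3)
images, texts = torch.randn(64, 512), torch.randn(64, 300)
loss = contrastive_loss(img_enc(images), txt_enc(texts))
opt.zero_grad(); loss.backward(); opt.step()
```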
Award ID(s):
2349879
PAR ID:
10663177
Author(s) / Creator(s):
Publisher / Repository:
The Thirty-Ninth Annual Conference on Neural Information Processing Systems
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language–Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms of multimodal learning remain poorly understood. Notably, these models often exhibit a modality gap, where the different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of the modality gap by characterizing the gradient-flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in causing and perpetuating the modality gap during training. Furthermore, our theoretical insights are validated through experiments on practical CLIP models. These findings provide principled guidance for mitigating the modality gap, including strategies such as appropriate temperature scheduling and modality swapping. Additionally, we demonstrate that closing the modality gap leads to improved performance on tasks such as image-text retrieval. (An illustrative sketch of a contrastive loss with a learnable temperature, and of measuring the modality gap, appears after this list.)
  2. Contrastive learning has served as a powerful framework in the early development of vision–language models (VLMs), demonstrating remarkable effectiveness in learning generalizable representations and establishing itself as the foundation for many state-of-the-art systems. However, despite these advances, its theoretical understanding remains limited, particularly under imbalanced data distributions that are prevalent in real-world settings. Such imbalance can degrade representation quality and induce biased model behavior, yet a rigorous characterization of these effects is still lacking. In this work, we develop a theoretical framework to analyze the training dynamics of contrastive learning with Transformer-based encoders under imbalanced data. Our results reveal that neuron weights evolve differently across three stages of training, with distinct dynamics for majority features, minority features, and the noise. We further show that minority features diminish neurons’ representational capacity, increase the need for more complex architectures, and impair the separation of ground-truth features from noise. These findings offer new theoretical insights into how data imbalance shapes learning in contrastive frameworks and serve as an early step towards principled modifications for developing more robust and unbiased representations. 
  3. Unsupervised learning has recently made exceptional progress because of the development of more effective contrastive learning methods. However, CNNs are prone to relying on low-level features that humans deem non-semantic, and this dependency has been conjectured to induce a lack of robustness to image perturbations or domain shift. In this paper, we show that by generating carefully designed negative samples, contrastive learning can learn more robust representations with less dependence on such features. Contrastive learning utilizes positive pairs that preserve semantic information while perturbing superficial features in the training images. Similarly, we propose to generate negative samples in a reversed way, where only the superficial, rather than the semantic, features are preserved. We develop two methods, texture-based and patch-based augmentations, to generate negative samples. These samples lead to better generalization, especially under out-of-domain settings. We also analyze our method and the generated texture-based samples, showing that texture features are indispensable for classifying particular ImageNet classes, especially finer-grained ones. We also show that the model's relative bias toward texture versus shape features manifests differently under different test settings. (An illustrative patch-shuffling sketch for generating such negatives appears after this list.)
  4. While contrastive approaches to self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing the distance between views from different data points (negative pairs), recent non-contrastive SSL methods (e.g., BYOL and SimSiam) show remarkable performance without negative pairs, using an extra learnable predictor and a stop-gradient operation. A fundamental question arises: why do they not collapse into trivial representations? In this paper, we answer this question via a simple theoretical study and propose a novel approach that directly sets the linear predictor based on the statistics of its inputs, rather than training it with gradient updates. On ImageNet, it performs comparably with more complex two-layer nonlinear predictors that employ BatchNorm, and outperforms a linear predictor by 2.5 in 300-epoch training (and by 5 in 60-epoch training). The approach is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, such as predictor networks, stop-gradients, exponential moving averages, and weight decay, all come into play. Our simple theory recapitulates the results of real-world ablation studies on both STL-10 and ImageNet. Code is released. (An illustrative sketch of setting the predictor from input statistics appears after this list.)
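Entry 1 above hinges on a learnable temperature in a CLIP-style contrastive loss and on the modality gap between image and text embeddings. The sketch below is one hedged illustration of both: it assumes a symmetric cross-entropy loss with a learnable log-temperature and measures the gap as the distance between the centroids of the normalized embeddings. The names (clip_loss_with_temperature, modality_gap) and the centroid-distance definition are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (assumed details): CLIP-style loss with a learnable
# temperature, plus a simple centroid-distance measure of the modality gap.
import torch
import torch.nn.functional as F

log_temp = torch.nn.Parameter(torch.tensor(0.0))  # learnable log-temperature
# (to actually learn it, pass log_temp to the optimizer alongside the encoders)

def clip_loss_with_temperature(img_emb, txt_emb):
    """Symmetric contrastive loss; the temperature is optimized jointly."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * log_temp.exp()
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def modality_gap(img_emb, txt_emb):
    """Distance between the centroids of the two normalized embedding clouds."""
    mu_img = F.normalize(img_emb, dim=-1).mean(dim=0)
    mu_txt = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (mu_img - mu_txt).norm().item()

# Stand-in batch: compute the loss and the gap before any training.
img_emb, txt_emb = torch.randn(64, 128), torch.randn(64, 128)
loss = clip_loss_with_temperature(img_emb, txt_emb)
gap = modality_gap(img_emb, txt_emb)
```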
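Entry 3 builds negative samples that keep superficial features while discarding semantic content. One simple way to approximate the patch-based variant is to shuffle non-overlapping patches so local texture statistics survive but global shape does not; the sketch below follows that reading. The patch size and the function name patch_shuffle are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative patch-shuffling augmentation (assumed details): keeps local
# texture statistics while destroying global semantic structure, so the
# result can serve as a "hard" negative in contrastive training.
import torch

def patch_shuffle(img, patch_size=16):
    """img: (C, H, W) tensor with H and W divisible by patch_size."""
    c, h, w = img.shape
    ph, pw = h // patch_size, w // patch_size
    # Split into non-overlapping patches: (ph*pw, C, patch_size, patch_size).
    patches = (img.reshape(c, ph, patch_size, pw, patch_size)
                  .permute(1, 3, 0, 2, 4)
                  .reshape(ph * pw, c, patch_size, patch_size))
    patches = patches[torch.randperm(ph * pw)]  # random spatial order
    # Reassemble the shuffled patches into an image of the original size.
    return (patches.reshape(ph, pw, c, patch_size, patch_size)
                   .permute(2, 0, 3, 1, 4)
                   .reshape(c, h, w))

# Usage: treat (view, patch_shuffle(view)) as a negative pair in the loss.
negative = patch_shuffle(torch.rand(3, 224, 224))
```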
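Entry 4 sets the linear predictor of a non-contrastive SSL model directly from the statistics of its inputs instead of training it by gradient descent. A hedged reading of that idea is to maintain a moving-average correlation matrix of the online representations and set the predictor to a regularized matrix square root of it, as sketched below; the update rule, the exponent, and names such as update_predictor are assumptions rather than the paper's verbatim algorithm.

```python
# Hedged sketch (assumed details): set a linear predictor from the
# eigendecomposition of a moving-average correlation matrix of its inputs,
# instead of training it with gradient updates.
import torch

def update_predictor(corr_ema, z_batch, momentum=0.99, eps=1e-4):
    """z_batch: (batch, dim) online representations fed to the predictor.
    Returns the updated correlation EMA and a predictor weight matrix."""
    corr_batch = z_batch.t() @ z_batch / z_batch.size(0)        # (dim, dim)
    corr_ema = momentum * corr_ema + (1 - momentum) * corr_batch
    # Symmetric eigendecomposition; eigenvalues clamped for numerical stability.
    eigvals, eigvecs = torch.linalg.eigh(corr_ema)
    scale = eigvals.clamp(min=0).sqrt() + eps * eigvals.max()
    predictor_weight = eigvecs @ torch.diag(scale) @ eigvecs.t()
    return corr_ema, predictor_weight

dim = 128
corr_ema = torch.eye(dim)
corr_ema, W_p = update_predictor(corr_ema, torch.randn(256, dim))
pred = torch.randn(256, dim) @ W_p.t()  # apply the predictor to new representations
```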