Enhancing Contrastive Representation Learning through Data

Zhang, Honggen

Contrastive learning learns input representation by pushing similar data together and pulling dissimilar data away, along with data augmentation and pretext task construction. It enhances the large model learning due to its ability to use a large amount of unlabeled data. It has been suc- cessfully applied to large language models, pre-trained image models, and multimodal models. In addition, contrastive learning learns a representation from modeling the explainable structure of the latent space, which has a broad application in scientific discovery and interpretable Artificial Intelligence (AI). The primary focus of this thesis is to explore contrastive learning from a data construction perspective in real-world problems to fill the gap between the principle of contrastive learning and its application. The challenges, such as sampling bias and data quality, will largely affect the representations learned by contrastive learning. This thesis analyzes the data construction chanlledges and limitations in 1) the negative sampling of knowledge graph embedding (KGE), 2) high-quliaty preference data labeling of Large Language Models (LLMs) alignment, 3) data augmentation in Non-linear dynamic system modeling, and 4) data properties in functions of mesange RNA (mRNA) sequence. To solve the challenges 1), a hardness and structure-based objective function was proposed by considering sampling bias in hard negative sampling. For challenge 2), the similarity of response embedding is used to evaluate the quality of preference pairs to mitigate the labeling error of humans when they face an ambiguous response pair. Chal- lenge 3) is solved by systematically considering the physical system and contrastive learning. A data augmentation strategy by partitioning the full sequence is used for learning the transition matrix in the latent linear space. Challenge 4) is common to see in the biological domain due to the high cost of lab experiments. Pre-trained model will advantage the limited dataset su- pervised learning by learning general features from domain knowledge. A contrastive learning based teacher-student framework is proposed for mRNA sequence learning by contrasting the unmasked sequence and the hard-masked sequence. By providing careful data construction or data sampling, contrastive learning will be boosted to solve tasks in reality. For the KGE, the novel contrastive loss function learns the boundary between negative samples and positive samples to improve the link prediction task in the knowl- edge graph; For the LLM alignment, in the same labeling cost, the selected dissimilar responses will improve the vanilla direct preference optimization (DPO) alignment; The data augmentation with contrastive loss play crucial role to learn more accuracy dynamic system, which explained by the learned the continiues eigenfunction; By considering the tearch-student framework with hard-masked strategy, the pre-trained model achieve the state-of-the-art result by fine-tuning on limited downstrame task data. Overall, this thesis provides a broad data-driven contrastive learning methodology to enhance representation learning in different domains. The methodology consists of a imprived objective function in the face of data bias, a better data selection reducing labeling error, and proper data augmentation for a particular application domain. This methodology improve the learning result compare to traditional method.

More Like this