

This content will become publicly available on April 5, 2026

Title: Deep Cluster Distribution Alignment in Source-Target Domain Adaptation
Classification models trained on data from one source may underperform when tested on data acquired from different sources due to shifts in data distributions, which limit the models' generalizability in real-world applications. Domain adaptation methods proposed to align such shifts in source-target data distributions use contrastive learning or adversarial techniques, with or without internal cluster alignment. The intra-cluster alignment is typically performed using standalone k-means clustering on image embeddings. This paper introduces a novel deep clustering approach to align cluster distributions in tandem with adapting source and target data distributions. Our method learns and aligns a mixture of cluster distributions in the unlabeled target domain with those in the source domain in a unified deep representation learning framework. Experiments demonstrate that intra-cluster alignment improves classification accuracy in nine out of ten domain adaptation examples. These improvements range between 0.3% and 2.0% compared to k-means clustering of embeddings, and between 0.4% and 5.8% compared to methods without class-level alignment. Unlike current domain adaptation methods, the proposed cluster distribution-based deep learning provides a quantitative and explainable measure of distribution shifts in data domains. We have publicly shared the source code for the algorithm implementation.
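The abstract describes aligning a mixture of cluster distributions across domains, with KL divergence listed among the keywords. As an illustration only, here is a minimal NumPy sketch of one plausible ingredient: soft cluster assignments computed from embeddings, and a KL term between the batch-averaged cluster mixtures of the two domains. The function names and the soft-assignment form are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def soft_assignments(X, centers, temperature=1.0):
    """Softmax over negative squared distances to cluster centers,
    a common soft k-means-style assignment (assumed form)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def mixture_kl(p_src, p_tgt, eps=1e-8):
    """KL divergence between the batch-averaged cluster mixtures of the
    source and target domains; zero only when the mixtures coincide."""
    m_src = p_src.mean(axis=0) + eps
    m_tgt = p_tgt.mean(axis=0) + eps
    return float((m_src * np.log(m_src / m_tgt)).sum())

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 4))   # 3 clusters in a 4-d embedding space
p_src = soft_assignments(rng.normal(size=(32, 4)), centers)
p_tgt = soft_assignments(rng.normal(size=(32, 4)) + 0.5, centers)  # shifted target
kl = mixture_kl(p_src, p_tgt)       # larger KL indicates a bigger domain shift
```

A KL term like this can be minimized jointly with the representation, which is what makes the shift measure quantitative and inspectable.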
Award ID(s):
2431058
PAR ID:
10645191
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
Page Range / eLocation ID:
1 to 8
Subject(s) / Keyword(s):
Domain adaptation, deep clustering, contrastive learning, KL divergence, mixture of distributions
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Traditional unsupervised domain adaptation methods attempt to align source and target domains globally and are agnostic to the categories of the data points. This results in an inaccurate categorical alignment and diminishes the classification performance on the target domain. In this paper, we alter existing adversarial domain alignment methods to adhere to category alignment by imputing category information. We partition the samples based on category using source labels and target pseudo labels and then apply domain alignment for every category. Our proposed modification provides a boost in performance even with a modest pseudo label estimator. We evaluate our approach on 4 popular domain alignment loss functions using object recognition and digit datasets. 
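The abstract above describes partitioning samples by category, using source labels and target pseudo-labels, and then aligning each category separately. A minimal NumPy sketch of that partition-then-align loop follows; the squared mean-feature distance stands in for the adversarial alignment loss the paper actually modifies, and all names are hypothetical.

```python
import numpy as np

def per_category_alignment(src_feat, src_labels, tgt_feat, tgt_pseudo, num_classes):
    """Partition both domains by (pseudo-)label and sum a per-category
    alignment penalty; a squared mean-feature distance stands in here
    for the paper's adversarial alignment loss."""
    total = 0.0
    for c in range(num_classes):
        s = src_feat[src_labels == c]
        t = tgt_feat[tgt_pseudo == c]
        if len(s) == 0 or len(t) == 0:
            continue  # category missing from this batch: nothing to align
        total += float(((s.mean(axis=0) - t.mean(axis=0)) ** 2).sum())
    return total

rng = np.random.default_rng(0)
feats = rng.normal(size=(60, 8))
labels = rng.integers(0, 3, size=60)
```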
  2. Recent advancements in deep learning-based wearable human action recognition (wHAR) have improved the capture and classification of complex motions, but adoption remains limited due to the lack of expert annotations and domain discrepancies from user variations. Limited annotations hinder the model's ability to generalize to out-of-distribution samples. While data augmentation can improve generalizability, unsupervised augmentation techniques must be applied carefully to avoid introducing noise. Unsupervised domain adaptation (UDA) addresses domain discrepancies by aligning conditional distributions with labeled target samples, but vanilla pseudo-labeling can lead to error propagation. To address these challenges, we propose μDAR, a novel joint optimization architecture comprising three functions: (i) a consistency regularizer between augmented samples to improve the model's classification generalizability, (ii) a temporal ensemble for robust pseudo-label generation, and (iii) conditional distribution alignment to improve domain generalizability. The temporal ensemble aggregates predictions from past epochs to smooth out noisy pseudo-label predictions, which are then used in the conditional distribution alignment module to minimize the kernel-based class-wise conditional maximum mean discrepancy (kCMMD) between the source and target feature spaces and learn a domain-invariant embedding. The consistency-regularized augmentations ensure that multiple augmentations of the same sample share the same labels; this results in (a) strong generalization with limited source-domain samples and (b) consistent pseudo-label generation in target samples. The novel integration of these three modules in μDAR yields a ~4-12% average macro-F1 score improvement over six state-of-the-art UDA methods on four benchmark wHAR datasets.
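The temporal ensemble described above aggregates predictions from past epochs to smooth noisy pseudo-labels. A generic temporal-ensembling sketch in NumPy is shown below; the abstract does not give μDAR's exact aggregation rule, so the EMA form and the momentum value are assumptions.

```python
import numpy as np

class TemporalEnsemble:
    """Exponential moving average of per-sample class probabilities across
    epochs; pseudo-labels come from the smoothed predictions. A generic
    temporal-ensembling sketch, not μDAR's exact recipe."""
    def __init__(self, n_samples, n_classes, momentum=0.6):
        self.ema = np.zeros((n_samples, n_classes))
        self.momentum = momentum

    def update(self, probs):
        """Blend this epoch's softmax outputs into the running average
        and return the smoothed pseudo-labels."""
        self.ema = self.momentum * self.ema + (1 - self.momentum) * probs
        return self.ema.argmax(axis=1)

te = TemporalEnsemble(n_samples=4, n_classes=3)
epoch_probs = np.eye(3)[[0, 1, 2, 0]]   # confident one-hot predictions
pseudo = te.update(epoch_probs)
```

Averaging over epochs damps single-epoch prediction flips, which is what makes the resulting pseudo-labels safer inputs for a conditional alignment loss.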
  3. The scarcity of labeled data has traditionally been the primary hindrance to building scalable supervised deep learning models that retain adequate performance in the presence of various heterogeneities in sample distributions. Domain adaptation tries to address this issue by adapting features learned from a smaller set of labeled samples to the incoming unlabeled samples. Traditional domain adaptation approaches normally consider only a single source of labeled samples, but in real-world use cases, labeled samples can originate from multiple sources, providing motivation for multi-source domain adaptation (MSDA). Several MSDA approaches have been investigated for wearable sensor-based human activity recognition (HAR) in recent times, but their performance improvement over the single-source counterpart has remained marginal. To remedy this performance gap, we explore multiple avenues for aligning the conditional distributions in addition to the usual alignment of the marginal ones. In our investigation, we extend an existing multi-source domain adaptation approach to semi-supervised settings. We assume the availability of partially labeled target-domain data, and further explore the use of pseudo labeling with the goal of achieving performance similar to the partially labeled setting. In our experiments on three publicly available datasets, we find that limited labeled target-domain data and pseudo-label data boost the performance over the unsupervised approach by 10-35% and 2-6%, respectively, in various domain adaptation scenarios.
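A common guard when adding pseudo-labels to a semi-supervised setup like the one above is to keep only high-confidence target predictions, limiting error propagation. A minimal sketch follows; the threshold value and function name are assumptions for illustration, not details from the paper.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Keep only high-confidence target predictions as pseudo-labels, a
    standard guard against the error propagation of vanilla pseudo-labeling
    (the threshold value is an assumption, not from the paper)."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= threshold)[0]
    return keep, probs.argmax(axis=1)[keep]

probs = np.array([[0.95, 0.05],    # confident: kept as a pseudo-label
                  [0.60, 0.40]])   # ambiguous: discarded
idx, labels = select_pseudo_labels(probs)
```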
  4. Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, it is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor-intensive or where there is no agreed-upon ground truth. The well-studied k-means problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignment such that it minimizes the within-cluster sum of squares metric. Unfortunately, k-means can become prohibitive for very large, high-dimensional data sets, as iterative methods often rely on random access to, or multiple passes over, the data set — a requirement that is often not possible for large and potentially unbounded data sets. In this work we explore a randomized, approximate method for clustering called Tree-Walk Random Projection Clustering (TWRP), a fast, memory-efficient method for finding cluster embeddings in high-dimensional spaces. TWRP combines random projection with a tree-based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded search over the implied cluster bifurcation tree, represented as approximate vector and count values. The TWRP algorithm is described and experimentally evaluated for scalability and accuracy in the presence of noise against several other well-known algorithms.
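The core idea above, combining random projections with a binary partitioning tree, can be sketched in a few lines: each vector gets a leaf id from the sign pattern of a handful of random projections, in a single pass and without storing the full data set. This is only a simplified illustration; TWRP's bounded tree search over approximate vector and count values is omitted.

```python
import numpy as np

def rp_tree_codes(X, depth=3, seed=0):
    """Assign each vector a leaf id from the sign pattern of `depth`
    random projections: a single-pass, memory-light sketch of
    projection-tree partitioning (not the full TWRP algorithm)."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(depth, X.shape[1]))  # random split directions
    bits = (X @ dirs.T) > 0                      # which side of each split
    return bits @ (1 << np.arange(depth))        # pack bits into a leaf id

X = np.random.default_rng(1).normal(size=(50, 5))
codes = rp_tree_codes(X)   # integer leaf ids in [0, 2**depth)
```

Because nearby vectors tend to fall on the same side of most random hyperplanes, points in the same leaf are likely to be close, which is the intuition behind projection-tree clustering.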
  5.
    Domain adaptation methods have been introduced for auto-filtering disaster tweets to address the lack of labeled data for an emerging disaster. In this article, the authors present and compare two simple yet effective approaches for classifying disaster-related tweets. The first approach leverages the unlabeled target disaster data to align the source disaster distribution to the target distribution, and subsequently learns a supervised classifier from the modified source data. The second approach uses self-training to iteratively label the available unlabeled target data, and then builds a classifier as a weighted combination of source and target-specific classifiers. Experimental results using Naïve Bayes as the base classifier show that both approaches generally improve performance over the baseline. Overall, the self-training approach gives better results than the alignment-based approach. Furthermore, combining correlation alignment with self-training improves over correlation alignment alone, but self-training by itself still gives the best results.
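The correlation-alignment step mentioned above can be illustrated with a generic CORAL-style transform: whiten the source features, then re-color them with the target covariance, so a classifier trained on the transformed source better matches the target distribution. This closed form and the regularizer value are standard CORAL, shown as a sketch rather than the article's exact procedure.

```python
import numpy as np

def coral_align(src, tgt, eps=1e-3):
    """CORAL-style correlation alignment: whiten source features with the
    (regularized) source covariance, then re-color with the target
    covariance. eps is a small ridge term to keep the covariances
    invertible (value is an assumption)."""
    def mat_pow(m, p):
        # symmetric matrix power via eigendecomposition
        w, v = np.linalg.eigh(m)
        return (v * w ** p) @ v.T
    cs = np.cov(src, rowvar=False) + eps * np.eye(src.shape[1])
    ct = np.cov(tgt, rowvar=False) + eps * np.eye(tgt.shape[1])
    return src @ mat_pow(cs, -0.5) @ mat_pow(ct, 0.5)

rng = np.random.default_rng(1)
src = rng.normal(size=(200, 4)) * np.array([1.0, 2.0, 3.0, 4.0])  # mismatched scales
tgt = rng.normal(size=(200, 4))
aligned = coral_align(src, tgt)   # source recolored toward target statistics
```

After the transform, the source second-order statistics approximately match the target's, which is the property the alignment-based approach relies on before training its classifier.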