<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>06/10/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10632064</idno>
					<idno type="doi">10.1109/CVPR52734.2025.00917</idno>
					
					<author>Hritam Basak</author><author>Zhaozheng Yin</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Domain Adaptation (DA) and Semi-supervised Learning (SSL) converge in Semi-supervised Domain Adaptation (SSDA), where the objective is to transfer knowledge from a source domain to a target domain using a combination of limited labeled target samples and abundant unlabeled target data. Although intuitive, a simple amalgamation of DA and SSL is suboptimal in semantic segmentation due to two major reasons: (1) previous methods, while able to learn good segmentation boundaries, are prone to confuse classes with similar visual appearance due to limited supervision; and (2) a skewed and imbalanced training data distribution favors source representation learning while impeding exploration of the limited information about tailed classes. Language guidance can serve as a pivotal semantic bridge, facilitating robust class discrimination and mitigating visual ambiguities by leveraging the rich semantic relationships encoded in pre-trained language models to enhance feature representations across domains. Therefore, we propose the first language-guided SSDA setting for semantic segmentation in this work. Specifically, we harness the semantic generalization capabilities inherent in vision-language models (VLMs) to establish a synergistic framework within the SSDA paradigm. To address the inherent class-imbalance challenges in long-tailed distributions, we introduce class-balanced segmentation loss formulations that effectively regularize the learning process. Through extensive experimentation across diverse domain adaptation scenarios, our approach demonstrates substantial performance improvements over contemporary state-of-the-art (SoTA) methodologies. Code is available: GitHub.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The remarkable progress in deep learning has significantly enhanced the performance of visual understanding tasks, including image classification <ref type="bibr">[7]</ref>, object detection <ref type="bibr">[103]</ref>, and, more recently, semantic segmentation <ref type="bibr">[61,</ref><ref type="bibr">71]</ref>. These advancements have been particularly notable when a wealth of labeled training data is available. However, as noted in <ref type="bibr">[68]</ref>, their performance degrades precipitously when confronted with annotation-scarce environments, especially in the context of semantic segmentation, where dense pixelwise annotations are essential. Furthermore, these sophisticated models exhibit substantial vulnerability when tasked with generalizing across domains characterized by significant distributional shifts <ref type="bibr">[30,</ref><ref type="bibr">36]</ref> -a challenge particularly evident in real-world applications where models trained on synthetic data must maintain robust performance in naturalistic settings, such as autonomous navigation systems <ref type="bibr">[19,</ref><ref type="bibr">66]</ref>. This inherent limitation in cross-domain generalization has catalyzed the emergence of two pivotal research paradigms: Domain Adaptation (DA) and Semi-supervised Learning (SSL).</p><p>The confluence of DA and SSL has given rise to Semisupervised Domain Adaptation (SSDA), a hybrid approach that strategically leverages three distinct data streams: comprehensively labeled source domain data, sparsely labeled target domain samples, and a wealth of unlabeled target domain instances <ref type="bibr">[42,</ref><ref type="bibr">65]</ref>. While SSDA holds intuitive appeal for real-world applications, existing methods encounter critical limitations when applied to semantic segmentation tasks. 
Specifically, (1) despite achieving accurate segmentation boundaries, current approaches <ref type="bibr">[52,</ref><ref type="bibr">92]</ref> often suffer from misclassification among visually similar classes, due to restricted supervision within the target domain; (2) the SSDA framework tends to over-prioritize source domain features, driven by abundant source labels, while generating error-prone pseudo-labels for target data, which hampers adaptation performance; (3) class-imbalance, a common issue in real-world datasets, exacerbates these challenges, limiting effective exploration and representation of minority (tail) classes in the target domain.</p><p>To address the identified SSDA challenges, we augment the SSDA paradigm with vision-language (VL) guidance using VLMs (e.g., CLIP <ref type="bibr">[60]</ref>) to enrich semantic representation, leveraging their large-scale image-caption pretraining. By incorporating VLM features into a global-local context exploration module, we mitigate misclassification among visually similar classes. To tackle the over-reliance on source features, we introduce a joint embedding space guided by language priors, enhancing instance separability and reducing domain bias, unlike traditional divergence-based alignment methods <ref type="bibr">[41,</ref><ref type="bibr">94]</ref>. Finally, to combat class imbalance, we design a tailored cross-entropy loss that dynamically reweighs minority classes, thereby facilitating more equitable exploration and representation of tail classes in the target domain. Specifically, our contributions can be summarized as:</p><p>1. Language-Guided SSDA Framework: We pioneer the first language-guided SSDA framework for semantic segmentation by harnessing the rich semantic knowledge encoded in pre-trained Vision-Language Models (VLMs). 
Our novel attention-based fusion mechanism seamlessly integrates visual features with dense language embeddings, establishing a robust semantic bridge between the source and target domains while providing enhanced contextual understanding.
2. Enhanced Feature Localization: Recognizing that VL pre-training primarily operates at the image level, we address the critical challenge of feature localization in semantic segmentation through targeted fine-tuning. To mitigate the risks of overfitting and semantic knowledge degradation inherent in limited-annotation scenarios, we develop a sophisticated consistency regularization framework that preserves the rich semantic representations acquired during pre-training.
3. Adaptive Class-Balanced Loss: To tackle class imbalance in a limited annotation scenario, we introduce a Dynamic Cross-Entropy (DyCE) loss formulation that dynamically calibrates the learning emphasis on tail classes. This innovative, plug-and-play loss mechanism demonstrates broad applicability across various class-imbalanced learning scenarios.
4. State-of-the-Art Performance: Through detailed evaluation across diverse domain-adaptive and class-imbalanced segmentation benchmarks, our methodology demonstrates superior performance and robustness, consistently surpassing contemporary state-of-the-art approaches by significant margins.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Semi-supervised Domain Adaptation</head><p>Recent advances in Semi-Supervised Domain Adaptation (SSDA) for semantic segmentation have focused on utilizing limited labeled target data and abundant unlabeled data to bridge the domain gap at the pixel level <ref type="bibr">[1,</ref><ref type="bibr">2]</ref>. Early approaches like MME <ref type="bibr">[65]</ref> and ASDA <ref type="bibr">[58]</ref> used entropy minimization for feature alignment, but their classificationcentric strategies struggled with fine-grained segmentation tasks, leading to suboptimal boundary delineation. To address this, SSL-based methods such as DECOTA <ref type="bibr">[91]</ref> and SS-ADA <ref type="bibr">[89]</ref> employed teacher-student frameworks with consistency constraints, generating pseudo-labels for unlabeled target data. However, these methods faced issues with noisy pseudo-labels, particularly for minority and boundary classes. More recent methods have explored novel directions: S-Depth <ref type="bibr">[32]</ref> leverages self-supervised depth estimation as an auxiliary task to enhance feature learning, while DSTC <ref type="bibr">[24]</ref> introduces a domain-specific teacher-student framework that dynamically adapts to target domain characteristics. IIDM <ref type="bibr">[23]</ref> proposes an innovative inter-intradomain mixing strategy to address domain shift and limited supervision simultaneously. However, these methods often struggle with two critical limitations: class confusion due to limited supervision and skewed data distribution favoring source domain representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Vision Language Model</head><p>Vision-Language Models (VLMs), like CLIP and its extensions <ref type="bibr">[34,</ref><ref type="bibr">50,</ref><ref type="bibr">60,</ref><ref type="bibr">99]</ref>, leverage large-scale image-text pre-training for semantic segmentation via a shared embedding space that aligns visual and textual features. Initial zero-shot methods, such as MaskCLIP <ref type="bibr">[98]</ref> and GroupViT <ref type="bibr">[87]</ref>, struggled with boundary precision due to reliance on high-level features. Later, fine-tuned models like OpenSeg <ref type="bibr">[25]</ref> and LSeg <ref type="bibr">[40]</ref> improved segmentation accuracy using labeled data and text embeddings. Techniques such as ZegFormer <ref type="bibr">[17]</ref> and OVSeg <ref type="bibr">[46]</ref> utilize frozen CLIP features for mask proposal classification, while ZegCLIP <ref type="bibr">[100]</ref> aligns dense visual-text embeddings in a streamlined man-ner. Recently, SemiVL <ref type="bibr">[33]</ref> has shown that language cues can enhance semantic insights and mitigate class confusion; however, their application in domain adaptation remains under-explored. The recent LIDAPS model <ref type="bibr">[53]</ref> and follow-up works <ref type="bibr">[20,</ref><ref type="bibr">37,</ref><ref type="bibr">84]</ref> apply language guidance for domain bridging in panoptic segmentation but rely on manual thresholding for pseudo-mask filtering and a complex multi-stage training process. Despite their improvements, these methods still misclassify tail classes (e.g., fence, bike, wall), posing critical risks for applications like autonomous driving, where errors in identifying such objects can lead to severe consequences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Class-Imbalance Handling</head><p>Class imbalance significantly hinders real-world semantic segmentation, as small object classes often appear less frequently and cover fewer pixels than dominant background classes, unlike balanced datasets like CIFAR-10/100, Ima-geNet, and Caltech-101/256 <ref type="bibr">[15,</ref><ref type="bibr">26,</ref><ref type="bibr">38]</ref>. Data-level methods like oversampling/undersampling adjust sampling probabilities for minority classes but struggle in dense tasks due to uneven class distribution <ref type="bibr">[64,</ref><ref type="bibr">101]</ref>. Algorithmic strategies such as class-weighted losses address bias by penalizing rare classes more <ref type="bibr">[47]</ref>, but treating small object classes equally often leads to instability <ref type="bibr">[69,</ref><ref type="bibr">88]</ref>.</p><p>In unsupervised domain adaptation (UDA) for segmentation, common strategies include data-level adjustments using source domain frequencies <ref type="bibr">[29]</ref><ref type="bibr">[30]</ref><ref type="bibr">[31]</ref>, and adaptive weighting based on target statistics <ref type="bibr">[88]</ref>, but these are computationally costly for dense predictions. Approaches that relax pseudo-label filtering for rare classes still inherit source biases, causing misclassifications <ref type="bibr">[102]</ref>. Most UDA methods prioritize data-level sampling <ref type="bibr">[56]</ref>, overlooking the synergy of combining data and algorithmic approaches, which remains impractical and lacks generalizability for diverse tasks <ref type="bibr">[68,</ref><ref type="bibr">76,</ref><ref type="bibr">82]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Proposed Method</head><p>In our SSDA setting, we utilize image-label pair from the source domain</p><p>i=1 , and a large pool of unlabeled target data</p><p>, where N T r U k N T r L . Our proposed SemiDAViL framework effectively tackles the challenges of SSDA by leveraging VLpretrained encoders (subsection 3.1) for enriched semantic representation learning from Sr &#8746; T r L &#8746; T r U , addressing the issue of misclassification among visually similar classes. We incorporate dense semantic guidance from language embeddings (subsection 3.2) to enhance instance separability and reduce domain bias. Consistency-regularized SSL (subsection 3.3) mitigates over-reliance on source features, while the class-balancing DyCE loss (subsection 3.4) </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Vision-Language Pre-training</head><p>Previous regularization-based SSDA methods have shown effectiveness in semi-supervised semantic segmentation by enforcing stable predictions on unlabeled data. However, as discussed in section 2, they often struggle with distinguishing visually similar classes, especially when only a limited set of labeled target samples {(x T L i , y T L i )} is available. The primary issue arises due to the lack of diverse semantic coverage, leading to errors in class discrimination. To address this, we leverage Vision-Language Models (VLMs) like CLIP <ref type="bibr">[60]</ref>, which are trained on large-scale image-text datasets, D clip = {(x, t)}, where x and t are images and their associated captions. CLIP consists of a vision encoder E V and a language encoder E L , optimized jointly using a contrastive loss: To mitigate the limited semantic knowledge in standard consistency training in our SSDA framework, we initialize our (student-teacher) segmentation encoders E {S,T } V with CLIP's pre-trained vision encoder E V , rather than using an ImageNet-trained backbone. This transfer of rich semantic priors enables enhanced feature extraction and better semantic differentiation (as found in Table <ref type="table">3</ref> and well supported in <ref type="bibr">[33]</ref>), particularly for visually ambiguous classes, leading to more robust segmentation performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Dense Language Guidance (DLG)</head><p>Most prior VLM methods employ a standard attention mechanism for multi-modal feature integration <ref type="bibr">[16,</ref><ref type="bibr">49,</ref><ref type="bibr">96]</ref>, i.e., features from two modalities (query and key) generate an attention matrix to aggregate vision features based on language-derived weights. However, this approach only utilizes the language feature to compute attention scores, without directly incorporating it into the fused output, effectively treating the result as a reorganized single-modal vision feature. Consequently, the output vision feature dominates the decoder, leading to a substantial loss of language information. Based on our empirical findings (provided in supplementary file, and well supported in <ref type="bibr">[49]</ref>), we argue that while generic attention effectively processes value inputs, it fails to fully exploit query features for deep cross-modal interaction, resulting in insufficient fusion of vision and language modalities.</p><p>To address this, we utilize Dense Language Guidance (DLG) that transforms both the vision and language features into key-query pairs and treats them equally, as shown in Figure <ref type="figure">3</ref>. First, visual features F V &#8712; R h&#215;w&#215;c with h &#215; w dimension and c channels for image X are extracted through</p><p>To further utilize language embeddings F L &#8712; R n L &#215;c , we extract text description with n L tokens for X using off-the-shelf captioning model C, followed by CLIP-initialized language encoder E L with frozen weights &#966; L : F L &#8592; E L C(X ); &#966; L . This is followed by projection of F {V,L} to key-value pairs using linear layers: F {K,V } {V,L} &#8592; Linear(F {V,L} ). 
Next, the multi-modal keys are used to generate an attention matrix A &#8712; R^{n_L&#215;h&#215;w}.</p><p>Instead of applying attention to a single modality as in conventional methods, we normalize across both dimensions and compute cross-attention on vision and language features. Specifically, we employ a SoftMax activation followed by attention over F^V_V and F^V_L to generate language-attended vision features and vision-attended language features, respectively, ensuring balanced and comprehensive feature fusion.</p><p>Finally, these two attended feature maps are combined to generate a true multimodal feature representation F_M.</p></div>
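The symmetric fusion described above can be sketched as follows. The exact way the two attended maps are combined follows the paper's Figure 3, which is not reproduced in this extraction, so the final summation (and all layer names) should be read as an assumption:

```python
import torch
import torch.nn as nn

class DenseLanguageGuidance(nn.Module):
    """Sketch of DLG: vision and language features are both projected to
    key/value pairs; one affinity matrix A (n_L x hw) is normalized along
    each dimension to produce language-attended vision features and
    vision-attended language features, which are then combined."""

    def __init__(self, c: int):
        super().__init__()
        self.key_v, self.val_v = nn.Linear(c, c), nn.Linear(c, c)   # vision K/V
        self.key_l, self.val_l = nn.Linear(c, c), nn.Linear(c, c)   # language K/V

    def forward(self, f_v: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
        # f_v: (hw, c) flattened visual features; f_l: (n_L, c) caption tokens
        a = self.key_l(f_l) @ self.key_v(f_v).t()        # (n_L, hw) affinities
        attn_tokens = a.softmax(dim=0)                   # normalize over tokens
        attn_pixels = a.softmax(dim=1)                   # normalize over pixels
        vis_att = attn_tokens.t() @ self.val_l(f_l)      # language-attended vision (hw, c)
        lang_att = attn_pixels @ self.val_v(f_v)         # vision-attended language (n_L, c)
        # assumption: project the language-side map back to pixels and sum
        return vis_att + attn_tokens.t() @ lang_att      # multimodal feature F_M (hw, c)
```

Because language values enter the output directly (not only through attention weights), the fused feature retains genuine language content rather than being a reorganized vision feature.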
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Consistency Training (CT)</head><p>To effectively utilize labeled and unlabeled data in our SSDA setting, we utilize a student-teacher network <ref type="bibr">[70]</ref> for consistency training. Specifically, for unlabeled target data</p><p>, we obtain two multimodal features F S M , F T M (from DLG), and pass them through two identical but differently initialized decoders {D S V , D T V } with trainable parameters {&#952; S V , &#952; T V }, respectively, and enforce their predictions to be consistent:</p><p>1 student-teacher encoders E</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>{S,T } V</head><p>represented as E V for simplicity.</p><p>where y S p and y T p are the p th pixel prediction from student and teacher model:</p><p>To further utilize labeled target data {D Sr &#8746; D T r L }, we can employ a supervised CE loss between ground truth y and student prediction y S :</p><p>where N C represents the number of classes. However, L S might incur suboptimal performance due to inherent class imbalance, as discussed in subsection 2.3 and evident in previous methods in Table <ref type="table">4</ref>, Table <ref type="table">5</ref>. We propose a dynamic CE loss to alleviate this shortcoming, as detailed in subsection 3.4. The student branch is updated using a combined consistency and DyCE loss, whereas the teacher model is updated using an exponential moving average (EMA) of the student parameters:</p><p>where t is step number, &#945; is the momentum coefficient <ref type="bibr">[28]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Class-balanced Dynamic CE Loss (DyCE)</head><p>The Cross-Entropy (CE) loss measures the difference between predicted probabilities and ground truth labels by computing a negative log-likelihood for each class, averaged over all instances in the mini-batch:</p><p>where y i,c and p i,c are the GT and predicted probability for class c, S is the batch size. Taking the gradient of L CE for each instance, we have:</p><p>Hence, CE loss only updates the gradient for the target class per instance, using a uniform weight of -1 S . This leads to two key problems in large imbalanced datasets: (1) equal weighting across classes overlooks class imbalance, treating frequent and rare classes the same; (2) the gradient magnitude becomes vanishingly small as N scales to millions, causing ineffective updates (gradient vanishing). While recent studies have proposed reweighting schemes to address the class imbalance issue <ref type="bibr">[59,</ref><ref type="bibr">73]</ref>, they fail to tackle the core problem of diminished gradient magnitudes (refer to supporting evidence in supplementary file), limiting the optimization efficiency in dense segmentation tasks.</p><p>To address this, we propose a Dynamic CE (DyCE) loss that dynamically adjusts the weighting of gradients based on the class distribution within each mini-batch, addressing the persistent class imbalance issue that remains even after discarding simple instances. The key idea is to adaptively align the gradient contributions to the real-time class distribution at every training step. This is formalized as:</p><p>where f c = i&#8712;H y i,c is the total count of class c in the mined subset H which consists of h% hardest instances from the batch, f H = |H| is the count of instances in subset H. 
The loss computation involves four key steps: (1) computing the standard CE loss for each sample; (2) creating a subset H from the batch, similar to <ref type="bibr">[85]</ref>; (3) assigning dynamic class weights, inversely proportional to the mined class frequency f_c; and (4) scaling the loss by a volume weight -1/f_H^&#969;, which adjusts for the batch size and the mined subset size. The hyperparameter &#969; &#8712; (0, 1) acts as a weight-balancing factor, balancing the influence of instance-level and class-level weighting. The resulting gradient of the DyCE loss has a far greater magnitude than the uniform -1/S weight in Equation <ref type="formula">8</ref>, since S &#8811; f_H &#8811; f_c, and hence the vanishing gradient issue is resolved.</p></div>
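The four steps above can be sketched as a plug-and-play loss. The paper's exact Equation 9 is not reproduced by this extraction, so the precise form of the class and volume weights here (inverse mined frequency, normalization by f_H^&#969;) is an assumption consistent with the description:

```python
import torch
import torch.nn.functional as F

def dyce_loss(logits, targets, hard_frac=0.2, omega=0.5):
    """Dynamic CE sketch: (1) per-pixel CE; (2) mine the hard_frac hardest
    pixels into subset H; (3) weight each mined pixel inversely to its
    class frequency f_c within H; (4) normalize by f_H**omega instead of
    the full batch size, keeping gradient magnitudes from vanishing."""
    ce = F.cross_entropy(logits, targets, reduction="none").flatten()
    tgt = targets.flatten()
    k = max(1, int(hard_frac * ce.numel()))             # |H| = f_H
    hard_vals, hard_idx = ce.topk(k)                    # hardest instances form H
    hard_tgt = tgt[hard_idx]
    f_c = torch.bincount(hard_tgt, minlength=logits.size(1)).float()
    weights = 1.0 / f_c.clamp(min=1.0)[hard_tgt]        # inverse mined-class frequency
    return (weights * hard_vals).sum() / (k ** omega)   # volume weight 1/f_H^omega
```

Because f_c and f_H are recomputed from the mined subset at every step, the emphasis on tail classes tracks the real-time class distribution rather than a fixed dataset-level prior.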
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Dataset Description</head><p>We evaluate our proposed SSDA method on a segmentation task by adapting from two synthetic datasets, GTA5 <ref type="bibr">[62]</ref> and SYNTHIA <ref type="bibr">[63]</ref>, to the real-world Cityscapes dataset <ref type="bibr">[14]</ref>. The Cityscapes dataset consists of 2,975 training images and 500 validation images, all manually annotated with 19 classes. Since the test set annotations are not publicly available, we evaluate on the validation set, and tune the hyper-parameters on a small subset of the training set, following previous works <ref type="bibr">[4,</ref><ref type="bibr">32]</ref>. GTA5 provides 24,966 training images, and we consider the 19 classes that overlap with Cityscapes. The SYNTHIA dataset includes 9,400 fully labeled images, and we evaluate results based on the 16 classes it shares with Cityscapes.</p><p>Furthermore, to validate our DyCE loss's effectiveness, we evaluate on an extremely imbalanced medical dataset, Synapse <ref type="bibr">[39]</ref>. The Synapse dataset comprises 30 CT scans covering 13 different organs (i.e., foreground classes): spleen (Sp), right and left kidneys (RK/LK), gallbladder (Ga), esophagus (Es), liver <ref type="bibr">(Li)</ref>, stomach (St), aorta (Ao), inferior vena cava (IVC), portal and splenic veins (PSV), pancreas (Pa), and right and left adrenal glands (RAG/LAG). In this dataset, foreground voxels make up only 4.37% of the entire dataset, with 95.63% background, and the right adrenal gland contributes a mere 0.14% of foreground, whereas liver consists of 53.98% foreground, underscoring the severe class imbalance. Following the setup of <ref type="bibr">[76]</ref>, we split the dataset into 20 scans for training, 4 for validation, and 6 for testing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Implementation Details</head><p>Following SemiVL <ref type="bibr">[33]</ref>, we utilize ViT-B/16 vision encoder <ref type="bibr">[18]</ref> and a Transformer text encoder <ref type="bibr">[74]</ref>, both initialized with CLIP pre-training <ref type="bibr">[60]</ref> and generate dense embeddings following <ref type="bibr">[98]</ref>. The initial learning rate is set to 10 -4 , decaying exponentially with a factor of 0.9. We set the weight decay to 2&#215;10 -4 and momentum to 0.9. Following <ref type="bibr">[35]</ref>, we use BLIP-2 <ref type="bibr">[43]</ref> as our off-the-shelf captioning model C for all domains. T h, &#945; is set to 0.95, 0.999, following <ref type="bibr">[28,</ref><ref type="bibr">33]</ref>. Following <ref type="bibr">[23,</ref><ref type="bibr">72]</ref>, source images are resized to 760 &#215; 1280 and target images to 512 &#215; 1024, followed by random cropping to 512 &#215; 512. SemiDAViL is trained for 40k iterations which takes &#8764; 15 hours on an NVIDIA RTX4090 GPU using Python environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Findings and Comparison with SoTA</head><p>Our proposed SemiDAViL framework demonstrates significant gains on both GTA5&#8594;Cityscapes and Table 3. Ablation experiments using three different SSDA settings on GTA5&#8594;Cityscapes and Synthia&#8594;Cityscapes to identify the contribution of individual components: Consistency Training (CT), Dynamic Cross-Entropy loss (DyCE), Vision-Language Pre-training (VLP), and DenseLanguage Guidance (DLG). Components GTA5 &#8594; Cityscapes Synthia &#8594; Cityscapes CT DyCE VLP DLG 100 200 500 1000 100 200 500 1000 &#10003; ---54.5 58.2 62.3 64.6 60.2 65.3 71.5 72.0 &#10003; &#10003; --63.3 64.5 65.9 68.2 68.7 70.1 72.2 74.7 &#10003; -&#10003; -65.6 66.8 69.1 69.9 71.9 73.0 74.7 75.9 &#10003; -&#10003; &#10003; 70.3 71.6 72.1 73.9 74.9 76.8 77.7 79.2 &#10003; &#10003; &#10003; &#10003; 71.1 72.5 72.9 74.8 76.9 77.2 78.6 79.7</p><p>Synthia&#8594;Cityscapes benchmarks (Table <ref type="table">1</ref>, Table <ref type="table">2</ref>). We compare against state-of-the-art UDA, SSL, and SSDA techniques, highlighting its robustness with varying levels of target annotations.</p><p>In the GTA5&#8594;Cityscapes scenario, (A) using only 100 labeled target samples, SemiDAViL achieves 71.1% mIoU, outperforming the previous best, IIDM <ref type="bibr">[23]</ref>, by 1.6%. The advantage grows with 200 labeled samples, where we attain 72.5% mIoU, showcasing our framework's strength in leveraging limited annotations through language-guided features; (B) in the fully unsupervised setting, our method achieves 67.7% mIoU, a 5% improvement over the previous best, DIGA <ref type="bibr">[67]</ref>, owing to our language-guided joint embedding, which provides more robust semantic alignment than divergence-based methods like DaFormer <ref type="bibr">[29]</ref>; and (C) with 2975 labeled samples, our model reaches 75.2% Table <ref type="table">4</ref>. 
Class-wise performance evaluation of our proposed method (with and without the proposed class-balancing DyCE loss), and comparison with existing class-balanced UDA and SSDA methods. We report 19-class and 16-class mIoU scores on the GTA5 &#8594; Cityscapes and Synthia &#8594; Cityscapes settings, respectively, with 100 labeled target samples. The segmentation performance of tailed classes improves significantly by incorporating our DyCE loss in both settings. Our results are highlighted, whereas the previous-best and second-best results are marked in red and blue. Please refer to the supplementary file for detailed class distribution and improvement analysis.</p><p>GTA5 &#8594; Cityscapes (per-class IoU in the order Road, Sidewalk, Building, Walls, Fence, Pole, T-Light, T-Sign, Veg, Terrain, Sky, Person, Rider, Car, Truck, Bus, Train, Motor, Bike; last value is mIoU):
Supervised (2975 target labels):
UniMatch [92]: 97.2, 79.3, 90.6, 36.5, 52.1, 56.7, 64.2, 72.1, 91.1, 59.0, 93.6, 77.5, 53.5, 93.4, 73.8, 79.8, 67.8, 49.6, 71.2; mIoU 71.5
U2PL [79]: 97.6, 82.1, 90.7, 39.3, 53.4, 58.0, 64.2, 74.0, 90.9, 60.2, 93.7, 77.0, 49.8, 93.7, 64.8, 77.9, 47.9, 51.2, 73.6; mIoU 70.5
UDA (0 target labels):
CADA [90]: 91.3, 46.0, 84.5, 34.4, 29.7, 32.6, 35.8, 36.4, 84.5, 43.2, 83.0, 60.0, 32.2, 83.2, 35.0, 46.7, 0.0, 33.7, 42.2; mIoU 49.2
IAST [54]: 93.8, 57.8, 85.1, 39.5, 26.7, 26.2, 43.1, 34.7, 84.9, 32.9, 88.0, 62.6, 29.0, 87.3, 39.2, 49.6, 23.2, 34.7, 39.6; mIoU 51.5
DACS [72]: 89.9, 39.7, 87.9, 30.7, 39.5, 38.5, 46.4, 52.8, 88.0, 44.0, 88.8, 67.2, 35.8, 84.5, 45.7, 50.2, 0.0, 27.3, 34.0; mIoU 52.1
Shallow [4]: 91.9, 48.9, 86.0, 38.6, 28.6, 34.8, 45.6, 43.0, 86.2, 42.4, 87.6, 65.6, 38.6, 86.8, 38.4, 48.2, 0.0, 46.5, 59.2; mIoU 53.5
ProDA+distill [44]: 87.8, 56.0, 79.7, 46.3, 44.8, 45.6, 53.5, 53.5, 88.6, 45.2, 82.1, 70.7, 39.2, 88.8, 45.5, 59.4, 1.0, 48.9, 56.4; mIoU 57.5
CPSL+distill [44]: 92.3, 59.9, 84.9, 45.7, 29.7, 52.8, 61.5, 59.5, 87.9, 41.5, 85.0, 73.0, 35.5, 90.4, 48.7, 73.9, 26.3, 53.8, 53.9; mIoU 60.8
SSDA (100 target labels):
ALFSA [81]: 95.9, 71.5, 87.4, 39.9, 39.0, 44.6, 52.6, 60.4, 89.1, 50.7, 91.3, 73.1, 48.3, 91.3, 55.3, 63.7, 26.3, 55.8, 68.7; mIoU 63.4
SS-ADA+UniMatch [89]: 96.4, 75.0, 89.2, 43.7, 45.1, 53.3, 58.2, 68.8, 90.7, 55.4, 93.8, 75.8, 49.7, 91.6, 54.6, 67.4, 43.6, 47.2, 69.4; mIoU 66.8
SS-ADA+U2PL [89]: 96.5, 75.5, 89.7, 47.1, 47.7, 55.3, 60.6, 68.1, 90.6, 55.3, 92.1, 77.4, 52.5, 92.5, 67.1, 67.8, 41.2, 49.9, 70.8; mIoU 68.3
Ours w/o DyCE: 97.3, 80.2, 89.5, 50.2, 49.2, 58.3, 62.3, 69.3, 90.6, 57.7, 93.2, 78.6, 53.9, 92.9, 68.4, 67.9, 47.9, 56.5, 70.9; mIoU 70.3
Ours w/ DyCE: 98.1, 80.9, 90.3, 51.9, 51.5, 60.2, 62.5, 71.1, 90.9, 59.5, 93.2, 79.9, 56.8, 93.2, 71.6, 68.1, 49.1, 58.6, 71.4; mIoU 71.1</p><p>Synthia &#8594; Cityscapes (same class order; Terrain, Truck, and Train are absent and marked "-"):
Supervised (2975 target labels):
UniMatch [92]: 97.5, 82.1, 91.2, 52.4, 53.0, 60.7, 66.3, 75.3, 92.3, -, 94.1, 79.9, 57.5, 94.4, -, 82.1, -, 57.9, 74.5; mIoU 75.7
U2PL [79]: 97.5, 81.7, 90.0, 36.9, 50.9, 56.8, 59.9, 71.7, 91.6, -, 93.1, 76.5, 43.5, 93.6, -, 75.4, -, 45.2, 72.1; mIoU 71.0
UDA (0 target labels):
FADA [77]: 84.5, 40.1, 83.1, 4.8, 0.0, 34.3, 20.1, 27.2, 84.8, -, 84.0, 53.5, 22.6, 85.4, -, 43.7, -, 26.8, 27.8; mIoU 45.2
IAST [54]: 81.9, 41.5, 83.3, 17.7, 4.6, 32.3, 30.9, 28.8, 83.4, -, 85.0, 65.5, 30.8, 86.5, -, 38.2, -, 33.1, 52.7; mIoU 49.8
DACS [72]: 80.6, 25.1, 81.9, 21.5, 2.6, 37.2, 22.7, 24.0, 83.7, -, 90.8, 67.6, 38.3, 82.9, -, 38.9, -, 28.5, 47.6; mIoU 48.3
Shallow [4]: 90.4, 51.1, 83.4, 3.0, 0.0, 32.3, 25.3, 31.0, 84.8, -, 85.5, 59.3, 30.1, 82.6, -, 53.2, -, 17.5, 45.6; mIoU 48.4
ProDA+distill [44]: 87.8, 45.7, 84.6, 37.1, 0.6, 44.0, 54.6, 37.0, 88.1, -, 84.4, 74.2, 24.3, 88.2, -, 51.1, -, 40.5, 45.6; mIoU 55.5
CPSL+distill [44]: 87.2, 43.9, 85.5, 33.6, 0.3, 47.7, 57.4, 37.2, 87.8, -, 88.5, 79.0, 32.0, 90.6, -, 49.4, -, 50.8, 59.8; mIoU 57.9
SSDA (100 target labels):
SS-ADA+U2PL [89]: 91.0, 62.0, 86.7, 38.9, 33.4, 53.6, 58.9, 69.0, 91.0, -, 92.5, 73.9, 44.6, 92.3, -, 69.3, -, 37.3, 67.2; mIoU 66.4
SS-ADA+UniMatch [89]: 97.1, 79.4, 90.2, 49.8, 49.8, 56.9, 58.2, 72.2, 91.6, -, 93.4, 78.1, 53.3, 92.8, -, 69.1, -, 48.4, 72.1; mIoU 72.0
Ours w/o DyCE: 98.5, 78.9, 91.6, 52.7, 52.8, 62.3, 63.9, 74.3, 91.5, -, 94.4, 79.7, 56.9, 93.1, -, 76.6, -, 55.5, 74.9; mIoU 74.9
Ours w/ DyCE: 98.9, 80.9, 92.2, 57.6, 56.2, 63.8, 67.1, 76.7, 91.9, -, 95.9, 80.6, 59.9, 93.8, -, 78.9, -, 59.8, 76.6; mIoU 76.9</p><p>mIoU, surpassing the previous best, IIDM, by 1.9%, attributed to the effective handling of class imbalance via the DyCE loss. 
For Synthia&#8594;Cityscapes, similar gains are observed against the previous best methods like IIDM <ref type="bibr">[23]</ref> and ALFSA <ref type="bibr">[81]</ref>, demonstrating the strength of our dense language guidance and adaptive DyCE loss. Competing methods, such as DACS++ <ref type="bibr">[72]</ref>, attempt to address class imbalance via pseudo-label refinement, but their reliance on multi-stage training and hyperparameter tuning limits scalability. Unlike these, our end-to-end solution with DyCE loss adaptively re-weights based on class distribution, yielding balanced learning without the need for manual adjustments.</p><p>Overall, our framework consistently outperforms across all benchmarks and supervision levels, demonstrating its robustness and scalability. Detailed class-wise analysis is provided in subsection 4.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Ablation Experiments</head><p>We conduct ablation experiments to analyze the effectiveness of each component in our proposed framework: Consistency Training (CT), Dynamic Cross-Entropy loss (DyCE), Vision-Language Pre-training (VLP), and Dense Language Guidance (DLG) in Table <ref type="table">3</ref>. (A) Starting with CT as our baseline, we observe moderate performance with mIoU scores of 54.5% and 60.2% on GTA5 and Synthia respectively with 100 labeled samples. (B) The addition of our DyCE loss significantly improves performance by addressing class imbalance issues, boosting the mIoU by 8.8% (to 63.3%) and 8.5% (to 68.7%) respectively. This, along with the findings in Table <ref type="table">4</ref> and Table <ref type="table">5</ref> demonstrates the effectiveness of the proposed dynamic loss function to address class imbalance. (C) When incorporating VLP without DyCE, we achieve better results than DyCE alone, with mIoU improvements of 11.1% (to 65.6%) and 11.7% (to 71.9%) over the baseline. This demonstrates the effectiveness of leveraging semantic knowledge from pretrained vision-language models. (D) The addition of DLG further enhances performance substantially, reaching 70.3% and 74.9% mIoU, as it enables fine-grained semantic understanding through dense language embeddings. (E) Finally, our full model combining all components achieves the best performance across all settings, with notable improvements of 16.6% (to 71.1%) and 16.7% (to 76.9%) over the baseline with 100 labeled samples. The consistent performance gains across different label ratios demonstrate the complementary nature of our proposed components in addressing the challenges of SSDA for semantic segmentation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Detailed Analysis on the DyCE Loss</head><p>We provide a comprehensive evaluation of our proposed method (with and without the DyCE loss) against existing UDA and SSDA methods that adopt various solutions to the class-imbalance problem on the GTA5 &#8594; Cityscapes and Synthia &#8594; Cityscapes benchmarks, using 100 labeled target samples, in Table 4. (A) Without DyCE, our model already achieves competitive mIoU scores, but the addition of DyCE enhances class separability by adaptively re-weighting underrepresented classes.</p><p>To further validate the effectiveness of the DyCE loss, we evaluate it in a more challenging scenario: the Synapse medical dataset with severe class imbalance (95.63% background, 4.37% foreground; foreground classes vary from 53.98% to 0.14%). As summarized in Table <ref type="table">5</ref>, we highlight the significant impact of our DyCE loss as a plug-in enhancement across various SSL methods <ref type="bibr">[51,</ref><ref type="bibr">75,</ref><ref type="bibr">76,</ref><ref type="bibr">78,</ref><ref type="bibr">93]</ref>. (A) For the lowest-performing general SSL method in Table 5, i.e., UA-MT <ref type="bibr">[93]</ref>, the mDSC leaps from 20.3% to 48.1%, with minority classes like Gallbladder improving from 0.0% to 49.5%, showcasing DyCE's adaptive weighting. Similarly, URPC's mDSC increases from 25.7% to 48.9%, whereas the previous best general SSL-based method, DePL, sees a boost from 36.3% to 51.8%, underscoring DyCE's capability to mitigate severe class imbalance by prioritizing underrepresented organs (e.g., Right Adrenal Gland from 0.0% to 26.9%). (B) Recent SoTA balanced-SSL methods like DHC and A&amp;D also benefit, with DHC's mDSC rising from 48.6% to 57.9%, and A&amp;D reaching 65.5% (up from 60.9%). Notable gains include improvements in Gallbladder (DHC: 66.0% to 70.3%) and Left Adrenal Gland (A&amp;D: 27.2% to 30.1%), demonstrating DyCE's effectiveness as a plug-in loss across class-imbalanced datasets. 
Further experimental findings, along with qualitative and quantitative analyses, are provided in the supplementary material.</p></div>
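The plug-in behavior described above can be illustrated with a minimal sketch of DyCE-style re-weighting: class weights are recomputed from the observed label distribution and applied inside a standard pixel-wise cross-entropy, so the term can be dropped into any SSL objective. The function names and the inverse-frequency scheme below are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def dynamic_class_weights(class_counts, eps=1.0):
    """Inverse-frequency class weights, renormalized so the mean weight is 1.

    Sketch only: DyCE's actual weighting may differ; the idea taken from the
    text is that rarer classes receive larger weights, updated dynamically
    from the observed class distribution.
    """
    counts = np.asarray(class_counts, dtype=np.float64) + eps  # eps smooths empty classes
    w = 1.0 / counts
    return w * (len(w) / w.sum())  # normalize so weights average to 1

def dyce_loss(probs, labels, class_counts):
    """Class-weighted pixel-wise cross-entropy.

    probs:  (N, C) softmax outputs, one row per pixel
    labels: (N,)   integer class ids
    """
    w = dynamic_class_weights(class_counts)
    p = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    return float(np.mean(-w[labels] * np.log(p)))
```

Used as a drop-in replacement for the unweighted cross-entropy term of any SSL method, this is what allows a severely under-segmented class (e.g., one with near-zero frequency) to receive a proportionally larger gradient contribution.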
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In this work, we introduced SemiDAViL, a novel SSDA framework that leverages vision-language guidance and a dynamic loss formulation to address key challenges in domain-adaptive semantic segmentation. Through comprehensive experiments on several SSDA and SSL benchmarks, our method demonstrated consistent improvements over state-of-the-art techniques, especially in low-label regimes and class-imbalanced scenarios. The integration of vision-language pre-training, dense language embeddings, and the proposed DyCE loss contributes to discriminative feature extraction, better handling of minority classes, and enhanced semantic understanding. Overall, SemiDAViL sets a new benchmark in SSDA, showcasing strong generalizability across diverse domain shifts and label constraints.</p></div></body>
		</text>
</TEI>
