<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Proactive Disentangled Modeling of Trigger–Object Pairings for Backdoor Defense</title></titleStmt>
			<publicationStmt>
				<publisher>Tech Science Press</publisher>
				<date>01/01/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10655402</idno>
					<idno type="doi">10.32604/cmc.2025.068201</idno>
					<title level='j'>Computers, Materials &amp; Continua</title>
<idno>1546-2226</idno>
<biblScope unit="volume">85</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Kyle Stein</author><author>Andrew A Mahyari</author><author>Guillermo Francia III</author><author>Eman El-Sheikh</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks, where adversaries embed triggers into inputs to cause models to misclassify or misinterpret target labels. Beyond traditional single-trigger scenarios, attackers may inject multiple triggers across various object classes, forming unseen backdoor-object configurations that evade standard detection pipelines. In this paper, we introduce DBOM (Disentangled Backdoor-Object Modeling), a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats at the dataset level. Specifically, DBOM factorizes input image representations by modeling triggers and objects as independent primitives in the embedding space through the use of Vision-Language Models (VLMs). By leveraging the frozen, pre-trained encoders of VLMs, our approach decomposes the latent representations into distinct components through a learnable visual prompt repository and prompt prefix tuning, ensuring that the relationships between triggers and objects are explicitly captured. To separate trigger and object representations in the visual prompt repository, we introduce trigger–object separation and diversity losses that aid in disentangling trigger and object visual features. Next, by aligning image features with feature decomposition and fusion, as well as learned contextual prompt tokens in a shared multimodal space, DBOM enables zero-shot generalization to novel trigger-object pairings that were unseen during training, thereby offering deeper insights into adversarial attack patterns. Experimental results on CIFAR-10 and GTSRB demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of DNN training pipelines.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>As deep neural networks (DNNs) become more prevalent in applications such as natural language processing <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref> and object classification <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref>, they are increasingly being targeted by sophisticated security threats <ref type="bibr">[7,</ref><ref type="bibr">8]</ref>. The rise of generative AI <ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref> has enabled the large-scale creation of datasets sourced from online repositories. Although these datasets improve model robustness, they often bypass rigorous vetting, making them vulnerable to backdoor attacks <ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref>. Such attacks embed hidden triggers in training samples, causing models to misclassify inputs containing the trigger, for example, altering a stop sign's label to a speed limit sign. [...] model is trained. This lack of focus on the dataset creation phase represents a significant gap in input-level backdoor defense strategies <ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref><ref type="bibr">[23]</ref>. Malicious triggers can be embedded in training samples well before the model is exposed to them, undermining the integrity of the entire training process. Addressing this stage early in the pipeline not only prevents contaminated data from infiltrating the training process, but also reduces the computational costs associated with post-training purification efforts <ref type="bibr">[24,</ref><ref type="bibr">25]</ref>. 
Lastly, proactively analyzing the dataset offers deeper insights into the adversarial logic behind these backdoors, specifically how triggers interact with objects and how attackers strategically embed them to exploit vulnerabilities.</p><p>Although existing defenses can detect single or multiple backdoor triggers in a compromised dataset <ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref>, they remain strictly trigger-centric: flagged samples are discarded, and the object classes bearing those triggers are ignored. This removes valuable co-occurrence information about how specific triggers map onto particular objects, which could expose systematic attacker strategies. In realistic many-to-many attack scenarios <ref type="bibr">[31]</ref>, where adversaries plant various triggers across a wide range of object categories, a trigger-only approach would fail to recognize novel trigger-object combinations outside of its training set of known trigger-object pairings. For instance, assume a square-patch trigger is only ever seen on stop signs and a pixel-noise trigger only on speed-limit signs. If an attacker then applies that same square patch to yield signs or the pixel noise to pedestrian-crossing signs (pairings never observed before), those trigger-centric detectors may sharply degrade in performance, since they do not explicitly model which object the trigger appears on. By contrast, a co-occurrence-aware model that simultaneously identifies both triggers and object classes preserves the relational context between adversarial triggers and their targets. Rather than excluding compromised samples, this approach leverages modular relationships to learn comprehensive backdoor patterns and infer previously unseen trigger-object combinations. 
As a result, the model can accurately recognize the underlying object despite the presence of a trigger, integrate attacked examples into both training and inference workflows, and reduce false positives by distinguishing benign from malicious features. Moreover, modeling trigger-object relationships provides deeper forensic insights into attacker tactics, enabling dynamic update strategies that proactively defend models against evolving many-to-many backdoor attacks. Overall, existing input-level defenses in current state-of-the-art (SOA) attack scenarios remain strictly trigger-centric: they (1) identify and discard adversarial samples, losing the underlying object semantics and missing the opportunity to reveal adversarial strategies, (2) do not concurrently identify triggers and the associated object class, and (3) fail to generalize to novel trigger-object pairings.</p><p>To address these gaps, we present Disentangled Backdoor-Object Modeling (DBOM), a proactive framework based on VLMs and prompt tuning <ref type="bibr">[9]</ref>, designed to identify and isolate unseen backdoor-object configurations. Instead of inspecting a potentially compromised model, this approach focuses on learning trigger-object configurations within web-scraped training images before they are ever fed into a downstream model. Our method surpasses current SOA pre-training defense algorithms by detecting not only the types of backdoor triggers in compromised datasets, but also the underlying objects they target, thereby capturing the adversarial logic behind these malicious trigger-object pairings. Here, we define a trigger as the backdoor attack pattern embedded into an image and an object as the benign semantic class being manipulated. DBOM then factorizes these two primitives into independent embeddings (Fig. <ref type="figure">1</ref>), enabling modular representations of trigger-object configurations <ref type="bibr">[32]</ref>. 
Furthermore, by capturing the relationship among triggers and objects during training, previously unseen trigger-object pairings can be detected during inference, a problem traditional single-trigger detection pipelines overlook. The contributions of our approach are as follows:</p><p>&#8226; We introduce DBOM, a novel end-to-end disentangled representation learning framework that separates triggers and objects into independent latent visual primitives. By leveraging cross-modal attention for structured latent decomposition, DBOM aims to learn each trigger pattern and each object class in isolation. At inference, it recomposes these known trigger and object embeddings to recognize combinations never seen during training, achieving zero-shot generalization over trigger-object pairings and resulting in a method that is robust against adaptive backdoor strategies. &#8226; Our approach incorporates a dual-branch module that features a learnable visual prompt repository along with a dynamic soft prompt prefix adapter for prompt tuning. The learnable visual prompt repository allows us to capture primitive-specific features for both triggers and objects, aiding in feature disentanglement. Furthermore, by dynamically tuning text prompt representations based on image content, our module enhances the semantic context of each sample and improves the separation between trigger and object features. This design allows the framework to capture diverse trigger patterns across multiple object classes, overcoming the limitations of conventional defenses that assume a single, static trigger per dataset. &#8226; By integrating a proactive backdoor detection mechanism into the data curation process, DBOM identifies unseen backdoor-object attacks before downstream model training begins. A composite loss function that jointly minimizes cross-entropy, disentanglement, and prompt alignment losses ensures that poisoned samples are identified and isolated for removal from the dataset. 
</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Disentanglement involves separating visual primitives of images into independent components <ref type="bibr">[33]</ref><ref type="bibr">[34]</ref><ref type="bibr">[35]</ref><ref type="bibr">[36]</ref><ref type="bibr">[37]</ref>. A central strategy for addressing this task is to train models that learn these independent components and recombine them in novel ways, thereby enabling the flexible recognition of previously unseen trigger-object pairings. Li et al. <ref type="bibr">[14]</ref> apply symmetry and group theory to model primitive relationships, introducing a novel distance function. A Siamese Contrastive Embedding Network (SCEN) <ref type="bibr">[38]</ref> embeds visual features into a contrastive space to separately model primitive diversity. A retrieval-augmented approach improves recognition of unseen primitive component pairings by retrieving and refining representations <ref type="bibr">[39]</ref>. Recent methods integrate vision-language models (VLMs) such as CLIP <ref type="bibr">[9]</ref> to enhance the recognition of structured relationships between the underlying nature of images and text prompts. Compositional Soft Prompting (CSP) <ref type="bibr">[40]</ref> utilizes a static prompt prefix alongside learned primitive embeddings, with predictions based on cosine similarity between text and image features. Later works remove the static prefix, making the entire prompt learnable <ref type="bibr">[41,</ref><ref type="bibr">42]</ref>. In the context of DBOM, disentangling triggers and objects allows our model to factor visual embeddings into two primitive subspaces: one that captures adversarial trigger patterns and one that encodes the class object semantics. Once these primitives are learned, unseen trigger-object pairings can be inferred during testing.</p><p>Backdoor Attacks became prominent with the introduction of Badnets <ref type="bibr">[12]</ref>. 
Badnets demonstrated how adversaries can embed backdoors into DNNs by poisoning the training data with trigger-patterned images, such as a single white square or pixelated patterns, so that inputs are misclassified. Liu et al. <ref type="bibr">[13]</ref> introduced trojaning attacks, which differ from Badnets by reverse-engineering neuron activations to generate adversarial triggers that maximize activations in specific neurons. Li et al. <ref type="bibr">[43]</ref> explored techniques to make triggers harder to detect by implementing steganographic embedding, where backdoor triggers are hidden within images at the pixel level. Recent backdoor attacks include Wanet <ref type="bibr">[15]</ref>, a warping-based trigger, which introduces imperceptible image distortions as triggers instead of traditional noise perturbations.</p><p>Backdoor Defenses mostly operate in the adversarial machine learning life-cycle at the model level, leaving the dataset vetting process largely unexplored <ref type="bibr">[44]</ref>. Several works attempt to filter adversarial images before training <ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref><ref type="bibr">[29]</ref>, but these rely on detecting known trigger-object configurations and fail to generalize to unseen pairings. VisionGuard <ref type="bibr">[21]</ref> compares the softmax outputs of original and transformed images using metrics like the Kullback-Leibler divergence to detect attacks without altering the target network. Deep k-NN <ref type="bibr">[20]</ref> leverages deep feature space clustering and k-nearest neighbor voting to detect and remove poisoned images from the training set prior to downstream model training. HOLMES <ref type="bibr">[22]</ref> employs multiple external detectors trained on both dedicated labels and top-k logits to capture subtle differences between benign and adversarial inputs. 
Traditional backdoor defenses assume a compromised model and attempt to mitigate attacks post-training <ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref>. However, these techniques reactively address attacks after deployment by cleaning the model, whereas our approach proactively filters poisoned images before they enter the downstream training pipeline, preventing backdoor contamination at its source. Furthermore, these methods overlook the opportunity to identify trigger-object configurations that were not seen during their training, which is addressed in this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Preliminaries and Insights</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Trigger-Object Representation</head><p>We define a backdoor configuration as a pairing of a trigger and an object, where the trigger serves as the adversarial modification and the object represents the underlying semantic class being targeted (e.g., "stop sign," "yield sign," "airplane"). Let T be the set of all possible triggers, and O be the set of object categories, where T = {t_0, t_1, ..., t_n} and O = {o_0, o_1, ..., o_m}. The complete set of potential trigger-object pairings is given by P = T &#215; O, where each pair (t, o) &#8712; P corresponds to a unique backdoor attack configuration.</p><p>These pairings can be categorized into two groups: (1) seen pairings (P_s), which are explicitly observed during training, and (2) unseen pairings (P_u), which do not appear in the training set but may still be encountered during deployment. These subsets are disjoint (P_s &#8745; P_u = &#8709;) and together form the complete space of possible attack configurations (P_s &#8746; P_u = P). During evaluation, test samples are drawn from a predefined set P_test &#8838; P, which contains both seen and unseen pairings. The objective of our approach is to learn a function f : X &#8594; P_test, where X represents the input space of images containing these trigger-object configurations. The function f is designed to map an image to its corresponding attack configuration, enabling generalization to unseen trigger-object pairs that were not part of the training distribution. Furthermore, we note that the goal of this paper is not to train an infected model or defend against attacked models, but to detect backdoored images before downstream model training begins.</p></div>
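<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal illustration of the pairing space defined above, the following sketch enumerates P = T &#215; O for the road-sign example from the introduction and splits it into seen and unseen subsets. The trigger and object names and the particular choice of seen pairs are hypothetical and chosen only for illustration; they are not the splits used in the experiments.</p>
<ab><![CDATA[
```python
from itertools import product

# Hypothetical primitives, taken from the introduction's road-sign example.
triggers = ["square_patch", "pixel_noise"]
objects = ["stop_sign", "speed_limit_sign", "yield_sign", "pedestrian_crossing_sign"]

# P = T x O: every trigger-object pairing an attacker could construct.
pair_space = set(product(triggers, objects))

# P_s: pairings observed during training (an illustrative choice).
seen_pairs = {
    ("square_patch", "stop_sign"),
    ("pixel_noise", "speed_limit_sign"),
}

# P_u: everything else in P, so that P_s and P_u partition P.
unseen_pairs = pair_space - seen_pairs

if __name__ == "__main__":
    print(len(pair_space), len(seen_pairs), len(unseen_pairs))
```
]]></ab>
<p>In this toy split, the square patch on a yield sign falls in the unseen set even though both the trigger and the object are individually known, which is the situation the function f is meant to handle.</p></div>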
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Threat Model and Defender Goals</head><p>Threat Model. We assume an adversary injects backdoor attacks based on trigger-object pairings into a web-scraped or publicly available dataset used for training a downstream DNN. The goal is to cause the model to misclassify inputs containing triggers into a target label while maintaining normal classification on clean images. Since large datasets are rarely vetted on a per-sample basis, malicious samples blend easily with clean data. Furthermore, attackers can escalate this threat by injecting multiple triggers across different classes, including novel, unseen trigger-object pairings, so that conventional defenses which expect a single static trigger fail to detect them. Consequently, the compromised data is used in downstream training, embedding hidden adversarial behaviors into the final model.</p><p>Defender's Goal. The defender's goal is to identify backdoored images prior to downstream model training, ensuring they are isolated while minimizing the misclassification of clean images. Given a potentially poisoned dataset that contains several trigger-object configurations, the defender must distinguish legitimate images from those carrying triggers. Furthermore, by concurrently identifying both the trigger and the underlying object, the defender gains vital insight into the adversary's strategies. Moreover, separating the adversarial trigger from the underlying object enables the recovery of correct object semantics in backdoored samples, eliminating the need to discard these adversarial samples from training or inference.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Proposed Framework</head><p>DBOM leverages CLIP as its backbone by freezing its pre-trained visual and text encoders. Let f_&#952;(&#8901;) denote the CLIP image encoder and g_&#981;(&#8901;) denote the CLIP text encoder. Given an input image x_i, the image encoder extracts visual features f_v = f_&#952;(x_i) &#8712; R^d, which serve two purposes: (i) they are used to retrieve the most relevant visual prompts from a learnable repository, and (ii) they provide the bias for shifting a set of learnable prefix text tokens [v1][v2][v3] via a prompt adapter network. Unlike fixed prefix templates (e.g., "a photo of"), our approach employs prompt tuning, a technique where these prefix tokens are treated as learnable parameters and optimized end-to-end to capture task-specific context for each image. This allows the text prompt to be tailored to the visual content of each image, promoting the alignment between visual and textual modalities. The shifted prefix is then combined with the trigger and object word embeddings to form the final prompt t_i, which is processed by the text encoder to produce text features f_t = g_&#981;(t_i) &#8712; R^768. Lastly, f_v and f_t are decomposed and fused, and their joint representation is mapped into a separate pair space where the similarity between the image and fused features helps determine the final trigger-object prediction. Fig. <ref type="figure">2</ref> displays the overall architecture of the proposed approach. During training, each image retrieves visual prompts from the repository, shifts a learnable text prefix with a prompt adapter, and fuses decomposed image-text features via cross-attention. During inference, the framework again retrieves the top visual prompts, shifts the text prompt for each new image, and computes similarity scores to pinpoint unseen trigger-object pairings. 
Lastly, in separate pair spaces, the logits are computed by comparing the fused image-text features with the visual features from the frozen visual encoder, as well as the selected visual prompts and the text features from the frozen text encoder. The highest-scoring trigger-object pair is then selected as the predicted configuration. By detecting malicious seen and unseen configurations in this way, DBOM identifies backdoored configurations and isolates them for removal prior to downstream model training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Visual Prompt Repository</head><p>The visual prompt repository comprises a collection of M learnable visual prompts {P_i}_{i=1}^{M}, with each prompt P_i &#8712; R^{l&#215;d} paired with a learnable key a_i &#8712; R^d. These prompts capture high-level visual semantics and are refined during training. For a given image, cosine similarity is computed between the normalized image features f_v and each normalized key. Based on the similarity scores, the two most similar prompts are selected. One is intended to align with the image's trigger and the other with the object. To enforce this specialization, we introduce two auxiliary losses. The trigger-object separation loss is formulated as:</p><p>Because our primary objective is to accurately flag backdoored images, the loss function prioritizes the trigger key by encouraging it to achieve a higher similarity score than the object key, with the object serving as complementary context for the image. The visual prompt diversity loss is defined as:</p><p>where m = 0.5 is a fixed margin. This term penalizes any excessive similarity between the retrieved trigger and object visual prompts, thereby promoting disentangled features for more distinct representations <ref type="bibr">[45]</ref>. Combining these terms yields:</p><p>which guides the prompts to distinctly capture trigger and object characteristics. During training, the visual prompt repository is updated end-to-end with L_vis. This ensures that the repository vectors are not static but are continuously refined to distinguish between trigger and object features. The final representation of the retrieved visual prompts is denoted by f_ret.</p></div>
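<div xmlns="http://www.tei-c.org/ns/1.0"><p>The retrieval step described above can be sketched in a few lines of plain Python. The function retrieve_top2 ranks repository keys by cosine similarity to the image feature and keeps the two best matches. The function diversity_hinge is only one plausible reading of a margin-based diversity penalty (zero until the similarity between two prompts exceeds the margin m = 0.5); its exact form is an assumption, and prompts are treated here as flat vectors for simplicity. All function names are hypothetical.</p>
<ab><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def retrieve_top2(image_feat, keys):
    """Return indices and scores of the two keys most similar to the image."""
    scores = [cosine(image_feat, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    top = ranked[:2]
    return top, [scores[i] for i in top]


def diversity_hinge(prompt_a, prompt_b, margin=0.5):
    """Assumed hinge form: penalize prompt similarity only above the margin."""
    return max(0.0, cosine(prompt_a, prompt_b) - margin)


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(retrieve_top2([1.0, 0.1], keys))
    print(diversity_hinge([1.0, 0.0], [0.0, 1.0]))
```
]]></ab>
<p>In the actual framework the keys and prompts are learnable parameters updated with L_vis; this sketch only shows the forward lookup and a candidate penalty.</p></div>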
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Dynamic Prefix Adapter</head><p>Traditional prompt tuning approaches <ref type="bibr">[9,</ref><ref type="bibr">40,</ref><ref type="bibr">46]</ref> use a fixed soft prompt prefix, where a sequence such as [trigger][object] is preceded by a prefix initialized from a phrase such as "a photo of". This means that the same prefix is applied to every sample, regardless of the unique characteristics of the trigger or object in the image.</p><p>This prefix rigidity can hinder the system's ability to accurately distinguish between different trigger-object pairs. Motivated by the work in <ref type="bibr">[46]</ref>, we propose an adaptive prompt network module that dynamically adjusts the learnable prefix tokens based on the visual content of the input image. Such prompt tuning has been shown to transfer the frozen backbone's generalization power to entirely new tasks with very few labeled examples <ref type="bibr">[46]</ref><ref type="bibr">[47]</ref><ref type="bibr">[48]</ref>.</p><p>Specifically, the prompt adapter utilizes the image features f_v to compute a bias term that is added to the base prompt tokens, thus tailoring the prompt to each individual sample. In addition, by dynamically shifting the soft-prompt prefix based on each image's visual features, the prompt prefix adapter aligns the text embeddings more closely with the specific trigger and object primitives, which in turn lets the model accurately recombine those known primitives into novel, unseen pairings at inference, improving zero-shot pairing performance.</p><p>The prompt adapter is implemented as a lightweight neural network defined by:</p><p>where &#963;(&#8901;) denotes the ReLU activation function, and W_1, W_2, b_1, and b_2 are trainable parameters. The output, &#966;(f_v), represents the bias added element-wise to the original prompt embeddings {&#952;_0, &#952;_1, ..., &#952;_p} via &#952;&#8242;_i = &#952;_i + &#966;_i(f_v) for i = 0, ..., p. 
The final text prompt t_i is constructed by concatenating {&#952;&#8242;_0, &#952;&#8242;_1, ..., &#952;&#8242;_p} with the trigger and object word embeddings, &#952;_t and &#952;_o, respectively. Lastly, t_i is fed into the text encoder to generate the text features f_t.</p></div>
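<div xmlns="http://www.tei-c.org/ns/1.0"><p>The adapter described above can be sketched as a two-layer network with a ReLU in between, which is the natural reading of the parameters W_1, b_1, W_2, b_2 and &#963;. The sketch below is written in plain Python rather than PyTorch so that it stays self-contained. It assumes that one bias vector is computed per image and added to every prefix token; whether the original uses a shared or a per-token bias is not recoverable from the text, so treat that as an assumption. All function names are hypothetical.</p>
<ab><![CDATA[
```python
def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in vec]


def affine(weight, bias, vec):
    """Compute weight @ vec + bias for a row-major weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def prompt_bias(f_v, w1, b1, w2, b2):
    """Assumed adapter form: phi(f_v) = W2 * relu(W1 * f_v + b1) + b2."""
    return affine(w2, b2, relu(affine(w1, b1, f_v)))


def shift_prefix(prefix_tokens, bias):
    """Add the image-conditioned bias to every learnable prefix token."""
    return [[t + b for t, b in zip(token, bias)] for token in prefix_tokens]


if __name__ == "__main__":
    f_v = [1.0, -2.0]                      # toy image feature
    w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
    w2, b2 = [[2.0, 0.0], [0.0, 2.0]], [0.5, 0.5]
    bias = prompt_bias(f_v, w1, b1, w2, b2)
    print(shift_prefix([[0.0, 0.0], [1.0, 1.0]], bias))
```
]]></ab>
<p>The shifted prefix tokens would then be concatenated with the trigger and object word embeddings before being passed to the frozen text encoder.</p></div>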
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Feature Decomposition and Fusion</head><p>To disentangle and jointly embed the representations of triggers and objects for backdoor detection, we decompose and then fuse the visual features, f_v, and the text features, f_t <ref type="bibr">[42]</ref>. We first isolate how each trigger and object contributes to the text representation by averaging their respective logits. This decomposition helps the model treat triggers and objects as independent primitives, ensuring that potential backdoor triggers are not blended with the underlying objects during subsequent fusion. During training, we explicitly supervise these decomposed features to capture the semantics of each trigger and object class.</p><p>Formally, we compute the trigger and object probabilities as follows:</p><p>where T is the set of possible triggers, O is the set of possible objects, and &#952; denotes the learnable parameters.</p><p>We then optimize cross-entropy losses for the trigger (L_tri) and object (L_obj) predictions:</p><p>where P_s denotes the set of seen trigger-object pairings.</p><p>Next, f_v and f_t are fused with a cross-attention mechanism that aligns the image and text features within a joint embedding space. Specifically, we define the query Q from f_t, and the key K and value V from f_v. The query identifies the textual aspects that need to be emphasized in the visual representation; the key-value pairs in the visual space highlight regions or features corresponding to each textual element:</p><p>where d is the feature dimensionality. The result of this cross-attention is f_{t&#8594;v}, a fused representation that integrates the textual context of the triggers and objects with the corresponding visual features.</p></div>
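<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fusion step above matches the usual pattern of scaled dot-product cross-attention, with queries taken from the text side and keys and values from the visual side. The sketch below assumes the standard form softmax(Q K^T / sqrt(d)) V and omits any learned projection matrices, which the original may well include. Function names and the toy inputs are hypothetical.</p>
<ab><![CDATA[
```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Assumed standard form: softmax(Q K^T / sqrt(d)) V, one row per query."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused


if __name__ == "__main__":
    text_queries = [[0.0, 0.0], [10.0, 0.0]]     # stand-ins for text features
    visual_tokens = [[1.0, 0.0], [0.0, 1.0]]     # stand-ins for visual features
    print(cross_attention(text_queries, visual_tokens, visual_tokens))
```
]]></ab>
<p>A query with no preference (all zeros) averages the visual tokens, while a query strongly aligned with one key returns almost exactly that key's value, which is the behavior the fused representation relies on.</p></div>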
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Training and Inference</head><p>Our framework trains in two main stages: we first adapt the soft prompt so that the fused features f_{t&#8594;v} correctly capture the target trigger-object pairings, and then we ensure the textual representation f_t is consistent with the retrieved visual prompt. We compute the probability of a trigger-object pair (t, o) by comparing the image feature f_v to the fused representation f_{t&#8594;v}:</p><p>Minimizing the cross-entropy over these probabilities yields the soft prompt alignment loss L_sp. This encourages the shifted soft prompt to correctly identify the trigger-object pairs for samples in P_s. Next, we require that the textual representation f_t matches the retrieved pairing from the prompt repository. We define:</p><p>Minimizing the cross-entropy over these probabilities produces the retrieval alignment loss L_ret. The total loss is a weighted sum of these components along with the prompt losses:</p><p>During inference, the learned prompt adapter shifts the prefix tokens, the visual prompts are retrieved and averaged, and the logits are computed based on the similarity between the image and text features in the pair space. The predicted trigger-object text labels are selected by: &#375; = arg max_{(t,o) &#8712; P_test} p_sp(t, o | x)</p><p>where P_test denotes the set of test trigger-object pairings, which includes seen and unseen configurations, and p_sp is computed following the same procedure in Eq. <ref type="bibr">(10)</ref>.</p></div>
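<div xmlns="http://www.tei-c.org/ns/1.0"><p>The inference rule above can be illustrated with a small sketch that scores every candidate pair in P_test against an image feature and returns the arg max. It assumes a CLIP-style score, namely a softmax over temperature-scaled cosine similarities; the temperature value, the feature vectors, and the use of a "clean" trigger label to decide whether an image is flagged are all assumptions made for illustration.</p>
<ab><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict_pair(image_feat, pair_feats, temperature=0.07):
    """Return the highest-probability (trigger, object) label and all probabilities."""
    labels = list(pair_feats)
    logits = [cosine(image_feat, pair_feats[label]) / temperature for label in labels]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


def is_flagged(predicted_pair, clean_label="clean"):
    """Flag an image when its predicted trigger is anything other than clean."""
    trigger, _ = predicted_pair
    return trigger != clean_label


if __name__ == "__main__":
    candidates = {                       # hypothetical fused pair features
        ("clean", "stop_sign"): [1.0, 0.0],
        ("square_patch", "stop_sign"): [0.0, 1.0],
    }
    pair, probs = predict_pair([0.1, 1.0], candidates)
    print(pair, is_flagged(pair))
```
]]></ab>
<p>Because the candidate set is just a collection of (trigger, object) labels, unseen pairings can be scored in exactly the same way as seen ones, provided features for their constituent primitives are available.</p></div>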
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments and Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experimental Setup</head><p>Attacks and Splits. We conduct experiments using two benchmark datasets: CIFAR-10 <ref type="bibr">[49]</ref> and GTSRB <ref type="bibr">[50]</ref>. CIFAR-10 contains 50,000 training images and 10,000 test images across 10 object classes, while GTSRB consists of 39,209 training images and 12,630 test images spanning 43 traffic sign classes. Recent studies <ref type="bibr">[21,</ref><ref type="bibr">51]</ref> have shown that adversaries can place backdoor triggers directly on traffic signs to mislead advanced driver-assistance and autonomous-driving systems. Therefore, GTSRB provides a practical, safety-critical testbed for evaluating our proposed data-level defense system. To introduce backdoor vulnerabilities, we generate contaminated versions of all clean images using six attack patterns, while retaining the clean images themselves as an individual class. The six widely recognized backdoor attacks employed are: Badnets Square (Badnets-SQ) <ref type="bibr">[12]</ref>, Badnets Pixels (Badnets-PX) <ref type="bibr">[12]</ref>, Trojan Square (Trojan-SQ) <ref type="bibr">[13]</ref>, Trojan Watermark (Trojan-WM) <ref type="bibr">[13]</ref>, l_2-inv <ref type="bibr">[43]</ref>, and l_0-inv <ref type="bibr">[43]</ref>. These attacks encompass a diverse range of backdoor characteristics, including universality, label specificity, and variations in trigger shape, size, and placement. This results in a trigger-object pairing space of 301 unique pairings for GTSRB and 70 pairings for CIFAR-10.</p><p>Implementation Details. We utilize PyTorch 1.12.1 <ref type="bibr">[52]</ref> for the implementation of our model. The model is optimized using the Adam optimizer <ref type="bibr">[53]</ref> and is trained for 20 epochs on each of the datasets described above. 
Both the image encoder and text encoder are based on the pretrained CLIP ViT-L/14 model, and the entire model is trained and evaluated on a single NVIDIA 2080 Ti GPU. We set M = 20 for both GTSRB and CIFAR-10. To assess scalability and accuracy trade-offs, we also repeat all experiments with the smaller CLIP variants ViT-B/16 and ViT-B/32, using the same training schedule and hyperparameters.</p></div>
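<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pairing counts quoted in the setup follow from simple arithmetic if the clean class is counted as a seventh trigger state alongside the six attacks. That reading is an assumption, but it is consistent with the reported figures, since 7 &#215; 43 = 301 and 7 &#215; 10 = 70. A short check, with hypothetical variable names:</p>
<ab><![CDATA[
```python
# The six attacks named in the experimental setup, plus the clean class
# treated as one additional trigger state (an assumption about the counting).
attacks = ["Badnets-SQ", "Badnets-PX", "Trojan-SQ", "Trojan-WM", "l2-inv", "l0-inv"]
trigger_states = ["clean"] + attacks


def pairing_count(num_object_classes):
    """Size of the trigger-object pairing space for a given number of classes."""
    return len(trigger_states) * num_object_classes


if __name__ == "__main__":
    print("GTSRB:", pairing_count(43))
    print("CIFAR-10:", pairing_count(10))
```
]]></ab>
</div>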
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Unseen Trigger-Object Evaluation</head><p>is experiment evaluates the performance of DBOM in both the seen (S) and unseen (U) triggerobject pairing scenarios. Specifically, the accuracy for each trigger-object pairing type is measured, assessing both the Attack (trigger) and Object classifications separately. To provide a comprehensive evaluation, we report the Harmonic Mean (HM) of the seen and unseen accuracies, which balances performance across known and novel pairings. In addition, we calculate the area under the curve (AUC), which serves as the primary metric for assessing the overall effectiveness of the model in detecting trigger-object configurations. We compare DBOM's results with CoOP <ref type="bibr">[46]</ref> and CSP <ref type="bibr">[40]</ref> since they represent two distinct approaches for leveraging CLIP in modeling triggers and objects as separate primitives in the embedding space. CoOP uses fixed, pre-computed natural language representations for the triggers and objects while learning only a context prompt prefix to condition CLIP. In contrast, CSP learns so prompts by fine-tuning learnable tokens for triggers and objects, allowing for more adaptive reconfiguration and improved generalization to unseen trigger-object pairings.</p><p>Table <ref type="table">1</ref> demonstrates that DBOM outperforms the baseline methods across nearly all metrics. DBOM improves AUC over 53% on GTSRB and nearly 43% on CIFAR-10. Furthermore, DBOM successfully identifies over 98% of backdoor triggers on both benchmarks while classifying nearly 95% of objects in the diverse GTSRB dataset (43 classes) and over 95% on CIFAR-10 (10 classes). Importantly, the high accuracy observed for unseen trigger-object pairings indicates that our model can detect trigger-object pairings that were not encountered during training. 
DBOM not only generalizes to unseen trigger-object pairings but also accurately identifies seen triggers: the "Seen" columns in Table <ref type="table">1</ref> show over 92% and 96% accuracy on known trigger patterns. Moreover, we report the results of the smaller CLIP variants in Table <ref type="table">1</ref> and the average run-times across both datasets for each variant in Table <ref type="table">2</ref>. The ViT-B/32 and ViT-B/16 models run at an average of 2.53 and 4.27 ms/image, respectively, compared to 10.69 ms/image for ViT-L/14. Importantly, this reduction in compute does not result in a significant drop in accuracy: the ViT-B/32-based DBOM still achieves AUC scores of 85.03% on GTSRB and 84.43% on CIFAR-10, while the ViT-B/16 variant increases those figures to 87.86% and 87.37%. These findings suggest that our approach can leverage smaller CLIP backbones for real-time deployment without substantially sacrificing the high trigger-object identification performance afforded by the larger variant.</p><table><head>Table 2: Inference runtime per image on a single NVIDIA 2080 Ti GPU (batch size 64)</head><row role="label"><cell>CLIP Variant</cell><cell>Inference time (ms/img)</cell></row><row><cell>ViT-B/32</cell><cell>2.53</cell></row><row><cell>ViT-B/16</cell><cell>4.27</cell></row><row><cell>ViT-L/14</cell><cell>10.69</cell></row></table><p>Overall, DBOM's zero-shot generalization capability to novel trigger-object pairings is achieved by leveraging the disentangled representation learning approach, which factors triggers and objects into independent primitives. Although previous methods aim for similar generalization, our visual prompt repository, dynamic prefix adapter, and feature decomposition and fusion greatly improve the ability to recombine these learned representations to accurately identify novel trigger-object pairings. 
Therefore, DBOM offers robust protection against evolving backdoor attack strategies: it identifies seen configurations with high accuracy and leverages those seen pairings to identify unseen configurations, yielding an adaptive method that can evolve alongside adversarial strategies.</p></div>
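<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a brief illustration of the harmonic mean reported in this evaluation, the following minimal sketch shows why HM rewards balanced seen and unseen accuracy. It is not the evaluation code behind Table 1: the function name harmonic_mean and the sample accuracies are assumptions chosen only for this example.</p>
<p>
```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """Harmonic mean of seen and unseen accuracy (same units for both inputs)."""
    total = seen_acc + unseen_acc
    if total == 0:
        # Both accuracies are zero, so the harmonic mean is defined here as zero.
        return 0.0
    return 2.0 * seen_acc * unseen_acc / total


# Hypothetical accuracies in percent, NOT values taken from Table 1.
balanced = harmonic_mean(90.0, 90.0)  # seen and unseen accuracy agree
skewed = harmonic_mean(99.0, 50.0)    # strong on seen pairings, weak on unseen
arithmetic = (99.0 + 50.0) / 2.0      # plain average of the skewed case

print("balanced HM:", balanced)
print("skewed HM:", round(skewed, 2), "vs. arithmetic mean:", arithmetic)
```
</p>
<p>Because the harmonic mean is pulled toward the smaller of its two inputs, a model that performs well only on seen pairings scores noticeably lower than its arithmetic average would suggest.</p></div>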
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Backdoor Poison Detection Evaluation</head><p>DBOM is compared against conventional pre-training dataset cleaning approaches <ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref> by simulating a realistic scenario where the poisoning rate is set at 5%, 10%, and 15%, reflecting the poisoning ratios often encountered in web-scraped datasets. Overall accuracy (Acc.) measures the proportion of all images, both clean and poisoned, that are correctly classified. Furthermore, we report the attack recall (Rec.), indicating the percentage of poisoned images that are successfully identified. Additionally, attack precision (Prec.) measures the proportion of images flagged as attacked that are truly poisoned, and the F1 Attack score is the harmonic mean of attack precision and recall. Table <ref type="table">3</ref> summarizes the performance of DBOM relative to baseline methods. The evaluation shows that DBOM consistently achieves high overall accuracy while keeping the misclassification of clean samples to a minimum. For example, on GTSRB, DBOM achieves overall accuracies of around 98%, with an attack recall consistently exceeding 97% and F1 scores near 98% across poisoning rates of 5%-15%. Similar trends are observed on CIFAR-10, where overall accuracies are in the range of 97%-98%, and both attack recall and F1 scores remain high. Furthermore, our experimental results reveal an important trade-off between precision and recall. While methods such as Deep k-NN and HOLMES achieve near-perfect precision, they often suffer from lower attack recall (typically around 75%-80%), leading to significantly lower F1 scores. DBOM's modest decrease in precision is acceptable because missing a poisoned image can be far more harmful than incorrectly flagging a few additional clean images, especially when clean images make up the majority of the dataset. 
Lastly, unlike existing SOA methods, which focus solely on identifying whether an image is backdoored or poisoned, DBOM disentangles each image's representation into primitives to identify the trigger and the object concurrently. This enables it to detect unseen configurations that were not encountered during training, a crucial improvement over these methods.</p></div>
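<div xmlns="http://www.tei-c.org/ns/1.0"><p>The detection metrics defined in this evaluation can be sketched as follows. This is a minimal illustration under a simplifying assumption: each image is reduced to a binary poisoned/clean label and a binary flagged/not-flagged decision, so accuracy here covers only that binary decision. The function name attack_detection_metrics and the toy labels are hypothetical and do not reproduce any value in Table 3.</p>
<p>
```python
from typing import Dict, Sequence


def attack_detection_metrics(is_poisoned: Sequence[bool], flagged: Sequence[bool]) -> Dict[str, float]:
    """Accuracy, attack precision, attack recall, and attack F1 for binary flags."""
    if len(is_poisoned) != len(flagged):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(is_poisoned, flagged))
    tp = sum(1 for truth, pred in pairs if truth and pred)          # poisoned, flagged
    fp = sum(1 for truth, pred in pairs if not truth and pred)      # clean, flagged
    fn = sum(1 for truth, pred in pairs if truth and not pred)      # poisoned, missed
    tn = sum(1 for truth, pred in pairs if not truth and not pred)  # clean, kept
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy dataset (hypothetical): 20 images, 2 poisoned, i.e. a 10% poisoning rate.
truth = [True, True] + [False] * 18
# Hypothetical detector output: catches one poisoned image, misses the other,
# and wrongly flags one clean image.
flags = [True, False, True] + [False] * 17

print(attack_detection_metrics(truth, flags))
```
</p>
<p>In this toy case overall accuracy stays high because clean images dominate, while recall exposes the missed poisoned image. This is the reason recall and F1 are reported alongside accuracy.</p></div>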
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Ablation Study</head><p>Impact of &#955;_vis. We investigate the influence of the visual prompt loss weight, &#955;_vis, on DBOM's ability to disentangle trigger and object features. Recall that the visual prompt loss L_vis = L_sep + L_div enforces higher similarity for the trigger visual prompt and diversity between the trigger and object visual prompts. Note that when &#955;_vis = 0.0, the visual prompt loss is removed from the training objective and the model loses supervision to disentangle trigger and object features from the visual prompt repository, although the top two most similar prompts are still selected. The results, shown in Fig. <ref type="figure">3</ref>, reveal that at &#955;_vis = 0.0, the model achieves the lowest performance across all metrics. As &#955;_vis increases, the supervision provided by the separation and diversity losses leads to improvements in both AUC and unseen accuracy, reaching a peak at &#955;_vis = 0.5. This peak indicates that a moderate emphasis on these losses most effectively refines the latent representations, allowing the model to generalize more robustly to unseen backdoor configurations. While selecting the top two prompts from the visual repository yields acceptable performance, incorporating the explicit separation and diversity losses significantly improves overall performance across all metrics. While results on CIFAR-10 show a more stable rise and fall of seen, unseen, and AUC values, the results on GTSRB show more variation over each tested &#955;_vis value.</p><p>Learnable vs. Static Prefix. In this experiment, we replace the learnable soft prompt adapter with a static fixed prompt prefix, "a photo of," to isolate the influence of a constant prefix context on model performance. Table <ref type="table">4</ref> details the performance improvement across all metrics of the learnable prefix adapter over the fixed prefix. 
For GTSRB, the learnable prefix leads to increases of 5.07% in object classification accuracy, 3.31% in AUC, and 2.19% in seen accuracy. This improvement is especially significant for object classification, given that GTSRB has a diverse set of 43 classes, making the task more challenging. Similarly, on CIFAR-10, we see notable increases of 1.59% on unseen pairings, 1.38% for object classification, and 1.92% for AUC. The improvements can be attributed to dynamically adjusting the prefix tokens based on each input image's content, leading to better alignment between visual and textual representations and more precise detection. This improves the model's capability to distinguish between triggers and objects, especially when encountering unseen adversarial configurations.</p><p>In contrast, the bottom row depicts failure cases where the predicted objects differ from the ground truth (though the triggers are correctly identified). For instance, in the first error image, "30 km/h" is misclassified as "No Vehicles," likely due to the heavy blur on the sign. Likewise, in the second example, the model predicts "Dog" instead of "Cat," a plausible mistake given the animal's appearance. The fourth image is misjudged as a "Bird" rather than an "Airplane," suggesting that the system recognized a flying object but failed to capture its specific category. These errors can be attributed to key features, such as text or outlines, being nearly imperceptible, which makes the difference between classes difficult to discern. Overall, despite a few misclassifications caused by blurred or partially obscured features, our model successfully distinguishes a wide range of triggers and objects. This highlights its strong robustness against challenging real-world conditions, even when subtle distortions could easily mislead other systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion and Limitations</head><p>The empirical results demonstrate that DBOM not only achieves SOA performance in detecting both seen and unseen trigger-object pairings, but also maintains high overall accuracy and attack recall even at low poisoning rates. By proactively vetting training data, DBOM prevents backdoor contamination before downstream model training, reducing the need for costly post-training purification and preserving clean samples for model learning.</p><p>By separating the backdoor trigger from the underlying object semantics, DBOM not only flags poisoned images, but also recovers the correct object label despite the presence of a backdoor pattern. This has several key benefits. First, it preserves the majority of clean examples so that benign object information is retained rather than discarded, maintaining dataset diversity and reducing the risk of eliminating clean samples. Second, disentanglement yields finer-grained forensic insights into how specific triggers map onto different object categories, revealing systematic attacker strategies and enabling more targeted threat intelligence. Third, the modular nature of trigger and object primitives enables zero-shot detection of trigger-object pairings that were unseen during training, addressing a crucial limitation in conventional trigger-centric defenses. In practice, this means DBOM can adapt to evolving backdoor tactics across multiple object classes, lower false-positive rates by distinguishing benign from malicious features, and streamline training-time vetting, helping prevent data contamination at its source rather than reactively purifying a compromised model. Despite these strengths, it is important to discuss DBOM's limitations. Our design assumes that the defender maintains a library of T candidate trigger patterns, drawn from previously seen backdoor signatures. 
In our experiments, T is composed of six well-studied backdoor attacks, but the repository can be extended over time as new threats emerge by disentangling unknown triggers and adding them to the trigger repository. When novel trigger patterns are encountered in new data, we can fine-tune only the visual prompt repository and prefix adapter (rather than retraining the entire VLM backbone) on a small set of those examples, allowing DBOM to rapidly incorporate and detect new triggers with minimal overhead. Although DBOM currently focuses on triggers in T, exploring zero-shot discovery of entirely novel trigger patterns remains an important avenue for future work. Furthermore, the effectiveness of the model depends on the careful tuning of hyperparameters such as &#955;_vis, as shown in our ablation study. Moreover, DBOM currently relies on VLM encoders and therefore on the VLM's pre-trained weights. If the VLM fails to classify certain object classes or detect a trigger pattern, then both the visual prompt retrieval and the prefix-tuned text embedding can be skewed, leading to lower detection rates. Mitigating this risk in the future may require fine-tuning the VLM on more diverse, trigger-specific data, or swapping in more powerful multimodal backbones as they become available. However, in this manuscript, we showed that base CLIP models are well suited to this task.</p><p>While our experiments so far have focused on a select set of backdoor triggers, we have not yet evaluated DBOM against adversarial perturbations generated by methods like Projected Gradient Descent (PGD) <ref type="bibr">[54]</ref> or the Fast Gradient Sign Method (FGSM) <ref type="bibr">[55]</ref>. Such attacks work by distributing pixel-level noise within a perturbation budget: when the budget is very small, the changes are imperceptible but often yield lower attack success; when it is larger, the attack becomes more effective but also more noticeable to humans. 
We believe DBOM's disentangled trigger-object framework could be extended to handle perturbations with higher budgets, where the noise forms a distinct visual signature similar to the currently tested backdoor patterns and thus can cluster effectively in our visual prompt repository. In future work, we plan to explore these alternative attack types to further test DBOM's resilience. Lastly, evaluating DBOM on larger and more heterogeneous datasets and in real-world data-curation pipelines will further validate its practical utility.</p></div>
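<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the notion of a perturbation budget concrete, the following minimal sketch applies a single FGSM step to a toy logistic-regression classifier. It is purely illustrative and is not part of DBOM or of any experiment reported here: the classifier, its weights, the input values, and the function names fgsm_perturb and true_class_probability are all assumptions made for this example.</p>
<p>
```python
import math
from typing import List


def true_class_probability(x: List[float], w: List[float], b: float) -> float:
    """Probability of class 1 under a toy logistic-regression model."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))


def fgsm_perturb(x: List[float], w: List[float], b: float, y: float, eps: float) -> List[float]:
    """One FGSM step: move each pixel by eps in the direction that raises the loss."""
    # For binary cross-entropy, the loss gradient w.r.t. the input is (p - y) * w.
    err = true_class_probability(x, w, b) - y
    grad = [err * wi for wi in w]
    signs = [math.copysign(1.0, g) if g != 0 else 0.0 for g in grad]
    # Clip to the valid pixel range [0, 1]; every pixel moves by at most eps.
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, signs)]


# Hypothetical 3-pixel "image" and model parameters, with true label y = 1.
x = [0.2, 0.8, 0.5]
w = [1.5, -2.0, 0.5]
b = 0.1
for eps in (0.01, 0.1):
    x_adv = fgsm_perturb(x, w, b, 1.0, eps)
    print("eps:", eps, "true-class probability:", round(true_class_probability(x_adv, w, b), 4))
```
</p>
<p>In this toy setting, the larger budget lowers the true-class probability further at the cost of a larger per-pixel change, mirroring the trade-off between attack effectiveness and perceptibility described above.</p></div>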
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this paper, we introduced DBOM, a novel disentangled representation learning framework designed to detect both seen and unseen backdoor trigger-object pairings in training datasets. By leveraging a structured factorization of triggers and objects in the embedding space, DBOM enables robust generalization to novel backdoor configurations that evade conventional defenses. Our approach integrates a visual prompt repository and a dynamic prefix adapter to enhance the separation of adversarial triggers from underlying object representations. Experimental results demonstrate that DBOM significantly improves backdoor detection performance, outperforming SOA methods in identifying poisoned samples before they compromise downstream model training. This proactive approach not only enhances the security of DNN training pipelines but also provides deeper insights into backdoor strategies by identifying the objects associated with triggers, offering a novel method for defending against evolving backdoor threats.</p></div></body>
		</text>
</TEI>
