<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Transparent Object Tracking with Enhanced Fusion Module</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10443601</idno>
					<idno type="doi"></idno>
					<title level='j'>Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Kalyan Garigapati</author><author>Erik Blasch</author><author>Jie Wei</author><author>Haibin Ling</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[{kgarigapati, hling}@cs.stonybrook.edu erik.blasch@gmail.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Object tracking is a fundamental problem in robotics that aims to locate and identify an object in a sequence of images or videos. Researchers have dedicated much effort <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>, <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref> to addressing various challenges in object tracking, such as occlusions, fast-moving objects, and changing lighting conditions. However, tracking transparent objects remains comparatively unexplored. Transparent objects, such as those made of glass or plastic, are common in everyday life, and reliably tracking them has numerous practical applications in robotics <ref type="bibr">[5]</ref>, surveillance, and augmented reality. For example, transparent object tracking can be used in robotic medical procedures to track and visualize the movement of glass vials and syringes.</p><p>1 Stony Brook University, Stony Brook, NY, USA 2 Air Force Research Lab, Arlington, VA, USA 3 City College of New York, New York, NY, USA</p><p>Figure caption: ... algorithm with state-of-the-art trackers <ref type="bibr">[1]</ref>, <ref type="bibr">[6]</ref> on three challenging sequences from TOTB <ref type="bibr">[6]</ref>. Owing to the effective fusion technique tailored for transparency awareness, TOTEM can accurately localize transparent objects under challenging scenarios. All figures in this paper are best viewed digitally, in color, significantly zoomed in.</p><p>Though there is a pressing need to track transparent objects reliably, doing so is very challenging. These objects possess unique properties: they primarily borrow their texture from the background and are also reflective. When such an object moves, its appearance changes drastically due to background influence. These properties pose severe problems for appearance-based trackers, as they tend to extract feature information from visual cues of striking color and edge patterns. 
Generic trackers therefore tend to rely on falsely extracted background features and perform poorly on transparent objects.</p><p>In contrast to some application-specific tracking tasks such as person tracking or UAV tracking, transparent object tracking suffers from the absence of a dedicated training dataset. Consequently, end-to-end training to improve tracking performance is currently impractical. To overcome this challenge, recent research has proposed knowledge transfer techniques to imbue generic trackers with transparency awareness. Specifically, features from a backbone module trained for transparent object segmentation are fused into the tracker pipeline. The hypothesis is that such a backbone encodes transparent textures well and thus helps trackers localize transparent objects accurately.</p><p>However, while the above feature fusion approach seems promising, it is not always straightforward. Simple fusion techniques may be ineffective, because fusing features into a pipeline can disrupt the feature space and require retraining the entire model so that it learns to utilize the fused features. Retraining can be particularly challenging when labeled data is scarce. The solution in <ref type="bibr">[6]</ref> uses the ATOM <ref type="bibr">[7]</ref> and DiMP <ref type="bibr">[3]</ref> trackers, which can consume fused features without full retraining because they consist of fully online-learned modules. However, this approach may not be viable for many state-of-the-art trackers that rely on components pre-trained on large datasets.</p><p>Our proposed fusion technique selectively fuses transparency features with the original ones without disrupting the feature space, thus allowing for integration with most trackers. Our module consists of a transformer encoder block and an MLP block. The transformer block uses attention layers to fuse transparency information efficiently. 
The MLP block projects the fused features back into the original feature space. This property of our fusion module allows learned transparency priors to be integrated into many trackers. Moreover, we demonstrate that the fusion module can be trained efficiently in a two-step process. Specifically, an additional pre-training step compels the fusion module to rely exclusively on transparency features for tracking by cutting off the feed of originally extracted features to the fusion module. Further, we design a new tracker, called TOTEM (Transparent Object Tracking with feature Enhancing Module), that uses our fusion method to achieve robust performance in transparent object scenarios, as shown in Fig. <ref type="figure">1</ref>.</p><p>The contributions of this work are as follows:</p><p>&#8226; We propose a novel transparency feature fusion module for tracking transparent objects.</p><p>&#8226; We devise a novel two-step training strategy for effective learning.</p><p>&#8226; We design a new tracker architecture, TOTEM, aimed at better transparent object tracking.</p><p>&#8226; We perform extensive experiments on the transparent object tracking benchmark (TOTB) <ref type="bibr">[6]</ref>, with ablation studies that showcase the benefit of our design choices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>Transparent objects and tracking. Transparent objects present unique challenges for classification, segmentation, and tracking due to their optical properties. Previous studies <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref> have proposed handcrafted techniques that rely on reflective and refractive light properties to model transparent objects. Recently, with the progress of deep learning, algorithms that acquire complex skills by learning from large amounts of data have shown promising results. The works of <ref type="bibr">[11]</ref>, <ref type="bibr">[12]</ref> show that learnable components such as convolution-based feature extractors and transformer encoder blocks can benefit from training on labeled transparent object datasets for accurate segmentation. Similarly, <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref> learn from large amounts of data to model transparent objects. However, tracking transparent objects remains a challenge due to the scarcity of labeled datasets. To address this, a large tracking benchmark for transparent objects, named TOTB, is constructed in <ref type="bibr">[6]</ref>. The same work also proposes a transfer learning approach that introduces transparency awareness into existing generic object trackers. However, that method is only applicable to trackers with online-learned tracking modules. In contrast, our proposed fusion module does not have any restrictions on applicability. Given the recent popularity of transformers in tracking architectures <ref type="bibr">[1]</ref>, <ref type="bibr">[15]</ref>, <ref type="bibr">[16]</ref>, which are typically pre-trained models, our approach shows promise in leveraging these strong baselines. In particular, our model is built on top of TOMP <ref type="bibr">[1]</ref>, a transformer model prediction tracker.</p><p>Segmentation and Dataset. 
Research on transparent objects has gained momentum in recent years, with several datasets such as <ref type="bibr">[14]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[11]</ref> providing valuable sources for learning transparency priors for object segmentation. In this work, we leverage the pixel-level segmentation dataset <ref type="bibr">[11]</ref>, which includes annotations for five different categories of transparent objects. This dataset closely represents real-world transparent objects and provides accurate pixel-level labeling for improved localization. While the dataset from <ref type="bibr">[17]</ref> offers exhaustive labeling, it is not used in this work because its objects are synthetic and represent real-world scenarios only to a limited extent. Further, we use different portions of TOTB <ref type="bibr">[6]</ref> for training and benchmarking our tracker algorithm.</p><p>Feature fusion. Recently, more attention has been devoted to multi-modal architectures. These works mainly benefit from early fusion techniques <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref> such as concatenation <ref type="bibr">[6]</ref>, feature pruning <ref type="bibr">[20]</ref>, and re-weighting <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>. These fusion methods mainly aim at merging the information from multiple modalities and do not necessarily operate as learnable modules. More recently, more robust fusion methods have been proposed that use the transformer's attention mechanism to fuse features. For example, the works of <ref type="bibr">[23]</ref>, <ref type="bibr">[1]</ref> use transformers for fusing image features.</p><p>Our proposed fusion technique distinguishes itself from existing ones by being designed to work with pre-trained networks. Unlike existing fusion modules, which are trained as part of the end-to-end training of the network, our fusion module is trained separately to produce features that are compatible with pre-trained networks. 
To achieve this, we equip our fusion module with MLPs to project the features to the known latent space of the pre-trained network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. PROPOSED METHOD</head><p>The core idea of our proposed method is to enhance the effectiveness of generic object trackers for transparent object tracking. The TOMP framework (detailed in section III-A) serves as the baseline object tracker. Next, we describe a separate network for extracting transparency features in section III-B. Then, we present a novel fusion technique in section III-C that combines these features with the baseline object tracker to enhance its effectiveness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Baseline Tracker - TOMP</head><p>One of the robust paradigms for visual object tracking is discriminative model prediction-based target localization. Specifically, a kernel (target model) is predicted to accurately represent the appearance of the target object and is used to localize the target in subsequent frames by proposing bounding boxes. A transformer-based model predictor, TOMP <ref type="bibr">[1]</ref>, utilizes the self-attention operations between test and reference branch features to produce a kernel.</p><p>TOMP consists of a test branch and a training branch. The training branch operates on two input ground-truth/memory frames $I_{tr1}, I_{tr2} \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ indicate the image size. In the training branch, the target state information (bounding box size and position) is encoded and fused with deep image features $x_{tr1}, x_{tr2} \in \mathbb{R}^{h \times w \times c}$. Features from both the training and test branches are jointly processed in the transformer model predictor. This module contains a transformer encoder and a decoder block. The encoder produces enhanced features by reasoning across the test and training branch features. The decoder operates on the processed features from the encoder to generate the desired kernel. 
This kernel is then applied over the encoder features using separate classification and regression heads to produce the target's center response map $\hat{y} \in \mathbb{R}^{h \times w \times 1}$ and the bounding box size response map $d \in \mathbb{R}^{h \times w \times 4}$, respectively. Here, $d$ represents the offsets from the predicted center point to the sides of the bounding box, encoded as (left, top, right, bottom) adjustments.</p><p>The entire network is trained end-to-end by minimizing a classification loss $L_1$ and a regression loss $L_2$ computed over randomly selected frame triplets $\langle I_{tr1}, I_{tr2}, I_{te} \rangle$ from a video sequence. We propose to utilize this architecture for the problem of transparent object tracking and make further modifications (Fig. <ref type="figure">2</ref>) to improve its performance.</p></div>
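<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the role of the two response maps concrete, the following minimal sketch shows one way a box could be read out from a center response map and a map of (left, top, right, bottom) offsets. It is an illustration under stated assumptions, not the TOMP implementation: the function name decode_box, the stride of 16, the cell-center convention, and the choice to express offsets in feature-map cells are all hypothetical.</p>
<p><![CDATA[
```python
def decode_box(score_map, offset_map, stride=16):
    """Pick the peak of the center response map and turn its offsets into a box.

    score_map:  h x w nested list of center scores (the center response map).
    offset_map: h x w nested list of (left, top, right, bottom) offsets,
                assumed here to be expressed in feature-map cells.
    stride:     assumed factor that maps feature-map cells to image pixels.
    """
    h, w = len(score_map), len(score_map[0])
    # Arg-max over all cells of the center response map.
    i, j = max(((r, c) for r in range(h) for c in range(w)),
               key=lambda rc: score_map[rc[0]][rc[1]])
    left, top, right, bottom = offset_map[i][j]
    # Assumed convention: cell centers sit at (index + 0.5) in feature-map units.
    cx, cy = j + 0.5, i + 0.5
    x0, y0 = (cx - left) * stride, (cy - top) * stride
    x1, y1 = (cx + right) * stride, (cy + bottom) * stride
    return x0, y0, x1, y1


# Toy 3 x 3 example with the peak at row 1, column 2.
score = [[0.1, 0.2, 0.1], [0.0, 0.3, 0.9], [0.1, 0.1, 0.2]]
offsets = [[(0.0, 0.0, 0.0, 0.0)] * 3 for _ in range(3)]
offsets[1][2] = (1.0, 0.5, 1.0, 0.5)
box = decode_box(score, offsets)
```
]]></p>
<p>In this toy example the peak cell has its center at (2.5, 1.5) in feature-map units, so the decoded corners are (24.0, 16.0) and (56.0, 32.0) in pixels.</p></div>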
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Transparency Feature Extraction</head><p>As discussed earlier, one of the challenges trackers face with transparent objects is that transparency visually distorts the target's appearance cues. To overcome this issue, trackers need to gain an understanding of the texture of transparent objects. Specifically, they should be able to abstract out the texture of a transparent object from its background and use this knowledge to localize the same object (even when it appears against a different background). A typical opaque object tracker does not acquire this ability during training. So in our work, we adopt a backbone that is trained to extract transparency features. One simple way to give our tracker transparency awareness would be to train it on transparent object video sequences. Since we lack annotated video sequences, we instead incorporate transparency awareness by transferring it from another model (trained for a different objective).</p><p>We propose to use a separate backbone network (as shown in Fig. <ref type="figure">2</ref>) that has the ability to understand the transparent object's texture (caused by refraction, reflection, and translucence). Motivated by the transfer learning approach in <ref type="bibr">[6]</ref>, we adopt a similar approach of using the feature extractor from a segmentation network. In particular, we use the segmentation network Trans2Seg <ref type="bibr">[11]</ref>, which is trained to segment and classify pixels belonging to transparent objects. 
Hypothesizing that such a segmentation network must learn, as an intermediate step, to encode the patterns arising from transparency properties like reflectivity, refractivity, and translucence, we propose to use the feature extractor and encoder part of Trans2Seg <ref type="bibr">[11]</ref> to produce transparency features.</p><p>Trans2Seg consists of a convolution-based backbone, a transformer encoder module, a transformer decoder module, and a segmentation head, connected sequentially in this order.</p><p>Under end-to-end training, the backbone and transformer encoder are driven to produce features that encode the unique properties of transparent objects. The decoder and the segmentation head learn specific priors to categorize the transparent objects. Since we are mainly interested in image encodings, we adopt the backbone and encoder module of Trans2Seg as the transparency feature extractor for our tracker.</p><p>The backbone module takes an input image $I \in \mathbb{R}^{H \times W \times 3}$ and produces a feature map $\tilde{x} \in \mathbb{R}^{h \times w \times c}$, where $H$ and $W$ are the image height and width, respectively, and $h$, $w$, and $c$ are the height, width, and number of channels of the produced feature map. Further, the transformer encoder operates on the input feature $\tilde{x}$ and produces a globally attended and enriched feature map $x'$, which has the same shape as $\tilde{x}$. We refer the reader to <ref type="bibr">[11]</ref> for more details of this module.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Fusing the Transparency Features</head><p>Why fusion. One way to utilize transparency features for tracking is to replace the tracker's backbone directly with the above transparency feature extractor. But this may hurt the tracking performance, because the transparency backbone is trained for a less related objective and thus may not extract features specific to the tracking problem. For example, a motion-blur-affected object is never encountered when training the Trans2Seg network, whereas correctly extracting motion features is critical for tracking. So, we adopt a fusion-based approach to take advantage of the transparency features while still retaining essential cues for tracking. This way, the tracker also learns to selectively take in the useful encodings of the input image detail.</p><p>However, there are certain challenges to using transparency features in the above-discussed transformer model predictor architecture. Firstly, all the components in this tracker are learned offline, meaning that any change in architecture that modifies the intermediate feature space must be accompanied by offline re-training. The perk of direct inference without training after feature fusion, as observed in <ref type="bibr">[6]</ref>, does not exist with the selected baseline TOMP. Further, we do not have a large-scale training dataset consisting of transparent object video sequences. So we must adopt a simple fusion mechanism that does not require full-scale retraining from scratch.</p><p>We found that it is best to fuse the transparency features into the TOMP pipeline just before the transformer model predictor block. This way, we can leverage the strong local and global reasoning provided by the transformer encoder-decoder module over the transparency features.</p><p>The fusion module (depicted in Fig. 
<ref type="figure">2</ref>) is designed taking into account the following constraints:</p><p>- The end-to-end model, after the transparency feature fusion, should not require re-training over large datasets, given their lack of availability.</p><p>- Fusion of transparency features should not regress the tracker's performance on transparent object tracking.</p><p>- The module should be lightweight, both in the number of learnable parameters and in the number of computations.</p><p>To be able to reuse most of the learned modules, we designed our fusion module to be trained without having to re-train the existing components of TOMP. While this design choice helps with the above constraints, it poses certain challenges. The TOMP model predictor consists entirely of learned parameters, and it expects the input features to belong to a specific feature space. The feature space refers to the mapping between each channel in the feature vector and the set of specific patterns that activate that channel's response. Most machine-learned components are sensitive to the feature space of their input. For example, we cannot simply replace the backbone network of a classification model with a better feature extractor and expect a performance improvement. At the least, the classification heads have to be re-trained before the model can produce any meaningful output.</p><p>For the same reason, we cannot simply concatenate the transparency features with the features extracted by the TOMP backbone to achieve a performance improvement. In fact, this causes the network to lose performance, because the transparency features act as unexpected perturbations (noise) to the model predictor. So, we propose a feature fusion module and a training strategy that produce enriched features by combining useful cues from each source. Because of the training objective, the module produces a fused feature that aligns with the feature space of the original TOMP backbone.</p><p>Fusion Module. 
Our transformer-based feature fusion module sits between the backbone and transformer encoder stages of the TOMP pipeline and fuses the features $x \in \mathbb{R}^{h \times w \times c}$ and $x' \in \mathbb{R}^{h \times w \times c}$ into a new feature $x'' \in \mathbb{R}^{h \times w \times c}$. This module is designed to operate pixel-wise rather than to use global context information. So, the fusion occurs between the corresponding feature vectors $x_{\langle i,j \rangle} \in \mathbb{R}^{c}$ and $x'_{\langle i,j \rangle} \in \mathbb{R}^{c}$ at every pixel position $\langle i,j \rangle \in [0,h) \times [0,w)$. Note that attention operations do not occur across spatial locations.</p><p>The module consists of two main components: 1) a Transformer Encoder and 2) a Fully Connected Projection module.</p><p>1) Transformer Encoder: The Transformer Encoder fuses the vectors $x_{\langle i,j \rangle}$ and $x'_{\langle i,j \rangle}$ by transforming a query embedding $e_{query}$ into an intermediate feature representation $f_{interim}$ (shown in eq. 1 and 2). Inspired by the architecture described in <ref type="bibr">[1]</ref>, <ref type="bibr">[24]</ref>, we designed this module $T_{enc}$ with multiple encoder layers. But different from <ref type="bibr">[24]</ref>, we do not use a 1 &#215; 1 convolutional layer to project the features into a smaller dimension, as this would throw away important detail. Also, we do not add any positional embeddings, as no spatial information needs to be preserved. Each encoder layer follows the standard architecture and consists of a multi-head self-attention module and a feed-forward network. In the next section, we perform experiments exploring the effect of using a query embedding versus using one of the transformed input features. 
$$z = \mathrm{concat}(x_{\langle i,j \rangle}, x'_{\langle i,j \rangle}, e_{query}) \in \mathbb{R}^{3 \times c} \qquad (1)$$</p><p>2) Fully Connected Projection Module: Additionally, hypothesizing that a separate module &#981; is needed to project the fused features $f_{interim}$ onto the latent space on which TOMP's model predictor operates, we employ a two-layer fully connected neural network (see eq. 3). Further, to match the distribution of feature activations across spatial and channel dimensions between the newly projected feature and the original feature $x$, we add an instance normalization layer at the end of the module &#981;.</p></div>
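<div xmlns="http://www.tei-c.org/ns/1.0"><p>The per-pixel data flow described above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not the module used in TOTEM: it uses one single-head attention layer instead of a multi-layer, multi-head encoder, omits residual connections, layer normalization, the feed-forward sublayer, and the final instance normalization, and draws random weights. It also assumes that the intermediate feature is read from the output at the query token position, which the text does not state explicitly. All function and parameter names are hypothetical.</p>
<p><![CDATA[
```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) nested-list matrix with a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token list."""
    q = [matvec(wq, t) for t in tokens]
    k = [matvec(wk, t) for t in tokens]
    v = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(wt * vj[d] for wt, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def fuse_pixel(x, x_t, e_query, params):
    """Fuse the two feature vectors of one pixel into one vector of size c."""
    z = [x, x_t, e_query]                    # Eq. (1): three tokens of size c
    attended = self_attention(z, *params["attn"])
    f_interim = attended[2]                  # assumption: read the query slot
    hidden = [max(0.0, u) for u in matvec(params["w1"], f_interim)]  # ReLU
    return matvec(params["w2"], hidden)      # second projection layer


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


c = 4                                        # toy channel count (256 in the paper)
rng = random.Random(0)
params = {"attn": [random_matrix(c, c, rng) for _ in range(3)],
          "w1": random_matrix(c, c, rng),
          "w2": random_matrix(c, c, rng)}
x_fused = fuse_pixel([0.1] * c, [0.2] * c, [0.0] * c, params)
```
]]></p>
<p>Because attention runs only over the three tokens of a single pixel, the same function can be applied independently at every spatial position, which matches the statement that attention does not occur across spatial locations.</p></div>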
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Tracker Pipeline</head><p>The end-to-end structure of the proposed model TOTEM is illustrated in Fig. <ref type="figure">2</ref>. First, we use the original backbone network to extract the deep image features of both the test and training branch input frames. In parallel, we also extract the transparency features for the same set of input frames. The two feature sets are then fused, independently for each frame, using the proposed fusion module. The fused features in the training branch are further combined with target state encodings. The features from both branches are flattened and concatenated to form a sequence of feature vectors. This sequence is then processed by the transformer encoder to produce enhanced features by reasoning globally across frames. Next, the transformer decoder predicts the target model weights from the output of the transformer encoder. Finally, the predicted model is applied to the test branch features output by the transformer encoder to localize the target.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Training</head><p>All the components belonging to the TOMP, and the Trans2Seg model, are initialized with pre-trained weights while the weights of the fusion module are initialized randomly following a Xavier initialization <ref type="bibr">[25]</ref>. The training is performed over two steps where only the fusion module weights are updated with each back-propagation. In the first step (illustrated in Fig. <ref type="figure">3</ref>), TOTEM does not use the features extracted by the original backbone. Instead, the fusion module only uses transparency features to produce a compatible output. In the first step configuration, the fusion module learns to use the transparency features (belonging to an unrelated feature space) with TOMP's model predictor. The second step follows the usual setting where both the original and transparency features are input to the fusion module.</p><p>Empirically this two-step approach showed better performance compared to the usual training approach. We hypothesize that this approach works effectively because, without the first step, the fusion module learns to over-rely on the original features which are already in the feature space the module is learning to project into. So, by forcing the module to learn to solely use the transparency features, we encourage the fusion module to first learn to recognize transparency features. Then during the second step, it leverages the learned priors from step one to effectively fuse transparency features into the pipeline.</p></div>
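<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-step schedule can be summarized by which inputs reach the fusion module and which parameters receive gradient updates. The sketch below is a hypothetical illustration: the text only says that the feed of the original features is cut off in the first step, so dropping the token (rather than, for example, zeroing it) is an assumption, as are the names fusion_inputs and trainable_parameters and the dictionary layout of the modules.</p>
<p><![CDATA[
```python
def fusion_inputs(x, x_t, e_query, step):
    """Return the token list fed to the fusion module in a training step.

    step 1: the original backbone feature x is cut off (dropped here, which
            is an assumption), so only transparency cues remain.
    step 2: both feature sources are provided, as in the usual setting.
    """
    if step == 1:
        return [x_t, e_query]
    if step == 2:
        return [x, x_t, e_query]
    raise ValueError("step must be 1 or 2")


def trainable_parameters(modules):
    """Select only the fusion-module parameters; all other modules stay frozen."""
    return [p for name, params in modules.items() if name == "fusion"
            for p in params]


# Toy stand-ins for parameter collections of the three sub-networks.
modules = {"tomp": ["t0", "t1"], "trans2seg": ["s0"], "fusion": ["f0", "f1"]}
to_update = trainable_parameters(modules)
step1_tokens = fusion_inputs([1.0], [2.0], [0.0], step=1)
step2_tokens = fusion_inputs([1.0], [2.0], [0.0], step=2)
```
]]></p>
<p>In both steps the same parameter subset is updated; only the inputs to the fusion module differ between the steps.</p></div>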
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. EXPERIMENTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Implementation and Setup</head><p>Datasets. For evaluation, we mainly use the Transparent Object Tracking Benchmark (TOTB) <ref type="bibr">[6]</ref> dataset. This dataset comprises 15 common transparent object classes, each containing 15 sequences (225 in total). Given the lack of any labeled training datasets, we split TOTB into two sections: a small section comprising 45 video sequences belonging to 3 object classes (Beaker, GlassBall, and WubbleBubble) is used for training, while the remaining 180 sequences (belonging to the other 12 object classes) are used for testing. Since the fusion module is lightweight, with only around 5 million learnable parameters, it can be trained from very little data. The training partition of the LaSOT <ref type="bibr">[26]</ref> dataset is used in addition to the above-defined TOTB training partition.</p><p>Training. We train our tracker on the above-mentioned splits of the TOTB and LaSOT datasets for 25 epochs, with 4000 image triplets sampled in every epoch. We set the batch size to 18 and use the AdamW <ref type="bibr">[27]</ref> optimizer with a learning rate of 0.0001. The proposed fusion module uses a 4-layer transformer encoder and operates on 256-dim feature vectors. The training is performed in two steps, as described in section III-E. We used 12 GB TITAN Xp GPUs to train and test our model. Our model TOTEM runs at 6 FPS during inference.</p></div>
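<div xmlns="http://www.tei-c.org/ns/1.0"><p>The split and schedule arithmetic quoted above can be checked with a short script. Only the three training class names come from the text; the names of the 12 held-out classes and the sequence identifiers below are placeholders invented for illustration, and the iterations-per-epoch figure assumes that a final partial batch is kept.</p>
<p><![CDATA[
```python
import math

TRAIN_CLASSES = ("Beaker", "GlassBall", "WubbleBubble")  # named in the text
SEQUENCES_PER_CLASS = 15

# Placeholder names for the 12 held-out classes (hypothetical).
OTHER_CLASSES = tuple("HeldOutClass%02d" % k for k in range(1, 13))


def split_totb(classes, train_classes, per_class):
    """Class-disjoint split: all sequences of a training class go to train."""
    train, test = [], []
    for name in classes:
        bucket = train if name in train_classes else test
        # Sequence identifiers are placeholders, not the benchmark's names.
        bucket.extend("%s-%d" % (name, idx) for idx in range(1, per_class + 1))
    return train, test


train_seqs, test_seqs = split_totb(TRAIN_CLASSES + OTHER_CLASSES,
                                   TRAIN_CLASSES, SEQUENCES_PER_CLASS)

# Training schedule quoted in the text.
EPOCHS, TRIPLETS_PER_EPOCH, BATCH_SIZE = 25, 4000, 18
# Assumes the last, partially filled batch of each epoch is kept.
iters_per_epoch = math.ceil(TRIPLETS_PER_EPOCH / BATCH_SIZE)
total_triplets = EPOCHS * TRIPLETS_PER_EPOCH
```
]]></p>
<p>With these numbers, the split yields 45 training and 180 test sequences, and the schedule samples 100000 triplets in total over roughly 223 iterations per epoch.</p></div>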
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Comparison study</head><p>To establish baselines for comparison, we employ the recent transparent object tracker TransATOM <ref type="bibr">[6]</ref> and its base ATOM <ref type="bibr">[7]</ref>. In addition, we also compare against TOTEM's base tracker TOMP <ref type="bibr">[1]</ref>. To ensure a fair comparison, all models are fine-tuned end-to-end on the same training dataset. Note that TOTEM is also fine-tuned end-to-end for this experiment.</p><p>We report our results using the success (SUC), normalized precision (NPRE), and precision (PRE) metrics. The SUC plot in Fig. <ref type="figure">4a</ref> displays the overlap precision OP as a function of the threshold Th. The NPRE and PRE plots (see Fig. <ref type="figure">4b</ref>) are reported as well. The trackers are ranked according to their area-under-the-curve (AUC) score for each plot, which is presented in the legend. Our proposed TOTEM tracker outperforms the previous state-of-the-art TransATOM tracker by a significant margin of 13.4% in terms of Success AUC. Importantly, our proposed tracker outperforms its baseline TOMP by 3.3%, thanks to the transparency cues incorporated by our fusion module.</p></div>
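<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the success metric, the following sketch shows a common way to compute the overlap precision curve and its area under the curve from per-frame boxes. It is not the benchmark's evaluation toolkit: the (x, y, w, h) box format, the 21 evenly spaced thresholds, the strict comparison, and the use of the curve mean as the AUC are assumptions made for illustration.</p>
<p><![CDATA[
```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_curve(pred, gt, num_thresholds=21):
    """Overlap precision: fraction of frames whose IoU exceeds each threshold."""
    thresholds = [t / (num_thresholds - 1) for t in range(num_thresholds)]
    overlaps = [iou(p, g) for p, g in zip(pred, gt)]
    curve = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return thresholds, curve


def success_auc(pred, gt):
    """Area under the success curve, approximated by the mean of the curve."""
    _, curve = success_curve(pred, gt)
    return sum(curve) / len(curve)


# Two-frame toy sequence: one perfect prediction and one complete miss.
gt_boxes = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
pred_boxes = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 2.0, 2.0)]
auc = success_auc(pred_boxes, gt_boxes)
```
]]></p>
<p>Under these conventions a frame with perfect overlap counts as a success at every threshold below 1, so the toy sequence above scores half of the maximum attainable value of 20/21.</p></div>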
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Attribute analysis</head><p>In order to analyze the performance of our algorithm on specific tracking challenges, we evaluate our tracker TOTEM under 12 different attributes. We explore the performance gain specifically due to the addition of transparency feature fusion by comparing TOTEM against the baseline TOMP. We also include the evaluations of ATOM vs TransATOM in this section, so that we can compare the performance gains due to transparency feature infusion in our work against those in <ref type="bibr">[6]</ref>.</p><p>Both baselines, TOMP and ATOM, are directly adapted from their respective works <ref type="bibr">[1]</ref> and <ref type="bibr">[7]</ref> without any modifications, whereas TransATOM and TOTEM follow the same training settings as described in section IV-B.</p><p>Tab. I lists the comparison results for all 12 attributes using the success AUC metric. We observe that TOTEM performs best on 10 out of 12 attributes. TOTEM shows a major improvement on the Illumination Variation, Deformation, Aspect Ratio Change, and Low Resolution attributes (see Fig. <ref type="figure">5a</ref>, 5b and 5c), where it reaches Success AUC scores of 82.2%, 79.2%, 73.5%, and 74.0%, outperforming its baseline by 7.1%, 13%, 6.4%, and 10.9%, respectively. This large improvement in tracking accuracy can be directly attributed to the use of transparency features in the pipeline. Deformation and aspect ratio changes are a result of variations in the target object's shape. Such variations are hard to handle if a tracker cannot fully understand the target's appearance. For example, a backbone network that does not understand a target might encode two variant poses of it into embeddings that do not relate well. Such inefficiency in the backbone can further cause the model prediction module to perform poorly at generating kernel weights that produce accurate localization. 
In the case of the TOTEM tracker, the Trans2Seg model is extensively trained to understand transparent objects and thus has the ability to extract relevant features. For example, it might produce embeddings that are invariant to background patterns, given that such a property benefits the network when performing segmentation tasks on transparent objects. Fusing transparency features into our baseline tracker's pipeline therefore directly helps with better localization. In our case, the transparency features helped the model perform better in appearance-varying situations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Ablation Study</head><p>Our tracker TOTEM benefits from three main components. First, we use TOMP as the baseline, which has a significant performance advantage over the other baselines (ATOM and DiMP, for example). Second, we utilize the Trans2Seg backbone along with its encoder to extract transparency features. Third, our proposed fusion module combines the transparency features into our baseline tracker's pipeline. In this section, we ablate each component and show that the design helps improve accuracy. We additionally evaluate our two-step training strategy against other methods.</p><p>Baseline. We evaluate our baseline model TOMP against the baselines of the other transparent object trackers, as shown in Tab. II. All the trackers follow their original configuration and are not pre-trained on TOTB. This analysis isolates the portion of the improvement that we gain solely by using a transformer-based model predictor, independent of other factors. We observe that TOMP outperforms ATOM and DiMP by 11.4% and 13.5% in success AUC scores, respectively. TOMP provides a better starting point, which by itself surpasses the previous state-of-the-art tracker TransATOM by a margin of 8.6%.</p><p>Transparency Features. In this subsection, we ablate the components of Trans2Seg from TOTEM to analyze the benefit due to transparency features in our pipeline.</p><p>We created a new tracker model, TOTEM-T, to enable a fair ablation study of transparency features. TOTEM-</p><p>Further, we ablate components within the fusion module. We first investigate the benefit of having a learnable query embedding $e_{query}$ in the fusion stream by comparing TOTEM with an ablated variant, TOTEM-$e_{query}$, that lacks a learnable feature in its fusion input. Tab. IV shows that TOTEM-$e_{query}$ has a slight performance drop of 0.5% and 1.0% in the SUC and NPRE metrics, respectively, while showing only a 0.2% improvement in the PRE metric. 
Overall, including $e_{query}$ yields a slight improvement. Given that $e_{query}$ is only a 256-dimensional floating-point vector and adds comparatively little computational overhead, the design choice of including it is beneficial.</p><p>We also ablate the MLP module &#981; that projects the fused features into the encoder input space. In this test, we create a variant TOTEM-&#981; by removing the MLP. In Tab. IV, this variant shows a significant performance drop of 3.9% in the SUC metric compared to TOTEM, indicating that the MLP projection module is crucial to the performance of fusion.</p><p>Two-step training approach. Along with the new fusion module, we proposed a two-step approach for training it, reasoning that it helps the module use transparency features well. In Tab. V, we compare the two-step approach with simple one-step training. Here, we observe that the two-step approach outperforms the simple method by 1.1% in the SUC metric. We also notice 1.8% and 0.5% gains with our approach in the PRE and NPRE metrics, respectively.</p><p>Additionally, we demonstrate the effectiveness of end-to-end fine-tuning when it is performed in addition to two-step training. Here, we fine-tune our entire tracker instead of only updating the fusion module's weights. With this extra tuning, TOTEM gains 4.3% in the SUC metric. Interestingly, comparing the SUC scores of TOMP in Fig. <ref type="figure">4a</ref> and Tab. II, we observe a gain of only 2.3% for the baseline. This further shows the benefit of fusing the transparency features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION</head><p>In this work, we explored the important yet under-explored problem of transparent object tracking. We proposed a novel tracker architecture named TOTEM, which benefits from understanding the unique texture properties of transparent objects. In particular, we successfully transferred the information learned from transparent object segmentation to tracking by using the pre-trained Trans2Seg model (a segmentation network) to aid our tracker with extra transparency cues. In addition, we presented a new fusion module that learns to fuse features from different streams and projects them to the feature space of the original stream. Due to this projection property, our module can be added to or removed from the tracker pipeline without retraining the network. Further, we explored a new training strategy, two-step training, that explicitly improves the fusion performance of our proposed module. Comprehensive experiments show that TOTEM considerably outperforms the previous state-of-the-art and its baseline. Our ablation studies show that each design choice we made toward TOTEM contributes positively to its performance.</p><p>Future Work. The fusion module combined with our two-step training strategy shows promising performance gains. In the future, we would extend the module to aid generic trackers in gaining application-specific skills. For example, camouflaged object tracking could be made possible without explicit training data with the help of our fusion techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGEMENT</head></div></body>
		</text>
</TEI>
