<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Fast Vehicle Identification in Surveillance via Ranked Semantic Sampling Based Embedding</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018 July</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10074628</idno>
					<idno type="doi">10.24963/ijcai.2018/514</idno>
					<title level='j'>27th International Joint Conference on Artificial Intelligence (IJCAI 2018)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Feng Zheng</author><author>Xin Miao</author><author>Heng Huang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Identifying vehicles across cameras in traffic surveillance is fundamentally important for public safety purposes. However, despite some preliminary work, the rapid vehicle search in large-scale datasets has not been investigated. Moreover, modelling a view-invariant similarity between vehicle images from different views is still highly challenging. To address the problems, in this paper, we propose a Ranked Semantic Sampling (RSS) guided binary embedding method for fast cross-view vehicle Re-IDentification (Re-ID). The search can be conducted by efficiently computing similarities in the projected space. Unlike previous methods using random sampling, we design tree-structured attributes to guide the mini-batch sampling. The ranked pairs of hard samples in the mini-batch can improve the convergence of optimization. By minimizing a novel ranked semantic distance loss defined according to the structure, the learned Hamming distance is view-invariant, which enables cross-view Re-ID. The experimental results demonstrate that RSS outperforms the state-of-the-art approaches and the learned embedding from one dataset can be transferred to achieve the task of vehicle Re-ID on another dataset.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Vehicle re-identification (Re-ID) aims to determine whether a pair of vehicle images collected under different conditions (sensors, views or environments) belong to the same object (identity). In recent years, a few initial attempts have been made <ref type="bibr">[Zapletal and Herout, 2016;</ref><ref type="bibr">Liu et al., 2016b;</ref><ref type="bibr">Shen et al., 2017;</ref><ref type="bibr">Wang et al., 2017]</ref>. However, most existing methods work in Euclidean space, where computing similarities is expensive, especially because a large-scale gallery is inevitable in a traffic surveillance system <ref type="bibr">[Zheng and Shao, 2016]</ref>. Secondly, most methods either consider the task of cross-camera Re-ID <ref type="bibr">[Liu et al., 2016b]</ref> (treating all the views equally) or focus solely on one view. For example, <ref type="bibr">[Liu et al., 2016a]</ref> mainly investigates the front view. Arguably, the most difficult task of vehicle Re-ID is the cross-view setting, such as matching a side view to a front view.</p><p>To this end, we focus on learning a binary embedding to tackle the challenging task of fast cross-view vehicle Re-ID (see Fig. <ref type="figure">1</ref>). Given the success of deep learning <ref type="bibr">[Szegedy et al., 2015]</ref>, we adopt a deep architecture as the embedding function. Generally, the task can be tackled by embedding images from different views into a common code space. Given a sample, the sample of the same identity is the one with the minimum Hamming distance to it in the learned space.
However, most existing deep embedding methods are insufficient for cross-view Re-ID, partially because they are specifically designed for recognition and categorization.</p><p>To address the above problems, we propose a Ranked Semantic Sampling (RSS) guided binary embedding for fast vehicle Re-ID. In this method, tree-structured attributes are first constructed according to the semantic hierarchies to define the semantic distance. Due to the view-invariant properties of attributes <ref type="bibr">[Frome et al., 2013;</ref><ref type="bibr">Amid and Ukkonen, 2015]</ref>, the relative semantic distance is also view-invariant. Then, to improve the convergence of the SGD optimizer, we use the attribute tree to guide the mini-batch sampling, in which the samples are ranked according to the relative semantic distance. Owing to the ranked samples, more relative relationships can be exploited, which reduces how often samples must be accessed. Furthermore, a probability inequality is derived to convert the discrete optimization into a smooth problem to which the SGD optimizer can be applied safely. The theoretical analysis guarantees that the learned Hamming distance directly preserves the relative semantic distance. Consequently, the proposed RSS makes it possible to measure cross-view similarities effectively and to search for matched samples efficiently in a cross-view setting.</p><p>In summary, our main contributions are four-fold: 1) We propose a novel deep binary embedding model which enables fast cross-view vehicle Re-ID. 2) The ranked semantic distance is preserved so that the learned distance is view-invariant (shown in Fig. <ref type="figure">1</ref>). 3) Instead of employing random sampling as existing methods do, we introduce a ranked semantic distance guided sampling method to improve the convergence and reduce the frequency of accessing samples.
4) A probability inequality guarantees the transfer from a discrete problem to a smooth objective to which SGD can be applied.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Vehicle Re-ID: Recently, a few initial attempts have been made, including a linear regression model <ref type="bibr">[Zapletal and Herout, 2016]</ref>, a coarse-to-fine framework <ref type="bibr">[Liu et al., 2016b]</ref>, a two-branch deep convolutional network <ref type="bibr">[Liu et al., 2016a]</ref>, orientation invariant features <ref type="bibr">[Wang et al., 2017]</ref> and visual-spatiotemporal path proposals <ref type="bibr">[Shen et al., 2017]</ref>. Moreover, two recent works <ref type="bibr">[Zheng and Shao, 2016;</ref><ref type="bibr">Zheng et al., 2016]</ref> focus on improving the efficiency of person Re-ID.</p><p>Attribute Learning: Recent works <ref type="bibr">[Ferrari and Zisserman, 2007;</ref><ref type="bibr">Hwang and Sigal, 2014]</ref> explicitly demonstrate that attributes benefit various computer vision tasks. In <ref type="bibr">[Frome et al., 2013;</ref><ref type="bibr">Hwang and Sigal, 2014]</ref>, the semantic knowledge learned in the text domain is transferred to train a model for visual object recognition. In [Amid and <ref type="bibr">Ukkonen, 2015]</ref>, a multi-view triplet embedding is proposed to produce a number of low-dimensional maps, each corresponding to one of the attributes. The method of <ref type="bibr">[Kukliasnky and Shamir, 2015]</ref> chooses and observes only a small subset of the attributes of each training example.</p><p>Relative Distance Loss: Earlier, distance-based losses, including the contrastive loss <ref type="bibr">[Hadsell et al., 2006]</ref> and Kullback-Leibler divergences over all data points [van der <ref type="bibr">Maaten and Hinton, 2008</ref>], were used for dimensionality reduction and visualization. Beyond pair-wise constraints, various contrastive embedding methods based on triplets <ref type="bibr">[Schroff et al., 2015]</ref> and quadruplets <ref type="bibr">[Song et al., 2016]</ref>
have recently been proposed to capture high-order relative distances. To explore more contrastive information in mini-batches, the (N + 1)-tuplet loss <ref type="bibr">[Sohn, 2016]</ref> and the histogram loss [Ustinova and <ref type="bibr">Lempitsky, 2016]</ref> have also been proposed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Proposed Method</head><p>To learn the optimal binary embedding, two questions arise naturally: Q1) How can learning be made to converge faster? Q2) What types of relationships (similarities) need to be preserved in the learned space?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Cross-View Binary Embedding</head><p>Given a sample $x_i^v \in X^v$ of the vth view, we assume that a vector $a_i^v \in A^v$, with $a_i^v \in \mathbb{R}^{N_a}$, describes its semantic attributes, where $V$ and $N_a$ are the numbers of views and attributes, respectively, and $X^v$ and $A^v$ are the sample set and attribute set of the vth view, respectively. In our setup, two samples $x_i^u$ and $x_j^v$ belong to the same object (identity) only if all the corresponding items in the two attribute vectors $a_i^u$ and $a_j^v$ are the same. The basic requirement for the embedding is that samples collected from any view are projected onto similar binary codes if they have similar attributes. Assume that $F^v$ is a hash function from a hypothesis space; then the binary code of $x_i^v$ can be obtained by using</p><formula notation="tex">y_i^v = \operatorname{sgn}\big(F^v(x_i^v)\big) \in \{-1, 1\}^K,</formula><p>where $K$ is the number of bits in the binary code. Obviously, the ideal objective is that, $\forall u, v$, if $a_i^u \equiv a_j^v$, then we have $y_i^u \equiv y_j^v$ and vice versa<ref type="foot">foot_0</ref>. Once the hash functions $\{F^v\}_{v=1}^{V}$ are learned, given a query $x_i^u$ of the uth view in the test stage, we can retrieve the samples of the same object collected from the vth view by ranking the Hamming distances $D_h(y_i^u, y_j^v)$ between their binary codes.</p><p>For simplicity, we denote $F^u(x_i^u)$ as $F(x_i^u)$. Therefore, the basic goal of this paper is to learn a set of hash functions $\{F^v\}_{v=1}^{V}$, one for each view, to achieve cross-view ranking. The architecture (hypothesis space) shown in Fig. <ref type="figure">2</ref> (a) has a shared deep architecture and V separate fully connected layers, which together serve as the hash functions.</p></div>
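<div xmlns="http://www.tei-c.org/ns/1.0"><p>The test-stage search described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (binarize, hamming_distance, rank_gallery) and the toy embedding values are hypothetical, and mapping a zero entry to +1 is an arbitrary choice.</p>
<p><![CDATA[
```python
from typing import List, Sequence


def binarize(embedding: Sequence[float]) -> List[int]:
    """Map a real-valued embedding to a code in {-1, +1}^K.

    Assumption: an entry equal to zero is mapped to +1.
    """
    return [1 if value >= 0 else -1 for value in embedding]


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Count the positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def rank_gallery(probe: Sequence[int], gallery: Sequence[Sequence[int]]) -> List[int]:
    """Return gallery indices sorted by increasing Hamming distance to the probe."""
    return sorted(range(len(gallery)), key=lambda j: hamming_distance(probe, gallery[j]))


# Hypothetical embeddings: one probe from view u, three gallery samples from view v.
probe_code = binarize([0.7, -1.2, 0.1, -0.4])
gallery_codes = [
    binarize([-0.5, 0.9, -0.3, 0.8]),
    binarize([0.6, -0.8, 0.4, -0.1]),
    binarize([0.2, -0.3, -0.9, -0.6]),
]
ranking = rank_gallery(probe_code, gallery_codes)
```
]]></p>
<p>The first index in ranking is the gallery sample whose code is closest to the probe, i.e. the candidate match of the same identity.</p></div>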
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Ranked Semantic Sampling</head><p>To ensure that the learned binary code can represent a large number of relationships between semantic entities, a distance measure $D_s(a_i^u, a_j^v)$ needs to be established to describe the difference between the attribute vectors $a_i^u$ and $a_j^v$ of any two samples.</p><p>Tree-structured attributes: Generally, we can create structures of attributes with the help of WordNet <ref type="bibr">[Fellbaum, 2000]</ref>, which provides the semantic hierarchy of nouns. When building a large-scale dataset such as ImageNet <ref type="bibr">[Deng et al., 2009]</ref>, we can also learn attributes from the data. To accomplish the task of cross-view Re-ID, we construct an attribute tree as shown in Fig. <ref type="figure">2</ref> (b) based on the semantic hierarchy and working scope of attributes. $a(1)$ represents the attribute of a leaf node (lowest level) in the tree. The larger the index $l$ of an attribute $a(l)$, the higher its semantic level. If $l &lt; m$, then we call attribute $a(m)$ a parent attribute of attribute $a(l)$.</p><p>We divide the attributes into two groups based on their working scope: shared attributes and non-shared attributes. Shared attributes, such as type, door number and color, are global variables, so samples with the same values of these attributes can have different parent attributes. If two samples share a common shared attribute (e.g. color), they may have the same or different parent attributes. However, if two samples share a common non-shared attribute (e.g. model), then they must have the same parent attribute (e.g. car make). In general, attributes at higher hierarchies are closer to the concept of super-categories (labels), whilst those at lower hierarchies are closer to the identity. In short, two samples of the same identity always have identical attributes.
Hierarchies can be used to describe the semantic differences between two samples at a high level of understanding. Given two samples $x_i^u$ and $x_j^v$ with attribute vectors $a_i^u$ and $a_j^v$, we can define the semantic distance between them as:</p><p>(2) where $l_{ij}$ ($1 \le l_{ij} \le N_a$) is the index of the lowest hierarchy at which the two samples have different attributes while sharing all parent attributes above it. $N(a(l))$ denotes the number of child nodes (sub-trees) of $a(l)$. $I$ is an indicator function: $I(a_i^u(l) \neq a_j^v(l)) = 1$ if $a_i^u(l)$ and $a_j^v(l)$ are not the same, and $I(a(l)) = 1$ if $a(l)$ is a shared attribute. Obviously, we have $D_s(a_i^u, a_j^v) = 0$ when the two samples have the same value at the leaf node.</p></div>
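<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a simplified illustration of the index $l_{ij}$ described above, the following Python sketch finds the hierarchy level at which two attribute vectors diverge. It is not the full semantic distance: it assumes that attributes are stored from the leaf (level 1) up to the root (level $N_a$), it ignores the shared-attribute and child-count terms, and it returns 0 for identical vectors. The function name and the example attribute values are hypothetical.</p>
<p><![CDATA[
```python
from typing import Sequence


def divergence_level(attrs_a: Sequence[str], attrs_b: Sequence[str]) -> int:
    """Return the highest level at which two attribute vectors differ.

    Attributes are ordered leaf-to-root, so index 0 holds level 1.
    All levels above the returned one agree; 0 means the vectors are identical.
    """
    if len(attrs_a) != len(attrs_b):
        raise ValueError("attribute vectors must have the same length")
    for level in range(len(attrs_a), 0, -1):
        if attrs_a[level - 1] != attrs_b[level - 1]:
            return level
    return 0


# Hypothetical vectors ordered as (year, model, make).
anchor = ("2012", "model_a1", "make_a")
same_identity = ("2012", "model_a1", "make_a")
other_year = ("2014", "model_a1", "make_a")
other_model = ("2012", "model_a2", "make_a")
other_make = ("2012", "model_b1", "make_b")

levels = [
    divergence_level(anchor, candidate)
    for candidate in (same_identity, other_year, other_model, other_make)
]
```
]]></p>
<p>A larger level means that the two samples separate closer to the root of the tree, which corresponds to a larger semantic distance in the sense used above.</p></div>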
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Semantic sampling</head><p>Given a training data set $X$ with corresponding semantic attribute set $A$, a small batch of samples can be selected according to the semantic structure $T$, so that the complex relationships among the samples in this mini-batch can be fully explored to guide the learning of the embedding.</p><p>The sampling process is as follows. First, we randomly select a pair of samples $x_1^u$ and $x_2^v$ with the same attributes from two views $u$ and $v$. Obviously, $a_1^u = a_2^v$ and $l_{12} = 0$. Sample $x_1^u$ is treated as the anchor (reference) and $x_2^v$ as the positive sample. Next, we randomly select a sample $x_3^v$ from view $v$ as the first negative sample by moving one step up the tree, so that $l_{13} = 1$. This sample is similar to the anchor and differs from it in only one attribute. Then, to select more negative samples, we repeat the sampling step towards the root of the tree by gradually incrementing $l_{1j}$. At higher hierarchies, each time we increase $l_{1j}$, all the shared attributes of the lower hierarchies are reconsidered. By changing $\sum_{l \le l_{1j}} I(a_1^u(l) \neq a_j^v(l))\, I(a(l))$ by one at a time, we select samples ranging from one with exactly the same shared attributes as the anchor to one with completely different shared attributes. Finally, we obtain a mini-batch $X_B$, in which the first sample is the anchor, the second is the positive sample and all the others are sorted negative samples.</p><p>The sampled mini-batch has two distinctive characteristics. On the one hand, any two adjacent samples form a hard pair (this answers the first question Q1: considering hard pairs leads to faster convergence). On the other hand, the semantic distance between the anchor and the samples from the second to the last in the mini-batch is monotonically non-decreasing.</p></div>
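<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sampling steps above can be sketched as follows. This is a simplified illustration rather than the procedure used in the paper: it walks only the non-shared hierarchy, draws one sample per divergence level, and does not re-enumerate the shared attributes at each level. The Sample layout, the function names and the toy data are hypothetical.</p>
<p><![CDATA[
```python
import random
from typing import Dict, List, Sequence, Tuple

# (sample id, view index, attributes ordered leaf-to-root); a hypothetical layout.
Sample = Tuple[str, int, Tuple[str, ...]]


def divergence_level(attrs_a: Sequence[str], attrs_b: Sequence[str]) -> int:
    """Highest level (1-based, leaf first) at which the vectors differ; 0 if equal."""
    for level in range(len(attrs_a), 0, -1):
        if attrs_a[level - 1] != attrs_b[level - 1]:
            return level
    return 0


def sample_ranked_batch(dataset: Sequence[Sample], anchor_view: int,
                        target_view: int, rng: random.Random) -> List[Sample]:
    """Return [anchor, positive, negatives...] sorted by divergence level."""
    anchors = [s for s in dataset if s[1] == anchor_view]
    targets = [s for s in dataset if s[1] == target_view]
    rng.shuffle(anchors)
    for anchor in anchors:
        by_level: Dict[int, List[Sample]] = {}
        for candidate in targets:
            level = divergence_level(anchor[2], candidate[2])
            by_level.setdefault(level, []).append(candidate)
        if 0 not in by_level:
            continue  # this anchor has no positive sample in the target view
        batch = [anchor]
        for level in sorted(by_level):
            batch.append(rng.choice(by_level[level]))
        return batch
    raise ValueError("no cross-view positive pair found")


toy_dataset: List[Sample] = [
    ("a", 0, ("2012", "model_a1", "make_a")),
    ("b", 1, ("2012", "model_a1", "make_a")),
    ("c", 1, ("2014", "model_a1", "make_a")),
    ("d", 1, ("2012", "model_a2", "make_a")),
    ("e", 1, ("2012", "model_b1", "make_b")),
]
batch = sample_ranked_batch(toy_dataset, anchor_view=0, target_view=1,
                            rng=random.Random(0))
batch_levels = [divergence_level(batch[0][2], s[2]) for s in batch[1:]]
```
]]></p>
<p>By construction, the levels of the samples after the anchor are non-decreasing, which mirrors the monotonic ordering of semantic distances in the sampled mini-batch.</p></div>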
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ranked semantic distance loss</head><p>Most existing contrastive methods <ref type="bibr">[Schroff et al., 2015;</ref><ref type="bibr">Huang et al., 2016]</ref> have some potential limitations in how they sample: 1) Triplets can only be defined by labels, so fine-grained categories or attributes are not modelled, and the methods therefore cannot handle more challenging tasks such as identification and verification. 2) Hard samples can only be selected from within mini-batches, so the selection is limited. 3) In most models, mini-batches are randomly generated. Consequently, to learn from more triplets or quadruplets, most models must draw additional mini-batches, which can be computationally expensive.</p><p>Therefore, to improve the efficiency of sampling, this paper proposes a ranked semantic distance loss on the mini-batch to guide the learning of the embedding (this answers the second question Q2: the ranked semantic distance shown in Fig. 1 is what needs to be preserved). Given a mini-batch $X_B$ sampled according to the semantic structure, we define the ranked semantic distance loss as:</p><p>where $[\cdot]_+$ denotes the hinge function. Minimizing the above loss guarantees that, in the learned Hamming space, the relative semantic distances between the anchor, the positive sample and the negative samples are preserved. Since $j &gt; i$, we have $D_s(a_1^u, a_j^v) - D_s(a_1^u, a_i^v) &gt; 0$ according to the semantic structure. A few simple conclusions can be drawn: 1)</p><p>2) Furthermore, if there are only two negative samples in the mini-batch, then the proposed loss is the quadruplet loss <ref type="bibr">[Huang et al., 2016]</ref>. 3) If there is only one negative sample in the mini-batch and the semantic distance is fixed to a constant value, then it becomes a triplet loss <ref type="bibr">[Schroff et al., 2015]</ref>.</p><p>The above loss is defined with the anchor fixed to the first sample of the mini-batch.
In fact, when we use the subsequent samples as anchors, we can also explore the indirect relationships implied by the semantic structure.</p><p>Theorem 1. Given three samples $a_i^v$, $a_j^v$ and $a_k^v$ in the mini-batch $X_B$, in which samples are sorted and sampled according to the semantic tree, if $l_{1i} &lt; l_{1j} &lt; l_{1k}$, then the following distance inequality<ref type="foot">foot_1</ref> holds:</p><p>This theorem means that more comparative relationships can be discovered based on the ranked semantic sampling.</p><p>Hence, the additional information can further facilitate the training of the model and the mining of comparative features without additional sample accesses. $R(X_B)$ considers the explicit relationships in the sampled mini-batch, while, based on Theorem 1, the implied relationships in the mini-batch $X_B$ can be explored as well. Therefore, we need to minimize the following loss:</p><p>(5)</p></div>
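<div xmlns="http://www.tei-c.org/ns/1.0"><p>The printed form of the ranked semantic distance loss is not reproduced in this copy, so the following Python sketch only illustrates the idea described in the text. It assumes a pairwise hinge over all ordered pairs $i &lt; j$ of non-anchor samples, with the difference of semantic distances used as the margin; both the exact summation range and the margin form are assumptions, and the function name and numbers are hypothetical.</p>
<p><![CDATA[
```python
from typing import Sequence


def ranked_hinge_loss(hamming_to_anchor: Sequence[float],
                      semantic_to_anchor: Sequence[float]) -> float:
    """Assumed pairwise hinge over a ranked mini-batch.

    Index 0 is the positive sample and the remaining entries are negatives
    sorted by non-decreasing semantic distance to the anchor. A pair (i, j)
    with i < j is penalised when sample j is not farther from the anchor than
    sample i by at least the semantic margin D_s(1, j) - D_s(1, i).
    """
    if len(hamming_to_anchor) != len(semantic_to_anchor):
        raise ValueError("both distance lists must have the same length")
    total = 0.0
    count = len(hamming_to_anchor)
    for i in range(count):
        for j in range(i + 1, count):
            margin = semantic_to_anchor[j] - semantic_to_anchor[i]
            total += max(0.0, hamming_to_anchor[i] - hamming_to_anchor[j] + margin)
    return total


# Hypothetical distances to the anchor for [positive, negative 1, negative 2].
semantic = [0.0, 1.0, 3.0]
well_ranked = ranked_hinge_loss([0.0, 2.0, 5.0], semantic)
mis_ranked = ranked_hinge_loss([0.0, 4.0, 3.0], semantic)
```
]]></p>
<p>With a single negative sample and a constant margin, this assumed form reduces to the familiar triplet hinge, which agrees with the third conclusion stated above.</p></div>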
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Optimization</head><p>To find the best function for preserving the semantic distance, we need to minimize the quantity $R^{+}(X_B) + R(X_B)$.</p><p>Unfortunately, the Hamming distance is a discrete quantity defined through a sign function that is not differentiable at zero. The most straightforward remedy is to replace the sign function directly with the auxiliary continuous variable $F(x)$, ignoring the difference between $y$ and $F(x)$. The basic problem with this strategy is that the gap is likely to destroy the properties preserved by $F(x)$. In this paper, we solve this problem by minimizing the difference between the Hamming distance and the Euclidean distance in the learned space $F(x)$.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Quantization Loss</head><p>An auxiliary Euclidean distance between $x_i^u$ and $x_j^v$ is defined as $D_e^2(F(x_i^u), F(x_j^v)) = \|F(x_i^u) - F(x_j^v)\|_2^2$, which is differentiable. The following theorem provides an upper bound on the quantization loss between the Hamming distance $4D_h(y_i^u, y_j^v)$ and the auxiliary distance $D_e^2(F(x_i^u), F(x_j^v))$.</p><p>Theorem 2. Given a functional hypothesis space $F$ and a small value $\epsilon$, for any two samples $x_i^u$ and $x_j^v$, the following probability inequality holds:</p><p>where C = 2 3 ln 2 + 2 ln 3 + 2 ln K and $L(F(x)) = e_K^{T} \ln\big(F(x) \circ F(x) - e_K\big)^2$. The symbol $\circ$ denotes the Hadamard (entry-wise) product, $\ln(\cdot)^2$ is applied element-wise to each entry of its argument, and $e_K$ is a column vector of length $K$ whose entries are all one.</p><p>Obviously, searching for a function in the hypothesis space by minimizing $L(F(x_i^u))$ and $L(F(x_j^v))$ reduces the right-hand side of the probability inequality. The smaller the right-hand side, the greater the probability that the difference between the two distances $D_e^2(F(x_i^u), F(x_j^v))$ and $4D_h(y_i^u, y_j^v)$ lies within a small value $\epsilon$. This means that minimizing the right-hand side brings $D_e^2(F(x_i^u), F(x_j^v))$ closer to $4D_h(y_i^u, y_j^v)$. In fact, $L(F(x_i^u))$ is a quantization loss defined on $F$, which projects $x_i^u$ onto $y_i^u$. When all the entries of $F(x_i^u)$ are either 1 or -1, $L(F(x_i^u))$ reaches its minimum.</p></div>
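<div xmlns="http://www.tei-c.org/ns/1.0"><p>The factor 4 in $4D_h$ can be checked numerically: for codes with entries in {-1, +1}, every differing bit contributes $(1 - (-1))^2 = 4$ to the squared Euclidean distance. The Python sketch below verifies this identity and shows the gap that appears when the embedding is not yet binary. The embedding values and function names are hypothetical, and the squared Euclidean form of the auxiliary distance is an assumption consistent with the text.</p>
<p><![CDATA[
```python
from typing import List, Sequence


def sign_code(embedding: Sequence[float]) -> List[int]:
    """Binarize an embedding to {-1, +1}; zero is mapped to +1 by assumption."""
    return [1 if value >= 0 else -1 for value in embedding]


def hamming_distance(code_a: Sequence[int], code_b: Sequence[int]) -> int:
    """Number of positions at which the two codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def squared_euclidean(vec_a: Sequence[float], vec_b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(vec_a, vec_b))


# Hypothetical relaxed embeddings F(x_i^u) and F(x_j^v).
f_u = [0.9, -1.1, 0.2, -0.8]
f_v = [-1.0, -0.7, 0.6, 1.2]
y_u, y_v = sign_code(f_u), sign_code(f_v)

four_hamming = 4 * hamming_distance(y_u, y_v)
exact_distance = squared_euclidean(y_u, y_v)   # equals four_hamming for binary codes
relaxed_distance = squared_euclidean(f_u, f_v)
quantization_gap = abs(relaxed_distance - four_hamming)
```
]]></p>
<p>The closer the entries of the embedding are to -1 or +1, the smaller quantization_gap becomes, which is the effect that the quantization loss is meant to encourage.</p></div>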
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Overall Objective</head><p>Therefore, we can optimize $R^{+}(X_B) + R(X_B)$ by substituting the auxiliary distance $D_e^2/4$ for the Hamming distance $D_h$. Hence, we obtain</p><p>To guarantee the learned Hamming distance, the quantization loss $L(F(x))$ of all samples in the mini-batch $X_B$ should be minimized simultaneously. Our overall objective can then be defined as:</p><p>where $\lambda$ is a balance parameter. Importantly, the objective function is differentiable, so an optimal hash function can be searched for directly using stochastic gradient descent (SGD) on the mini-batches produced by structural sampling. The derivatives of $O$ w.r.t. $F$ are discussed in the supplementary material. Finally, the derivative of the objective $O$ w.r.t. the parameter $\theta$ of function $F$ can be obtained using the chain rule: $\frac{\partial O}{\partial \theta} = \frac{\partial O}{\partial F} \frac{\partial F}{\partial \theta}$. $\theta$ is updated during the training stage by using the derivatives on the mini-batches.</p><p>The architecture comprises <ref type="bibr">[Szegedy et al., 2015]</ref> style Inception models, a view-specific fully connected layer, and a binarization layer. The view-specific embedding layer consists of 640 units, which are fully connected to the previous layer. The 640 units are divided into 5 groups of 128 units, each group corresponding to one view. In the learning phase, the first two components are updated using the objective in Eq. (7). Batch normalization is used for each mini-batch. In the testing phase, the binary code is obtained by binarizing the embedded values. With efficient Boolean operations, fast vehicle search can be achieved in the learned Hamming space.</p></div>
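<div xmlns="http://www.tei-c.org/ns/1.0"><p>The testing phase described above (select the 128 units of a view, binarize them, and compare codes with Boolean operations) can be sketched in Python as follows. This is an illustration under stated assumptions: the groups are assumed to be contiguous and ordered by view index, which the text does not specify, and the function names and toy layer outputs are hypothetical.</p>
<p><![CDATA[
```python
from typing import List, Sequence

NUM_VIEWS = 5          # front, rear, side, front-side, rear-side
UNITS_PER_VIEW = 128   # 640 units divided into 5 groups


def view_units(layer_output: Sequence[float], view_index: int) -> List[float]:
    """Return the 128 units of one view (assumes contiguous, ordered groups)."""
    if len(layer_output) != NUM_VIEWS * UNITS_PER_VIEW:
        raise ValueError("expected 640 embedding units")
    if not 0 <= view_index < NUM_VIEWS:
        raise ValueError("view_index out of range")
    start = view_index * UNITS_PER_VIEW
    return list(layer_output[start:start + UNITS_PER_VIEW])


def pack_code(units: Sequence[float]) -> int:
    """Binarize by sign and pack the bits into a single integer."""
    packed = 0
    for value in units:
        packed = (packed << 1) | (1 if value >= 0 else 0)
    return packed


def hamming_packed(code_a: int, code_b: int) -> int:
    """Hamming distance via XOR followed by a population count."""
    return bin(code_a ^ code_b).count("1")


# Toy layer outputs: the probe comes from view 0, the gallery image from view 1.
probe_output = [1.0] * 640
gallery_output = [1.0] * 640
gallery_output[128] = -1.0   # first unit of the view-1 group
gallery_output[130] = -0.5   # third unit of the view-1 group

probe_code = pack_code(view_units(probe_output, 0))
gallery_code = pack_code(view_units(gallery_output, 1))
distance = hamming_packed(probe_code, gallery_code)
```
]]></p>
<p>Packing each 128-bit code into an integer means that one comparison costs a single XOR and a bit count, instead of a floating-point distance over a high-dimensional feature vector.</p></div>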
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Cross-View Vehicle Re-ID</head><p>CompCars <ref type="bibr">[Yang et al., 2015]</ref> was originally collected for the tasks of fine-grained categorization and verification. This dataset contains a total of 135,846 images capturing entire cars from 2,004 car models. Every image is labelled with one of five views: front (V1), rear (V2), side (V3), front-side (V4) and rear-side (V5). Conveniently, the dataset provides three hierarchies of attributes (make, model and year) as well as attributes shared between different models (maximum speed, displacement, door number, seat number and type). To make the dataset more suitable for Re-ID, we manually label each image with one of 12 colors, an attribute that the dataset does not provide but that is important for identification.</p><p>In total, we select six attributes (make, model, type, door number, color and year) to construct the tree shown in Fig. 2 (b). For each view, 2,000 images are randomly selected for testing and the remaining samples are used for training. We compare our proposed RSS based binary embedding with the original work in <ref type="bibr">[Yang et al., 2015]</ref> based on <ref type="bibr">CNN [LeCun et al., 1989</ref>] and 4 state-of-the-art cross-modal hashing methods, including <ref type="bibr">CMFH [Ding et al., 2014]</ref>, CVH <ref type="bibr">[Kumar and Udupa, 2011]</ref>, <ref type="bibr">PDH [Rastegari et al., 2013]</ref> and CMSSH <ref type="bibr">[Bronstein and Bronstein, 2010]</ref>.</p><p>Our method can learn the embeddings for all views at the same time. However, since all the other hashing methods can handle only two-view problems, we run these methods separately for every pair of views. The CNN features are used as input to all the other hashing methods.
In addition to learning directly from images (i.e., RSS), we also investigate the performance of our model with linear embeddings and CNN features <ref type="bibr">[Yang et al., 2015]</ref> (Linear-RSS+CNN). From Fig. <ref type="figure">4</ref> (a), we can see that RSS and Linear-RSS+CNN consistently outperform the other methods and that RSS achieves better results than Linear-RSS+CNN. Moreover, we observe that the original model <ref type="bibr">[Yang et al., 2015]</ref> can hardly address the challenging task of cross-view Re-ID directly, but that performance can be greatly improved by modelling cross-view relationships. This clearly shows that, in order to deal with cross-view tasks, it is necessary to model a view-invariant distance.</p><p>The detailed results of cross-view Re-ID are shown in Table 1. We arrive at the same conclusion: the proposed RSS performs the best for cross-view Re-ID across the 20 different settings. From this table, we can also see that the rear-side view (V5) seems to be easy to recognize, and that most of the methods obtain better results in two settings than in the others: 1) rear-side and rear views, and 2) front-side and front views.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Knowledge Transfer for Real-World Re-ID</head><p>VeRi <ref type="bibr">[Liu et al., 2016b]</ref> was collected from real-world urban surveillance scenes and contains a total of 776 vehicles captured by 19 cameras. 37,778 images of 576 vehicles are used for training, while the remaining 13,257 images of 200 vehicles are used for testing. Our experimental setup is the same as in the original report <ref type="bibr">[Liu et al., 2016b</ref>], but we use only the images and do not rely on license plate recognition. In this section, we evaluate the ability of RSS to transfer knowledge from the large-scale dataset CompCars to vehicle Re-ID on the real-world dataset VeRi. The basic RSS model is first trained on CompCars and then fine-tuned on the training set of VeRi.</p><p>To make VeRi suitable for cross-view Re-ID, we manually label the camera view of each image in the VeRi dataset, following the setting of the CompCars dataset. Then, by using the fine-tuned model, the binary code of an image can be obtained directly. Three methods, <ref type="bibr">GoogLeNet [Szegedy et al., 2015]</ref>, <ref type="bibr">FACT [Liu et al., 2016b]</ref>, and AlexNet <ref type="bibr">[Krizhevsky et al., 2012]</ref>, are used to compare the performance of cross-view Re-ID, without considering same-view pairs. In addition to cross-view Re-ID, we also conduct cross-camera Re-ID under the same settings as <ref type="bibr">[Liu et al., 2016b]</ref> and select two other models, Bow-SIFT <ref type="bibr">[Lowe, 1999]</ref> and Bow-CN [van de <ref type="bibr">Weijer et al., 2007]</ref>, from the original VeRi paper <ref type="bibr">[Liu et al., 2016b]</ref>. From Fig. <ref type="figure">4</ref> (b), we can see that RSS consistently achieves better results than the other methods in both the cross-view and cross-camera setups.
In particular, by exploiting the ranked semantic distance, RSS outperforms FACT, which combines three features. The underlying reason is that the ranked semantic distance helps to discover identity features through a series of comparisons. Furthermore, we observe that the cross-camera Re-ID tasks are much easier than the cross-view tasks, regardless of the method used. A likely reason is that most cameras have similar views, and samples from two similar views are easily matched. For example, in the second row of Fig. <ref type="figure">3</ref>, 10 images of the same car were taken by 10 cameras, yet the appearances in the first and fourth images are very similar. In conclusion, the experiments show that ranked semantic distances do benefit the mining of identity features and can be used for real-world Re-ID in both cross-view and cross-camera settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Complexity Analysis of Re-ID</head><p>Matching efficiency is the most important factor in a real-world system because CCTV cameras can automatically collect millions of images. However, almost all existing vehicle Re-ID algorithms focus mainly on improving accuracy by integrating various complex modules. To the best of our knowledge, this is the first efficient algorithm that implements fast vehicle Re-ID while achieving competitive results.</p><p>To study the complexity of matching, we compare RSS with the above five methods and three other models: FACT++ <ref type="bibr">[Liu et al., 2016b]</ref>, <ref type="bibr">OIFE+ST [Wang et al., 2017]</ref> and CNN+LSTM <ref type="bibr">[Shen et al., 2017]</ref><ref type="foot">foot_2</ref>. To our knowledge, these cover almost all the results reported on the VeRi dataset. With the exception of RSS, GoogLeNet and AlexNet, which are end-to-end algorithms without additional modules, the other methods are very complex systems that use multiple CNNs and spatial-temporal regularization. These extra modules are often very computationally expensive. For example, plate detection and recognition using tens of thousands of sliding windows is a daunting task in itself. To keep the focus on Re-ID, the position of the license plate in the VeRi dataset is manually annotated. In general, given an unseen probe image, the matching consists of two steps: projecting the sample and computing its similarity to the samples in the gallery.</p><p>Table <ref type="table">2</ref> compares the time complexity of projection and matching as well as the storage requirements. Note that the computation time of the hand-crafted features of BOW-SIFT and BOW-CN is included but, for simplicity, the computation time of the additional models required by other methods is excluded.
From this table, first, we can see that RSS can be much faster than the other methods<ref type="foot">foot_3</ref>, except for two models with a similar deep architecture. Especially compared with methods that use multiple CNNs, the benefits of RSS are even clearer. Second, and more importantly, the matching time of most methods is hundreds of times that of RSS; the exception is OIFE+ST, whose matching time is still at least 42 times that of RSS. In fact, the overall efficiency depends mainly on the number of samples N in the gallery, which is usually huge in practice. In short, a Re-ID task that takes less than an hour with RSS would take nearly two days or more with the other methods. Finally, the storage advantage of RSS is also significant, as all the other methods require at least 64 times the capacity to store their features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, a new deep binary embedding method is proposed for the challenging task of cross-view vehicle re-identification. Its significant advantage is that the ranked semantic distance is view-invariant and, through a series of comparisons, helps to discover identity features that can be preserved in the learned Hamming space. The experimental results show that the preserved semantic distance leads to better results and that the deep architecture learned on one dataset can be transferred to real-world vehicle Re-ID on another. In the future, the ranked semantic distance can be applied to many other areas of computer vision, such as object classification and verification. Moreover, theoretically, one can derive a tighter upper bound for the inequality in Theorem 2.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>&#8801; means all corresponding items are the same.</p></note><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>We refer to <ref type="bibr">[Shen et al., 2017]</ref> for the complexity analysis.</p></note><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>Quantization of hand-crafted features is very computationally expensive when the codebook is huge.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>All proofs will be provided in supplementary materials.</p></note>
		</body>
		</text>
</TEI>
