<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Learning Multi-instance Enriched Image Representations via Non-greedy Ratio Maximization of the l1-Norm Distances</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/01/2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10084445</idno>
					<idno type="doi">10.1109/CVPR.2018.00806</idno>
					<title level='j'>2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Kai Liu</author><author>Hua Wang</author><author>Feiping Nie</author><author>Hao Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Multi-instance learning (MIL) has demonstrated its usefulness in many real-world image applications in recent years. However, two critical challenges prevent one from effectively using MIL in practice. First, existing MIL methods routinely model the predictive targets using the instances of input images, but rarely utilize an input image as a whole. As a result, the useful information conveyed by the holistic representation of an input image could be potentially lost. Second, the varied numbers of the instances of the input images in a data set make it infeasible to use traditional learning models that can only deal with single-vector inputs. To tackle these two challenges, in this paper we propose a novel image representation learning method that can integrate the local patches (the instances) of an input image (the bag) and its holistic representation into one single-vector representation. Our new method first learns a projection to preserve both global and local consistencies of the instances of an input image. It then projects the holistic representation of the same image into the learned subspace for information enrichment. Taking into account the content and characterization variations in natural scenes and photos, we develop an objective that maximizes the ratio of the summations of a number of L1 -norm distances, which is difficult to solve in general. To solve our objective, we derive a new efficient non-greedy iterative algorithm and rigorously prove its convergence. Promising results in extensive experiments have demonstrated improved performances of our new method that validate its effectiveness.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Learning images representations plays an important role in many real-world applications due to the overwhelming amount of images and videos nowadays brought by modern technologies. Recently, image representation techniques using semi-local, or patch-based, features, such as SIFT and geometric blur, have demonstrated some of the best performance in image retrieval and object recognition applications. These algorithms choose a set of patches in an image, and for each patch compute a fixed-length feature vector. This gives a set of vectors per image, where the size of the set can vary from image to image. Armed with these patch-based features, image categorization and retrieval are recently formulated as a multi-instance learning (MIL) problem with improved retrieval, indexing and annotation performances <ref type="bibr">[19,</ref><ref type="bibr">3,</ref><ref type="bibr">40,</ref><ref type="bibr">30,</ref><ref type="bibr">28,</ref><ref type="bibr">31]</ref>. Under the framework of MIL, an image is viewed as a bag, which contains a number of instances corresponding to the patches in the image. For example, in the image in Figure <ref type="figure">1</ref> there exist a total of four patches (surrounded by the yellow bounding boxes) that represent a set of four different objects, including a car, a bicycle, and two persons (one person is riding the bicycle and the other one is standing aside). The image thereby is a bag and each of the four patches, represented as a vector, in the image is consider as an instance. If any of these instances is related to a semantic concept, the entire image will be annotated with the corresponding semantic label.</p><p>Despite a number of successes in applying MIL in image learning, there exist two critical challenges that prevent one from effectively using available visual information in an image as much as possible.</p><p>First, most, if not all, existing MIL methods only use the instances of an input image to model the semantic concepts, but not the entire image. For example, when a MIL method <ref type="bibr">[19,</ref><ref type="bibr">3,</ref><ref type="bibr">40,</ref><ref type="bibr">30,</ref><ref type="bibr">28,</ref><ref type="bibr">31,</ref><ref type="bibr">24,</ref><ref type="bibr">35,</ref><ref type="bibr">38]</ref> is used to study the image in Figure <ref type="figure">1</ref>, only the four instances are associated with some semantic concepts. However, these four patches only occupy a small percent of the area of the entire image, while the remained areas of the image that are outside of the yellow bounding boxes are completely dis- carded, which, though, could potentially convey valuable semantic information. For instance, a large area of this picture is "ground", which is closely correlated to "automobiles" and "bicycles" and could be used to improve the image categorization accuracy <ref type="bibr">[25,</ref><ref type="bibr">26,</ref><ref type="bibr">27]</ref>. Indeed, recent studies have already demonstrated that holistic image representations based upon global features are a necessity to decode scene contents <ref type="bibr">[10,</ref><ref type="bibr">37]</ref>. Therefore, it is desirable to learn an image representation that is able to capture both instance-wise and holistic information of an input image.</p><p>Second, in MIL an image is represented as a set of vectors and the numbers of the vectors in the images of a data set are different in general. Although the multiple-vector representation could describe the image details with better granularity, varied data sizes make it infeasible to use traditional machine learning models that can only deal with data represented by single-vectors, i.e., one vector per data sample. Therefore, it would be beneficial to learn a singlevector representation for an image that can integrate the information from both its instances and its entire context.</p><p>To address the above two challenges in multi-instance image learning, in this paper we present a novel image representation learning method. It first learns a projection from the instances of an input image. Then it projects its holistic image representation into the learned subspace. A schematic illustration of our new method is shown in Figure <ref type="figure">1</ref>. Through these procedures, the learned image representation simultaneously captures the information from both multi-instance image patches and the holistic summarization of the entire image. In the proposed objective to learn the projection from the instances of an input image, we aim to preserve both global and local consistencies of the instances in the projected subspace, which leads to an optimization problem that maximizes the ratio of matrix traces. By further recognizing the variations of the content and visual characterization in natural scenes and photos, we further develop the proposed objective by replacing the squared &#8467; 2 -norm distances by the &#8467; 1 -norm distances in our formulation, such that the robustness of the learned image representations against outlying samples and features is promoted <ref type="bibr">[2,</ref><ref type="bibr">4,</ref><ref type="bibr">9,</ref><ref type="bibr">15,</ref><ref type="bibr">16,</ref><ref type="bibr">36,</ref><ref type="bibr">20,</ref><ref type="bibr">34]</ref>.</p><p>Despite its clear motivations to integrate the information of an input image from both its local instances and its holistic representation, the proposed objective ends up to be an optimization problem that simultaneously maximizes and minimizes the summations of a number of &#8467; 1 -norm distances, which is difficult to solve in general. To solve this challenging optimization problem, we derived an efficient iterative algorithm with theoretically guaranteed convergence. It is worth noting that, different from many previous works, our new solution algorithm is a non-greedy algorithm, such that it has better chance to find the global optima. To the best of our knowledge, our new algorithm solves the general optimization problem that maximizes the ratio of the summations of the &#8467; 1 -norm distances in a non-greedy way for the first time in literature, which can find many applications to improve a number of machine learning models.</p><p>Finally, we performed extensive experiments on three benchmark multi-instance image data sets, the promising experimental results have demonstrated the effectiveness of our new method in image learning applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Learning multi-instance enriched image representations in the single-vector format</head><p>In this section, first we formalize the representation learning problem for images with semantic patches, where we introduce the notations used in this paper. Then we gradually develop the proposed objective to learn a single-vector representation of an input image that captures both global and local consistencies of the image patches and integrates the holistic information conveyed by the entire image.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Notations and problem formalization</head><p>Throughout this paper, we write matrices as bold uppercase letters and vectors as bold lowercase letters. The trace of the matrix M = [&#119898; &#119894;&#119895; ] is defined as tr (M) =</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8721;</head><p>&#119894; &#119898; &#119894;&#119894; , and the &#8467; 1 -norm of M is defined as</p><p>&#8730; &#8721; &#119894; &#119907; 2 &#119894; . In image retrieval and annotation tasks, we study a set of images and every image contains a collection of semantically meaningful patches. For a given image, we represent it as &#119987; = {x, X}, where x &#8712; &#8476; &#119889; denotes the holistic representation of the entire image and X = [x 1 , . . . , x &#119899; ] &#8712; &#8476; &#119889;&#215;&#119899; denotes a collection of &#119899; semantic patches, respectively. Here x &#119894; &#8712; &#8476; &#119889; (for &#119894; = 1, 2, . . . , &#119899;) represents a patch of the input image, which can be illustrated as a yellow box in the image in Figure <ref type="figure">1</ref>. Under this framework, every image is considered as a bag of instances (patches). In general, the numbers of the semantic patches in the images of a data are different from one to another. In order to tackle the two critical challenges analyzed before in Section 1, different from existing MIL studies that model the associations between the instances of input images and the predictive targets directly, in this paper we aim to learn from an input image &#119987; a single-vector representation of y = &#119891; (&#119987; ) that captures the information in both local patches and the entire image. Because the new representations of the images in a data set are of the same length, they can be readily used by traditional learning models in various image learning tasks. In the following, we use instance and image patches interchangeably when there is no risk of ambiguity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Our objective</head><p>In this subsection, we will develop the proposed objective to learn a new single-vector representation for an input image from its holistic representation and its semantic instances. When we integrate the holistic representation and the semantic instances of the input image, we aim to preserve both global and local consistencies among the semantic instances in the learned projected subspace.</p><p>Learning with global consistency via PCA. With recent advances in digital imaging techniques, one can easily have a camera with very high resolution. As a result, the derived visual descriptors from a raw picture are usually of high dimensionality. When the image dimensionality grows, most image retrieval and annotation methods will fail due to "the curse of dimensionality" <ref type="bibr">[8]</ref> and intractable computational costs. Thus, learning a lower-dimensional image representation while maintaining the original geometrical structures of the input image is valuable for practical use. To achieve this goal, principal component analysis (PCA) <ref type="bibr">[14]</ref> is the right tool to preserve as much information as possible by learning a projection W &#8712; &#8476; &#119889;&#215;&#119903; (usually &#119903; &#8810; &#119889;) from the semantic instances X of the input image &#119987; , which maps x &#119894; in the high &#119889;-dimensional space into a vector y &#119894; in a lower &#119903;-dimensional space by computing y &#119894; = W &#119879; x &#119894; , such that the overall variance of the input data in the projected space &#8476; &#119903; is maximized. Formally, let the global mean vector of the input data X as x = 1 &#119899; &#8721; &#119899; &#119894;=1 x &#119894; , PCA seeks the projection W by maximizing the following objective:</p><p>where</p><p>&#119879; is the covariance matrix of X and the constant factor 1 &#119899; is omitted for brevity. Because &#119973; Global maximizes the global variance of the instances of the input image in the projected subspace, the projected instances via the learned projection W are globally consistent in terms of information preservation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Learning with local consistency via neighborhood variances.</head><p>Besides taking advantage of the global consistency of the semantic instances of the input image, we further take into account the local geometric structures of these semantic instances in the projected subspace and consider their local consistency. Ideally, in the learned subspace the instances with similar semantic labels should be close to each other, while those with different semantic labels should be far away from each other. In other words, in contrast to maximizing the projected global variance, we also want to minimize the local variance around every instance in the projected subspace. Mathematically, denoting the &#119870;-nearest neighbors of x &#119894; as &#119977; &#119894; and the local mean vector of x &#119894; as x&#119894; = 1</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#119870;+1</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8721;</head><p>x&#119895; &#8712;{&#119977;&#119894;&#8746;{x&#119894;}} x &#119895; , we can achieve the overall local consistency by minimizing the following objective:</p><p>where, following our previous work <ref type="bibr">[33]</ref>, we define:</p><p>Obviously, S &#119871;&#119894; computes the local covariance matrix of the data points around x &#119894; . Thus minimizing tr ( W &#119879; S &#119871;&#119894; W )</p><p>ensures the local consistency around x &#119894; and minimizing &#119973; Local in Eq. ( <ref type="formula">2</ref>) ensures the overall local consistency around all the instances in a bag. Here, again the constant factor 1 &#119870;+1 is omitted for brevity.</p><p>Our objective to integrate the global and local consistencies of the semantic instances. Armed with the objectives that can capture the global and local consistencies of the semantic instances of an input image separately, we can develop a combined objective to capture both of them simultaneously. Among several possible ways to combine the two objectives in Eqs. (1-2), we can formulate our new objective using the trace ratio of matrices <ref type="bibr">[12]</ref>, which maximizes the following objective:</p><p>A critical problem of &#119973; &#8467;2 in Eq. ( <ref type="formula">3</ref>) lies in that it computes the ratio of the summations of a number of squared &#8467; 2 -norm distances, which are notoriously known to be sensitive to both outlying sample and outlying features <ref type="bibr">[4,</ref><ref type="bibr">34]</ref>. Many images from natural scenes and photos often have clustered objects. This is particularly true when there exist a crowd of people in a picture, where each individual people may not characterize the the semantic category of "person" appropriately and many instances have to be considered as outlying samples. Similarly, due to cropped objects and (partially) shaded objects in pictures, such as the car in the image in Figure <ref type="figure">1</ref>, outlying features also inevitably exist in real image data sets. Following many previous works <ref type="bibr">[2,</ref><ref type="bibr">4,</ref><ref type="bibr">9,</ref><ref type="bibr">15,</ref><ref type="bibr">16,</ref><ref type="bibr">36,</ref><ref type="bibr">29,</ref><ref type="bibr">32,</ref><ref type="bibr">21,</ref><ref type="bibr">34]</ref>, to deal with the feature and content variances in natural images, we propose to learn the projection by maximizing the following objective:</p><p>in which we compute the summations of the &#8467; 1 -norm distances, because the &#8467; 1 -norm distance can promote the robustness against both outlier samples and outlier features. Upon solving the optimization problem in Eq. ( <ref type="formula">4</ref>), the learned W not only the global variance of the semantic instances of an input image, but also rewards the local geometric structures of the semantic instances, which thereby is both globally and locally consistent in the learned subspace. Then we enrich the holistic representation x of the input image &#119987; by computing y = W &#119879; x, which is a fixed-length single-vector representation and can be readily used by any traditional single-instance machine learning models. This indeed is the main contribution of this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">An efficient solution algorithm</head><p>Our new objective in Eq. ( <ref type="formula">4</ref>) maximizes the ratio of the summations of a number of the &#8467; 1 -norm distances, which is obviously not smooth and thus difficult to solve in general. To solve the general problem that maximizes the ratio of the summations of the &#8467; 1 -norm distances, such as our objective in Eq. ( <ref type="formula">4</ref>), in this section we will derive an efficient iterative algorithm that is non-greedy. We will also prove the convergence of our new solution algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Solving a general ratio maximization problem</head><p>We first generalize the objectives in Eq. (3) and Eq. ( <ref type="formula">4</ref>) into the following general optimization problem and then derive its solution algorithm:</p><p>where &#937; is the feasible domain. Motivated by our previous works <ref type="bibr">[34,</ref><ref type="bibr">33,</ref><ref type="bibr">24]</ref>, we propose a simple, yet efficient, iterative framework in Algorithm 1 to solve the objective in Eq. ( <ref type="formula">5</ref>), whose convergence is rigorously guaranteed by Theorems 1.</p><p>Algorithm 1: Algorithm to solve Eq. ( <ref type="formula">5</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Randomly initialize &#119907;</head><p>Theorem 1. In Algorithm 1, for each iteration we have <ref type="bibr">(1)</ref> </p><p>&#119898;(&#119907; &#119896;-1 ) ; and (2) &#8704;&#120575;, there exists a k such that &#8704;&#119896; &gt; k &#8462;(&#119907; &#119896; ) &#119898;(&#119907; &#119896; ) -&#8462;(&#119907; &#119896;-1 ) &#119898;(&#119907; &#119896;-1 ) &lt; &#120575;.</p><p>Proof. In Algorithm 1, from step 3, we have &#8462;(&#119907; &#119896; ) -&#120582; &#119896; &#119898;(&#119907; &#119896; ) &gt; 0. Because &#8704;&#119907; &#8712; &#937; &#119898;(&#119907;) &gt; 0, we can get</p><p>&#119898;(&#119907; &#119896;-1 ) , which completes the proof of the first statement of Theorem 1.</p><p>Suppose that for the &#119896;-th iteration, there exists a &#119888; &#119896; such that &#8462;(&#119907; &#119896; ) -&#120582; &#119896; &#119898;(&#119907; &#119896; ) = &#119888; &#119896; &gt; 0. We have:</p><p>by which we can derive:</p><p>From Eq. ( <ref type="formula">7</ref>), we can derive:</p><p>Suppose that there exist a positive constant &#119862; such that lim &#119896;&#8594;&#8734; &#8721; &#119896; &#119894;=1 &#119888; &#119894; = &#119862;. If this is not true, we have lim &#119896;&#8594;&#8734; &#8721; &#119896; &#119894;=1 &#119888; &#119894; = &#8734;, by which, together with Eq. ( <ref type="formula">8</ref>), we can derive lim &#119896;&#8594;&#8734; &#8721; &#119896; &#119894;=1 &#8462;(&#119907; &#119896; ) &#119898;(&#119907; &#119896; ) = &#8734;. This, however, contradicts the fact that &#8462;(&#119907; &#119896; ) &#119898;(&#119907; &#119896; ) is bounded as defined in Eq. ( <ref type="formula">5</ref>), which means that lim &#119896;&#8594;&#8734; &#8721; &#119896; &#119894;=1 &#119888; &#119894; = &#119862; holds. Thus, we have lim &#119896;&#8594;&#8734; &#119888; &#119896; = 0, i.e., lim &#119896;&#8594;&#8734; &#119888; &#119896; &#119898;(&#119907; &#119896; ) = 0, which indicates that &#8704;&#120575; &gt; 0, there must exist a k such that:</p><p>by which and Eq. ( <ref type="formula">6</ref>), we have:</p><p>which indicates that Algorithm 1 converges to a local optimum and completes the proof of the second statement of Theorem 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Our algorithm to solve the objective in Eq. (4)</head><p>To solve our objective in Eq. ( <ref type="formula">4</ref>), according to Step 3 in Algorithm 1, we need find a solution that satisfy the constraint of W &#119879; W = I and the following inequality:</p><p>where &#120582; &#119896; is computed by</p><p>and W &#119896;-1 denotes the projection matrix in the (&#119896; -1)-th iteration, which is already known in the &#119896;-th iteration. Here, for notation brevity, we define:</p><p>Now we need solve the problem in Eq. ( <ref type="formula">11</ref>), for which we first introduce the following two lemmas. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 1. [18, Theorem 1] For any vector &#120643;</head><p>First, motivated by Lemma 1 and Lemma 2, we construct the following objective:</p><p>where &#119870;(W) and &#119873; (W) are defined as:</p><p>Here w &#119892; and w &#119896;-1 &#119892; denote the &#119892;-th column of matrices W and W &#119896;-1 , respectively; B and A &#119892; for &#119892; = 1, 2, &#8901; &#8901; &#8901; , &#119903; are defined as follows:</p><p>and sign(&#119909;) is the sign function.</p><p>Then, using the definition of &#119871;(W, W &#119896;-1 ) in Eq. ( <ref type="formula">15</ref>) and Lemmas 1-2, we can prove the following theorem.</p><p>Theorem 2. For any W &#8712; &#8476; &#119889;&#215;&#119903; , we have</p><p>The equality holds on if and only if W = W &#119896;-1 .</p><p>Proof. First, according to Lemma 1 we can compute:</p><p>Then, according to Lemma 2 we have:</p><p>which indicates that:</p><p>Combining Eq. ( <ref type="formula">21</ref>) and Eq. ( <ref type="formula">23</ref>), we can derive:</p><p>According to Lemma 1 and Lemma 2, it is easy to verify that equality holds in Eq. ( <ref type="formula">21</ref>) and Eq. ( <ref type="formula">23</ref>) if and only if W = W &#119896;-1 . Thus, equality holds in Eq. ( <ref type="formula">24</ref>) if and only if W = W &#119896;-1 . This completes the proof of Theorem 2. Now we continue to solve our objective. Let W = W &#119896;-1 , by substituting it into the objective, we have:</p><p>In the &#119896;-th iteration in solving the objective in Eq. ( <ref type="formula">4</ref>), W &#9733; satisfies:</p><p>Then, we have:</p><p>Theorem 2 and Eq. ( <ref type="formula">27</ref>) indicate that the solution of the objective function in Eq. ( <ref type="formula">11</ref>) can be transformed to solve the objective function &#119871;(W, W &#119896;-1 ) &#8805; 0, which can be easily solved by the projected subgradient method with Armijo line search <ref type="bibr">[23]</ref>. The subgradient of &#119871;(W, W &#119896;-1 ) at W is computed as:</p><p>Note that, for any matrix W the operator &#119875; (W) = W ( W &#119879; W ) -1 2 can project it onto an orthogonal cone.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 2: Algorithm to maximize &#119865; (W).</head><p>Input: W &#119896;-1 and Armijo parameter 0 &lt; &#120573; &lt; 1.</p><p>1. Calculate &#120582; &#119896; by Eq. ( <ref type="formula">12</ref>) the subgradient G &#119896;-1 = &#8706;&#119871;(W &#119896;-1 , W &#119896;-1 ) by Eq. ( <ref type="formula">28</ref>) and set &#119898; = 1.</p><p>while not &#119865; (W &#119896; ) &gt; &#119865; (W &#119896;-1 ) = 0 do 2. Calculate W &#119896; = &#119875; (W &#119896;-1 + &#120573; &#119898; G &#119896;-1 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.</head><p>Calculate &#119865; (W &#119896; ) by Eq. ( <ref type="formula">11</ref>). 4. &#119898; = &#119898; + 1.</p><p>Output:</p><p>This guarantees the orthogonality constraint of the projection matrix, i.e. ( W &#119896; ) &#119879; ( W &#119896; ) = I. Algorithm 2 summarizes the algorithm to maximize &#119865; (W) in Eq. <ref type="bibr">(11)</ref>.</p><p>Finally, based on Algorithm 2, we can derive a simple yet efficient iterative algorithm as summarized in Algorithm 3 to solve ratio maximization problem for the &#8467; 1 -norm distances, i.e., our objective in Eq. ( <ref type="formula">4</ref>). Algorithm 3: Algorithm for non-greedy ratio maximization of the &#8467; 1 -norm distances.</p><p>1. Randomly initialize W 0 satisfying ( W 0 ) &#119879; W 0 = I and set &#119896; = 1. while not converge do 2. Calculate &#120582; &#119896; by Eq. ( <ref type="formula">12</ref>).</p><p>3. Find a W &#119896; satisfying &#119865; (W &#119896; ) &gt; &#119865; (W &#119896;-1 ) = 0 by Algorithm 2. 4. &#119896; = &#119896; + 1. Output: W.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Convergence analysis of our algorithm</head><p>Theorem 3. If W &#119896; is the solution of the objective function in Eq. <ref type="bibr">(11)</ref> and satisfies</p><p>Proof. Since W &#119896; is the solution of the objective function in Eq. ( <ref type="formula">11</ref>), we have</p><p>from which we can easily derive:</p><p>Now by substituting Eq. ( <ref type="formula">12</ref>) into Eq. ( <ref type="formula">30</ref>), we have</p><p>which completes the proof of Theorem 3.</p><p>Theorem 4. The objective in Eq. ( <ref type="formula">4</ref>) is upper bounded.</p><p>Proof. First, using Cauchy-Schwarz inequality we have the following for the numerator of our objective in Eq. ( <ref type="formula">4</ref>):</p><p>Obviously, given an input data set, &#8721; &#119899; &#119894;=1 &#119903; &#8741;(x &#119894; -x)&#8741; 2 is a constant, which indicates that the numerator of our objective in Eq. ( <ref type="formula">4</ref>) is upper bounded for a given data set.</p><p>Second, it can be verified that <ref type="bibr">1</ref> , by which we can derive the following for the denominator of our objective in Eq. ( <ref type="formula">4</ref>):</p><p>where &#120582; &#119894; (&#119894; = 1, . . . , &#119903;), ordered by &#120582; 1 &#8804; &#8901; &#8901; &#8901; &#8804; &#120582; &#119903; , are the eigenvalues of S &#119871; . The last inequality in Eq. ( <ref type="formula">33</ref>) is obtained by the Ky Fan's inequality <ref type="bibr">[7]</ref>, which states that tr ( W &#119879; S &#119871; W ) &#8805; &#8721; &#119903; &#119894;=1 &#120582; &#119894; . Again, given an input data set, S &#119871; is an constant matrix thereby &#8721; &#119903; &#119894;=1 &#120582; &#119894; is a constant. Thus the denominator of our objective in Eq. ( <ref type="formula">4</ref>) is lower bounded.</p><p>The two bounds in Eq. ( <ref type="formula">32</ref>) and Eq. ( <ref type="formula">33</ref>) together indicate that our objective in Eq. ( <ref type="formula">4</ref>) is upper bounded. Theorem 3 indicates that our proposed Algorithm 3 monotonically increase the objective function value in each iteration. Theorem 4 indicates that the objective function is upper bounded, which, together with Theorem 3, indicates that Algorithm 3 converges to a local optimum.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>In this section, we experimentally evaluate the proposed image representation method in an automatic image annotation task, where we use the following three multi-instance image data sets: the PASCAL VOC 2010 data set <ref type="bibr">[6]</ref>, the Corel5K data set <ref type="bibr">[5]</ref>, and the Scene data set <ref type="bibr">[40]</ref>. We perform our evaluations using standard 5-fold cross-validation and report the average performances over the 5 trials.</p><p>The proposed image representation learning method has two parameters, the number of neighborhoods &#119870; of an instance and the dimensionality &#119903; of the projected subspace. In our experiments, the performance of the proposed method is very stable with respect to these two parameters in considerably large value ranges. Empirically, in all our experiments we select &#119870; = min {3, &#119899;} where &#119899; is the number of instances in an image bag and &#119903; = &#119889;/10 where &#119889; is the dimensionality of the instance vectors.</p><p>Experimental settings. We first compare our method to two baseline classification methods including support vector machine (SVM) method and the transductive support vector machine (TSVM) <ref type="bibr">[13]</ref> method. The former is the most broadly used supervised classification method in statistical learning, while the latter is an extension of the former one and is a semi-supervised classification method. Because both of these two methods are designed for single-instance data, they are not able to deal with data with representations of varied sizes. Therefore, we train and classify images using the holistic representations of the experimental images. Specifically, for each class we train a one-vs.-others classier using the images in the training data set, and classify the images in the test data set. Gaussian kernel is used in the both methods, i.e., &#119974; (x &#119894; , x &#119895; ) = exp</p><p>, where &#120573; and the regularization box parameter &#119862; are fine tuned by searching the grid of { 10 -5 , . . . , 10 -1 , 1, 10, . . . , 10 5 } via an internal 5-fold cross-validation using the training data of each of the 5 trails. The both methods are implemented using SVM &#119897;&#119894;&#119892;&#8462;&#119905; software package <ref type="bibr">[1]</ref>.</p><p>We also compare our method against two very recent MIL methods including the miGraph <ref type="bibr">[39]</ref> method and the MIMLSVM+ <ref type="bibr">[17]</ref> method. Because miGraph method is a single-label classification method, one-vs.-others strategy is used to conduct classification, one class at a time. We implement these two methods using the codes published by the respective authors. Because the both methods are multiinstance classification methods, we perform classification using the semantic instances of the input images. For our method, once the multi-instance enriched representations of the input images are learned, they can be directly fed into any traditional single-instance classifiers. Thus we evaluate our new image representation learning method using two most broadly used classifiers: the &#119870;nearest neighbour (KNN) classifier and the SVM. In our experiments, we select &#119870; = 3 in KNN classifiers and use the same settings as detailed above the SVM classifiers.</p><p>Experimental results. Because the three experimental image data sets are all multi-label data sets, we evaluate the classification performances of the compared methods using five broadly used multi-label evaluation metrics as in Table 1, where "&#8595;" indicates "the smaller is the better", while "&#8593;" indicates "the bigger is the better". We refer readers to <ref type="bibr">[22]</ref> for detailed definitions of these evaluation metrics.</p><p>The average classification performances (mean &#177; standard deviation) of the compared methods over the 5 trials of the experiments are reported in Table <ref type="table">1</ref>, from which we can see a number of interesting observations as following. First, the proposed method is consistently better than the other four competing methods, sometimes very significantly. Second, the MIL methods are generally better than the two baseline classification methods that only use the holistic image representations. This observation is reasonable in that the two baseline methods are both single-instance classification methods, which only use the holistic image representations. As a result, the important structural information contained in image patches with semantic meanings are not exploited, which leads to inferior performance. Last, but not least, the SVMs using the raw holistic image representa-tions perform drastically worse than those using the learned image representations by our new method, i.e., the holistic image representation with multi-instance enrichments. This observation firmly confirms that our proposed method can improve the image representations in terms of image annotation. To summarize, the experimental results in Table <ref type="table">1</ref> clearly demonstrate the effectiveness of the proposed methods in multi-instance multi-label image classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>In this paper, we have presented a novel image representation learning method that is able integrates the information conveyed by both local image patches and the holistic representation of the entire image. Our new method first learns a projection to preserve both global and local consistencies of the instances of the input image in a projected subspace, then it projects the holistic representation of the entire image into the learned subspace for information enrichment. Taking into account the content and characterization variations in pictures for nature scenes and photos, we developed an objective that simultaneously maximizes and minimizes the summations of a number of &#8467; 1norm distances, which is difficult to solve in general. Thus, we derived an efficient iterative solution algorithm that is non-greedy and theoretically proved to converge. Our new method has been validated in extensive experiments to simulate the real-world applications.</p></div></body>
		</text>
</TEI>
