<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Zero Shot Learning with the Isoperimetric Loss</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10179539</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the  AAAI Conference on Artificial Intelligence</title>
<idno>2374-3468</idno>
<biblScope unit="volume">34</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Shay Deutsch</author><author>Andrea Bertozzi</author><author>Stefano Soatto</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce the isoperimetric loss as a regularization criterion for learning the map from a visual representation to a semantic embedding, to be used to transfer knowledge to unknown classes in a zero-shot learning setting. We use a pretrained deep neural network model as a visual representation of image data, a Word2Vec embedding of class labels, and linear maps between the visual and semantic embedding spaces. However, the spaces themselves are not linear, and we postulate the sample embedding to be populated by noisy samplesnear otherwise smooth manifolds. We exploit the graph structure defined by the sample points to regularize the estimates of the manifolds by inferring the graph connectivity usinga generalization of the isoperimetric inequalities from Riemannian geometry to graphs. Surprisingly, this regularization alone, paired with the simplest baseline model, outperformsthe state-of-the-art among fully automated methods in zeroshot learning benchmarks such as AwA and CUB. This improvement is achieved solely by learning the structure of theunderlying spaces by imposing regularity]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Motivating example. A pottopod is a pot with limbs. Not even a single example image of a pottopod is needed to find one in Fig. <ref type="figure">1</ref>. However, one has surely seen plenty of examples of animals with limbs, as well as pots. In zero-shot learning (ZSL) one aims to exploit models trained with supervision, together with maps to some kind of attribute or "semantic" space, to then recognize objects as belonging to classes for which no previous examples have ever been seen.</p><p>The ingredients of a ZSL method are illustrated in Fig. <ref type="figure">2</ref>. Seen samples X s and their corresponding labels Y s are used to train a model &#966;, typically a deep neural network (DNN), in a supervised fashion, to yield a vector in a highdimensional (visual embedding) space Z. At the same time, a function s maps semantic attributes such as "has legs," "is short," or simply a word embedding of the ground truth (seen and unseen) labels of interest Y = {Y s ,Y u }, to some metric space. The name of the game in ZSL is to learn a map &#968;, possibly along with other components of the diagram, from the visual embedding space Z to the semantic space S, that can serve to transfer knowledge from the seen labels (reflected in Z) to the unseen ones. Alternatively, one can try to cluster samples with unseen labels in the visual embedding space Z, and then associate clusters to unseen labels. Focus on regularization. Transfer of knowledge hinges on some kind of regularity of the maps involved in ZSL. In practice, the visual embedding space Z and the semantic embedding S are only known through discrete samples, and the maps are learned restricted to these finite samples. One crucial theme in ZSL is, interpreting sample embeddings as "noisy points" on the otherwise differentiable manifolds Z and S, to attempt to regularize the spaces Z, S, and/or the map between them. Key contribution. Of all the various components of a ZSL method, we choose the simplest possible (Romera-Paredes and Torr 2015), except for the regularization of the semantic map. There, we introduce a sophisticated model, based on an extension of the isoperimetric inequalities from Riemannian geometry to discrete structures (graphs). We treat sample visual embeddings as vertices in a graph, with affinities as edges, and the visual-to-embedding map &#968; interpreted as a linear function on the graph. We then introduce the isoperimetric loss (IPL) to enforce regularity on the domain Z based on the flow of the function defined on it through the boundary of sets of a given volume. The resulting reg- For images in the set with seen labels X s , the labels can be estimated by maximum a-posterior over labels in the seen set Y s on the visual representation &#966;(X s ). For the unseen labels, there is no direct connection to the data because they are not seen during training. Inference is then indirect: A visual representation is inferred, and from there a semantic representation, which is compared to the semantic representation of unseen labels, minimizing some distance in the semantic space, over all possible unseen labels Y u .</p><p>ularized graph is informed by both the visual and semantic maps. We use it to perform clustering and map clusters to labels. Therefore, we take a very simple visual-to-semantic embedding function, namely a linear transformation, and indirectly regularize it by regularizing its domain and range spaces.</p><p>We expected our regularization to improve the baseline (Romera-Paredes and Torr 2015) on which our ZSL model is based. We did not expect it to surpass the (far more sophisticated) state-of-the-art in the two most common benchmark datasets used in ZSL, namely AwA <ref type="bibr">(Lampert, Nickisch, and Harmeling 2009)</ref> and CUB <ref type="bibr">(Welinder et al. 2010</ref>). Yet, the experiments indicate so. In some cases, it even outperformed methods that used human annotation for the unseen labels.</p><p>At heart, we solve a topology estimation problem. We determine the connectivity between nodes of the visual embedding graph, which defines a topology in that space informed by the semantic representation of seen attributes. Much of the literature in this area focuses on what kind of graph signal (embedding, or descriptor) to attribute to the nodes, whereas the connectivity of the graph is decided a-priori. We focus on the complementary problem, which is to determine the graph connectivity and learn the graph weights. Unlike other approaches, the connectivity in our method is informed both by the value of the visual descriptors at the vertices, and the values of the semantic descriptors in the range space. Our framework allows us to use automated semantic representation to perform ZSL, resulting in a framework which is entirely free of human annotation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Preliminaries</head><p>We first describe a general formalization of ZSL that helps place our contribution in context. Every ZSL includes a supervised component, which results in a visual embedding, a collection of unseen labels or attributes, a map from these attributes to a vector (semantic) space, and a map from visual to semantic spaces. It is important to understand the assumptions underlying the transfer of information from seen to unseen attributes, which translates in regularity assumptions on the visual-to-semantic map. Supervised component. In standard supervised classification, a dataset D s is given where both the input data x and the output labels y s are seen:</p><p>where the set of seen labels, for instance 1 = "cat" and 2 = "dog," is indicated by Y s , with cardinality |Y s | = n s . The data belong to X, for instance the set of natural images. The goal of supervised learning is to infer the parameters w of a function &#966; w : X &#8594; R K + that approximates the (log)posterior probability over Y s , &#966; w (x) j log P (y s = j|x).</p><p>(</p><p>where the subscript j denotes the j-th component of the vector &#966; w (x). At test time, given an unseen image x, one can infer the unknown label &#375; associated with it as the maximum a-posteriori estimate</p><p>We indicate with Z the (latent, or representation) space where the data X are mapped,</p><p>Visual embedding. Although z i can be interpreted as logprobabilities, one can simply consider them as an element of a vector space of dimension at least K, Z &#8834; R K , called "visual embedding." It is also customary to use intermediate layers of a deep network, rather than the last one that is used for classification, as a visual embedding, so in general K = n s . A general classifier, rather than a linear one, can then be used to determine the class, based on the latent representation z. We want the formalism to be flexible, so we do not constrain the dimension of the embedding to be the same of the dimension of the seen classes, and we consider z = &#966; w (x) to be any (pre-trained) visual feature. Unseen labels. In ZSL there is a second set of "unseen"</p><p>At training time we do not have any sample images with labels in Y u . However, we do at test time.</p><p>Zero-shot learning. The goal of ZSL is to classify test images as belonging to the unseen classes. That is, to learn a map from X to Y u . Absent any assumption on how the unseen labels are related to the seen labels, ZSL makes little sense.</p><p>Assumptions. In ZSL one assumes there is a "semantic" metric vector space S &#8834; R M , to which all labels -seen and unseen -can be mapped via a function s : Y &#8594; S.I ft h e metric is Euclidean, a distance between two labels, y i ,y j , can be induced via d s (y i ,y j )= s(y i )s(y j ) . Otherwise, any other distance d(s(y i ),s(y j )) on S can be used to find the label associated to an element &#963; &#8712; S of the semantic space (embedding), for instance using a nearest-neighbor rule &#375;(&#963;)=argmin y&#8712;Y d(s(y)&#963;).</p><p>(5)</p><p>Note that the minimum could be any label, seen or unseen. This is just a metric representation of the set of labels, independent of the ZSL problem.</p><p>The second assumption is that there exists a map &#968; : Z &#8594; S from the visual to the semantic embedding, which can be learned to map embeddings of seen classes z to semantic vectors &#963; = &#968;(z) in such a way that they land close to the semantic embedding s(y s ) of the seen labels:</p><p>One could learn both the visual embedding &#966; w and the visual-to-semantic map &#968; simultaneously, or fix the former and just learn the latter. In some cases, even the latter is fixed.</p><p>Validation. The merit of any ZSL approach is usually evaluated empirically, since the assumptions cannot be validated absent samples in the unseen label class or knowledge of the transfer between the seen and unseen tasks. Once training has terminated and we have embeddings &#966; &#373; and &#968;,g i v e n test data x i , we can compare the imputed labels obtained via</p><p>with labels y u in the validation set. The construction of this loss function ( <ref type="formula">6</ref>) is illustrated in Fig. <ref type="figure">2</ref>.</p><p>Baseline. ZSL methods differ by whether they learn both the visual and semantic embedding, only one, or none; by the method and the criterion used for learning. Since the unseen labels are never seen during training, the transfer of information from seen to unseen labels hinges on the regularity of the learned maps. For this reason, much of the recent work in ZSL aims to explore different regularization methods for the learned maps. The simplest case, which requires no regularization, is to assume that all the maps are linear: &#966;(x)=Fx and s(z)=Vzfor suitable matrices F, V (Romera-Paredes and Torr 2015). The results are not state-of-the-art (see Sect.</p><p>), but we nevertheless adopt this baseline and focus on regularizing, rather than the map &#968; directly, the spaces Z and S, which is our key contribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>There are many variants for zero-shot learning (ZSL). The general formalism developed, along with the diagram in Fig. <ref type="figure">2</ref>, helps understanding the various approaches in relation to one another. The problem of zero-shot learning dates back to the early days of visual recognition when the desire to transfer knowledge from painfully learned model to more general classes emerged <ref type="bibr">(Li, Fergus, and Perona 2006;</ref><ref type="bibr">Bart and Ullman 2005)</ref>. Early modern methods consistent with our approach include <ref type="bibr">(Lampert, Nickisch, and Harmeling 2009;</ref><ref type="bibr">Norouzi et al. 2013)</ref> which can be described by a choice of fixed visual embedding &#966;, semantic embedding s, and an imposed structure of the visual-semantic map (eq. ( <ref type="formula">2</ref>) of <ref type="bibr">(Norouzi et al. 2013</ref>) in our notation)</p><p>where the sum is truncated at the T largest elements of z, so nothing is learned. A particularly simple approach, which we adopt as baseline, is (Romera-Paredes and Torr 2015), who assume that all the maps of interest are linear. In particular, they postulate</p><p>Although the map is linear, the domain and range where it is defined are not linear spaces (although their embedding space is). We adopt this choice and focus on smoothing the domain and range of the map.</p><p>Roughly speaking, zero shot learning methods can be classified into two main categories: inductive and transductive. In the inductive settings <ref type="bibr">(Zhang and Saligrama 2016;</ref><ref type="bibr">Akata et al. 2013;</ref><ref type="bibr">Lampert and Hannes Nickisch 2014;</ref><ref type="bibr">Akata et al. 2015)</ref> which has dominated zero shot learning, the unseen classes are introduced one by one and decision about each unseen instance is made instantly once it is introduced. In the transductive setting <ref type="bibr">(Li et al. 2017;</ref><ref type="bibr">Kodirov et al. 2015;</ref><ref type="bibr">Gopalan, Li, and Chellappa 2014;</ref><ref type="bibr">Changpinyo et al. 2016;</ref><ref type="bibr">Deutsch et al. 2017;</ref><ref type="bibr">Song et al. 2018</ref>) typically all unseen instances are processed simultaneously by constructing a graph where one exploits the underlying manifold structure, for example using the graph-based label propagation approach <ref type="bibr">(Gopalan, Li, and Chellappa 2014)</ref>. The problem of learning the graph Laplacian or the graph weights directly from given input data has been recently addressed in a number of works <ref type="bibr">(Kalofolias 2016;</ref><ref type="bibr">Dong et al. 2016;</ref><ref type="bibr">Egilmez, Pavez, and Ortega 2017a;</ref><ref type="bibr">Pavez and Ortega 2016;</ref><ref type="bibr">Lake and Tenenbaum 2010)</ref>. In the most general case, both the graph connectivity and the graph weights are unknown, in which case a common way to enforce smoothness is to use a regularizer which controls some level of sparsity. Perhaps the most widely used criterion is that the energy of the graph Laplacian computed from the graph signal at the vertices be small. Our approach is inspired from the isoperimetric problem, which is a classic problem in geometry. In Euclidean spaces the study of isoperimetric inequalities provides exact measures for general domains, while in Riemannian manifolds they provide some qualitative understanding of the geometry. Isoperimetric inequalities on manifolds were extended to graphs <ref type="bibr">(Chung 1997;</ref><ref type="bibr">Chung, Grigor'yan, and Yau 1999)</ref> where the analysis shares some similarities and intuition from the continuous settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Description of our approach</head><p>We select a fixed visual embedding &#966; consisting of a ResNet101 architecture trained on ImageNet using all classes, to map images x onto a 2048-dimensional embedding z = &#966;(x). We assume y i s &#8712; Y s is a subset of n s =4 0 to 150 classes depending on the dataset: In AwA (Lampert, Nickisch, and Harmeling 2009) there are 50 classes, of which we consider 40 as seen and sequester n u =1 0 as unseen. In CUB <ref type="bibr">(Welinder et al. 2010</ref>) there are 200 classes, of which we consider 150 as seen and the rest unseen. We exploit a fixed semantic map s from text attributes, namely labels, onto a vector space S = R M with dimension M = 100 (AwA) to 300 (CUB), using Word2Vec <ref type="bibr">(Mikolov et al. 2013</ref>): The map &#968; : Z &#8594; S is assumed linear, &#968;(z)=Vz, where V is an M &#215; 2048 matrix, learned as in (Romera-Paredes and Torr 2015) using the seen dataset D s . To facilitate comparison to some algorithms we also use a VGGverydeep-19 on CUB rather than ResNet101.</p><p>At test time, given data x i u with unseen labels, we compute the visual representation z i u = &#966;(x i u ) &#8712; R K and then semantic embeddings s i u = Vz i u for all i =1 ,...,N u .W e construct a graph G =( Z, W ) with vertices 2 z i u &#8712; Z and edges {w ij } = W that measure the affinities between embeddings w ij = z i u ,z j u . G is a discrete representation of the smooth manifold &#966;(X) &#8834; Z. The function &#968;, restricted to G, yields s i u , with range &#968;(Z) &#8834; S, which we also assume to be a smooth manifold. In practice, because of the finite sampling and the nuisance variability in the descriptors, both the domain and range of &#968; are far from smooth. Key idea. Rather than smoothing the map &#968; : G&#8594;S, we assume it is linear in the embedding space, and instead smooth both its domain and range. We seek a non-parametric deformation represented by changes in the connectivity matrix W of the underlying graph, that minimizes the isoperimetric loss (IPL). This is a form of regularization which we introduce in the field of ZSL. The IPL measures the flow through a closed neighborhood relative to the area of its boundary. For two-dimensional (2-D) surfaces in 3-D, it is minimized when the neighborhood is a sphere. The IPL extends this intuition to higher dimensions. Application to ZSL. The result of our regularization is a new graph G , informed by the domain, range and map of the function &#968;. We perform spectral clustering on G to obtain a set of n u = |Y u | clusters {c 1 ,...,c nu }. Each of these clusters can then be associated with a label in the unseen set Y u . We do not need to know the association explicitly to evaluate our method. However, one could align the clusters to the semantic representation of the unknown labels if so desired. 3  Results. In general, there is no "right" regularizer, so we 2 We abuse the notation to indicate with Z the visual embedding space, the range of the function &#966;(X), which we assume to be a differentiable manifold, and the vertices of a discrete graph sampled from Z.</p><p>3 For instance, by finding a transformation U that solves</p><p>If so desired, one could also add to the regularization procedure a term to align the clusters to the semantic representations of the validate our approach empirically on the two most common datasets for ZSL, namely AwA and CUB. Compared to the current best methods that do not use any manual annotation, Zero-IPL reduces errors by 3.06% on AwA1 (increased precision from 73.7% to 76.03%), and by 6.91% (increased precision from 36.9% to 39.45%) on CUB. Next, we describe the specific contribution, which is the smoothing of the graph-representation of &#968;, in detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Regularization</head><p>In this section we describe in more detail our graph smoothing based on the isoperimetric loss (IPL).</p><p>Our baseline gives us a graph G with weights w ij that we want to modify. We can think of these weights as "noisy," and seek a way to regularize them, by exploiting also the function &#968; defined on G that yields semantic embeddings. Our regularization criterion is to achieve some level of compactness of bounded subsets: For a collection of subsets of the vertices with fixed size (corresponding to the volume of a subset) we want to find the subsets with the smallest size boundary. Why this might be a good criterion rests on classical differential geometry of Riemannian manifolds, where in the most basic case, the most compact manifold that encloses a fixed area with minimum size boundary is a circle. However, tools and concepts from classical differential geometry do not translate easily to graphs. Thus, we seek a technique that uses a key invariant, the isoperimetric dimension. It is transferred to the discrete setting, and we introduce the IPL as a way to control smoothness in the graph. Our criterion, quantified by the isoperimetric gap, generalizes this bias towards compactness to more general sets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Isoperimetric loss</head><p>Let B r (&#958;) be the ball around &#958; &#8712; Z of radius r, that is the set of nodes within a distance d G less than r. Let</p><p>be the flow from i towards &#958;, that is, the sum of weights of edges connecting j with points closer to &#958;. The geodesic flows &#181;</p><p>Note that &#181;</p><p>r equals &#181;(&#8706;B r (&#958;)) -the sum of all the edges that connect vertices in B r (&#958;) and its complement in Z, where &#181; is a measure on the edges in the boundary &#8706;B r (&#958;). Next we define the isoprimetric inequality.</p><p>unseen labels:</p><p>We, however, skip this as the alignment issue is beyond our focus in this paper. In practice, we use the clustering of the regularized semantic attributes and the mapping found by using the Kuhn-Munkres algorithm, similar to <ref type="bibr">(Trigeorgis et al. )</ref>. This does not have any impact on our method.</p><p>Definition 1 <ref type="bibr">(Chung, Grigor'yan, and Yau 1999)</ref> We say that a graph G has an isoperimetric dimension &#948; with a constant c &#948; , if, for every bounded subset B r (&#958;) of Z, the number of edges between B r (&#958;) and the complement of B r (&#958;),</p><p>where &#8706;B r (&#958;) denotes the boundary of B r (&#958;).</p><p>In our notation, we have that &#8706;B r (&#958;)=&#181;</p><p>r . Next, we define the isoperimetric gap using the isoperimetric inequality above, which is the quantity to be minimized in the isoperiemtric loss:</p><p>Definition 2 The isoperimetric gap is defined as</p><p>To minimize the gap we propose solving the following optimization problem:</p><p>where f s i u ,s j u is a function of the embedding distance between s i u and s j u (Specifically, f : S &#8594; R, where S is the semantic embedding, and f is the Euclidean distance; other choices are also possible) and &#955; is a positive scalar tuning parameter. Note that the gap &#946; depends on &#948;, the isoperimetric dimension, which is unknown, and will have to be approximated.</p><p>Approximating the IPL To approximate the IPL loss, we elaborate on a spectral reduction of the isoperimetric loss, which provides a fast alternative to solving (14) directly in the vertex domain. Approximating the IPL loss using a direct method to solve (14) would entail approximating the isoperimetric dimension of the graph, which is challenging in general and even more for graphs constructed from noisy high dimensional data. Therefore, we choose to focus on a spectral reduction method. Spectral Reduction We introduce a spectral reduction method for the isoperimetric inequalities, which reduces the isoperimteric gap directly in the spectral domain by using the vertex localization of Spectral Graph Wavelets (SGW) <ref type="bibr">(Hammond, Vandergheynst, and Gribonval 2011)</ref>. Specifically, we use the Spectral Graph Wavelet Transform of the semantic embedding space s u for each of the semantic dimensions e of s u .</p><p>Let s i u (e) be a component of s i u in a fixed dimension e Step 2 Assign semantic attributes s i u to its corresponding node i.</p><p>Step 1: Construct the unnormalized Laplacian L = D -W using G =(Z, W ). Take the SGW transform (Eq. 15) for each dimension i of of s u .</p><p>Step 2 Apply regularization using the spectral reduction of the IPL loss.</p><p>Step 3 Construct a new graph using the regularized &#349;u . Output: A new Semantic embedding graph G =(S, W s ) a filtered signal of a fixed dimension e of s i u , where &#966; l is the corresponding eigenvector of the unnormalized Laplacian L associated with eigenvalue &#955; l . The coefficients a k are constants of a polynomial function and for a specific choice correspond to spectral graph wavelet (SGW) coefficients. Note that the terms r k=0 a k (L k ) i,j can be interpreted as the localized spectral transform of the graph around the ball B r (z i u (i)), which vanishes for all z j u / &#8712; B r (z i u ). With the SGW transform, we employ a redundant representation with r polynomials &#954; t(k) (&#955;), 1 &#8804; k &#8804; r, each is approximating a kernel function localized in a frequency bands with corresponding scaling t(k). Let &#948; i be the impulse unit vector for the vertex i. Fixing a scale t(k), the SGW coefficients &#968; t(k) i can be realized by &#968;</p><p>i,j indicate that j is informative about i, where small negative or zero values indicate insignificant influence with respect to the scale t(k). Next choose the smallest r 0 , 1 &lt;r 0 &lt;rwhere we have &#968; t(r0) i,j &#8804; 0 for all j, with the corresponding polynomial &#954; t(r0) (&#955;). Then, for all SGW coefficients s i u (e) in bands k &#8805; r 0 , we annihilate all terms s i u (e), which has the effect of shrinking the boundaries of each ball around each vertex i and thus reducing the isoperimetric loss directly in the spectral domain. Take the inverse transform to obtain the denoised signal and construct the new graph from the regularized semantic embedding space &#349;u .</p><p>The algorithm is summarized in pseudo code in Algorithm 1.</p><p>Clustering and validation in ZSL We employ a standard procedure for spectral clustering as follows: the input is the graph G =( S, W s ) obtained from applying the IPL algorithm, and the number of clusters, n u . Construct the Laplacian matrix L s using W s and compute the first n u eigenvectors of L s . Letting &#934; &#8712; R Nu&#215;nu correspond to the first n u eigenvectors of L s stacked in a matrix form, we cluster the vectors &#934; i &#8712; R nu ,i =1 , ..N u corresponding to the rows of &#934; into n u classes, using the k-means algorithm. Once n u clusters are found, we can associate each of them to a different unseen label. While it is not required that the semantic embedding of the unseen labels s(y u ) correspond to the clusters in the same space, mapped from the visual embeddings, this alignment can be performed post-hoc. For the purpose of comparison, however, it is sufficient to perform the assignment by searching over permutations of the unknown labels. Since we have at most 50 unseen labels in our experiments, this is not a bottleneck. More in general, one may consider introducing the alignment as part of the regularization, but this is beyond our scope in this paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Results</head><p>Experimental Settings. In the first set of experiments, we restrict our comparisons to approaches that are fully automated beyond the definition of the visual embedding (best performance is marked in boldface). In addition, we also report the evaluation of the state of the art methods that have access to embeddings of ground-truth semantic attributes.</p><p>Using our approach we choose each component of our ZSL pipeline to be the simplest possible one, corresponding to the baseline (Romera-Paredes and Torr 2015). A sanity check is whether our proposed regularization scheme improves over this baseline. Ideally, however, our method would take the baseline beyond the state-of-the-art.</p><p>To test this hypothesis, we use the two most common benchmarks for ZSL, AwA and CUB. AwA (Animals with Attributes) consists of 30,745 images of 50 classes of animals and has a source/target split of 40 and 10 classes, respectively. In addition we test on the new released dataset AWA2 which consists of 37,322 images of 50 classes which is an extension of AwA (which will be refereed from now and on AwA1). AwA2 also has source/target of 40 and 10 classes respectively with a number of 7913 unseen testing classes. We used the proposed new splits for AwA1 and AwA2 <ref type="bibr">(Xian, Schiele, and Akata 2018)</ref>.</p><p>The CUB dataset contains 200 different bird classes, with 11,788 images in total. We use the standard split <ref type="bibr">(Changpinyo et al. 2016)</ref> with 150 classes for training and 50 disjoint classes for testing <ref type="bibr">(Xian, Schiele, and Akata 2018)</ref> which is employed in most automated based methods we compare to, while <ref type="bibr">(Xian, Schiele, and Akata 2018</ref>) also suggested a new split for the CUB dataset. Note that the CUB dataset is considered fine-grained, hence more challenging with both of the input features (visual and semantic) being very noisy. We present the evaluations in Tables 1, 3 and 4 using methods which are either representative or competitive for ZSL using automated attributes including <ref type="bibr">(Deutsch et al. 2017;</ref><ref type="bibr">Kodirov et al. 2015;</ref><ref type="bibr">Xian et al. 2016;</ref><ref type="bibr">Frome et al. 2013;</ref><ref type="bibr">Rahman, Khan, and Porikli 2018;</ref><ref type="bibr"/> Method/Data AwA1AwA2 EZSL (Romera-Paredes and Torr 2015)</p><p>58.2 58.6 SJE <ref type="bibr">(Akata et al. 2015)</ref> 65.6 61.9 ALE <ref type="bibr">(Akata et al. 2016b)</ref> 59.9 62.5 LatEm <ref type="bibr">(Xian et al. 2016)</ref> 50.8 -DEVISE <ref type="bibr">(Frome et al. 2013)</ref> 50.4 -SynC <ref type="bibr">(Changpinyo et al. 2016)</ref> 58.6 -CAPD (Rahman, Khan, and Porikli 2018) 64.73 -Kernel ZSL <ref type="bibr">(Zhang and Koniusz 2018)</ref> 71.0 70.51 DEM <ref type="bibr">(Zhang, Xiang, and Gong 2017)</ref> 68.4 67.1 RELATION NET <ref type="bibr">(Sung et al. 2018)</ref> 68.2 64.2 QFSL <ref type="bibr">(Song et</ref>   <ref type="bibr">Changpinyo et al. 2016)</ref> as well as ones that used human annotation <ref type="bibr">(Zhang and Koniusz 2018;</ref><ref type="bibr">Song et al. 2018;</ref><ref type="bibr">Zhang, Xiang, and Gong 2017;</ref><ref type="bibr">Sung et al. 2018</ref>) for a more general overview.</p><p>Implementation details: For all the splits of AwA and CUB datasets, we fix k =1 5 , r =3 , and k =8 , r =3 for the k nearest neighbor graph parameter and radius r of the ball around each point, respectively. The edges w ij are chosen using the cosine similarity between the visual observations.</p><p>Experimental results on the AwA1 and AwA2 datasets using the new proposed splits <ref type="bibr">(Xian, Schiele, and Akata 2018)</ref> are shown in Table <ref type="table">1</ref>. Note that the new proposed AWA2 data-set is more challenging, as evident from the significant drop in performance compare to AwA for most of the state of the art methods. We also compare to state of the art methods which are employing human attributes (85 dimensional attribute vectors provided for each class in (Lampert, Nickisch, and Harmeling 2009)). A "-" indicates that the performance of the method was not reported in the lit- erature for the corresponding dataset. Mean average precision of the baseline is 58.6%. We improve it to 76.03% and 73.46% on the AwA1 and AwA2 datasets, respectively, by using our regularizer, taking the baseline past the state of the art, which is 73.7 % and 72% using <ref type="bibr">(Deutsch et al. 2017)</ref> on AwA1 and AwA2 respectively, reducing the error by 3.06 percentage points on AwA1. Note that among the most competitive state of the art methods which is also using automated attributes, <ref type="bibr">(Deutsch et al. 2017</ref>) is using a much more complex and computationally heavy method. Furthermore, for both AwA1 and AwA2, our method outperforms the state of the art methods which are using human attributes. Fig. <ref type="figure">3</ref> shows a comparison between the t-sne embedding of the noisy embedded semantic representation and the regularized semantic representation using IPL.</p><p>Experimental results on the CUB dataset is the next benchmark we consider. The baseline achieves a disappointing 23.8% precision on CUB. Surprisingly, our regularizer takes it past the state-of-the-art automatic method (36.9%), to 39.45%, corresponding to an error decrease of over 6.9%. The experimental results comparison on the CUB dataset is shown in Table <ref type="table">3</ref>. Influence of the k nearest neighbor graph parameter We also provide additional experiments which test the influence of the k nearest neighbor graph parameter. Changing the k nearest neighbor graph parameter by 50% for the AWA1, AWA2, and CUB datasets, results in a performance drop of less than 2.6, 8.77, and 2.02%, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Comparison to Learning Graph Methods</head><p>In addition to direct comparison to ZSL methods, since our approach uses a graph-based smoothing approach, one may wonder whether applying state-of-the-art graph learning methods one might also improve performance of ZSL.</p><p>We focused on the state-or-the-art graph learning method <ref type="bibr">(Egilmez, Pavez, and Ortega 2017b)</ref> and also <ref type="bibr">(Kalofolias 2016)</ref> as representative of graph based learning. We use the same method and protocol to arrive at a graph, but replace our approach with (Egilmez, Pavez, and Ortega 2017b) <ref type="bibr">(Kalofolias 2016)</ref> for evaluation in AwA1, AwA2, and CUB. This experiment is comparable to the one reported in Table 1. While performance in CUB is comparable, our approach outperforms <ref type="bibr">(Egilmez, Pavez, and Ortega 2017b;</ref><ref type="bibr">Kalofolias 2016</ref>) on the AwA benchmarks.</p><p>Experimental results on generalized ZSL (GZSL) We also compare our performance in the generalized zero shot learning setting. We follow the standard protocol of the generalized ZSL (GZSL) (Schiele and Akata 2017) settings where the search space at evaluation time includes both the target and the source classes while the evaluation metric use the harmonic mean between while source and test data as the evaluation metrics. Thus, letting Acc s ,Acc t the mean class accuracy achieved for the source and target classes, respectively, the harmonic mean H is given by:</p><p>The settings of the GZSL is more challenging, as can be Method/Data CUB EZSL (Romera-Paredes and Torr 2015)</p><p>23.8 SJE <ref type="bibr">(Akata et al. 2015)</ref> 28.4 LatEm <ref type="bibr">(Xian et al. 2016)</ref> 33.1 Less Is more <ref type="bibr">(Qiao et al. 2016)</ref> 29.2 ALE <ref type="bibr">(Akata et al. 2016b)</ref> 54.9 Kernel ZSL <ref type="bibr">(Zhang and Koniusz 2018)</ref> 57.1 DEM <ref type="bibr">(Zhang, Xiang, and Gong 2017)</ref> 51.7 RELATION NET <ref type="bibr">(Sung et al. 2018)</ref> 55.6 QFSL <ref type="bibr">(Song et al. 2018)</ref> 72.1 LisGAN <ref type="bibr">(Li et al. 2019)</ref> 58.8 GMN <ref type="bibr">(Sariyildiz and Cinbis 2019)</ref> -CAPD <ref type="bibr">(Rahman, Khan, and Porikli 2018)</ref> 32.08 Multi-Cue ZSL <ref type="bibr">(Akata et al. 2016a)</ref> 32.1 DMaP <ref type="bibr">(Li et al. 2017)</ref> 30.34 MSMR <ref type="bibr">(Deutsch et al. 2017)</ref> 36.9 Proposed 39.45</p><p>Table <ref type="table">3</ref>: Mean average precision accuracy (top-1 in %) results using our method compared to the state of the art methods in zero shot learning on the CUB dataset using Word2Vec or other automated semantic representation. Methods using automated semantic representation are marked in boldface. The evaluation for the state of the art methods which are using human semantic annotation is also presented.</p><p>seen in the evaluation comparison (Table <ref type="table">4</ref>, for most methods the performance degrades significantly in comparison to the standard ZSL. We compare to the recent automated methods tested in the GZSL settings on the AwA1, AwA2 and CUB datasets (those which are available on GZSL and can scale to generalized ZSL settings). The experimental results summarized in Table <ref type="table">4</ref> show not only improvement over method using automatic attributes (error decrease of over 9.8% and 44% for the AWA1 and CUB datasets), but is also outperforming many of the recent state of the art methods which are using human annotation.</p><p>Computational complexity: The IPL with spectral reduction has a computational cost of O(N u K log(N u )), which includes fast computation of the k nearest neighbor graph using kd tree, the SGW transform which is O(N u ) for each dimension of the manifold for sparse graphs <ref type="bibr">(Hammond, Vandergheynst, and Gribonval 2011)</ref>, thus total complexity of O(N u K log(N u )) for N u samples in K dimensional space. Limitations includes processing very large graphs which consists more than millions of points. The execution time of our code implementation using Intel Core i7 7700 Quad-Core 3.6GHz with 64B Memory on the AWA1 dataset with 5685 points using k =1 0nearest neighbor graph takes &#8776; 21.9 seconds for the initialization of the visual-semantic embedding space, and &#8776; 44.8 seconds for our IPL regularization.</p><p>Method/Data AwA1AwA2CUB EZSL <ref type="bibr">(Romera-Paredes and Torr 2015)</ref> 12.1 11.0 21.0 SJE <ref type="bibr">(Akata et al. 2015)</ref> 19.6 14.4 33.6 LatEm <ref type="bibr">(Xian et al. 2016)</ref> 13.3 20.0 24.0 ALE <ref type="bibr">(Akata et al. 2016b)</ref> 27.5 23.9 34.4 DEVISE <ref type="bibr">(Frome et al. 2013)</ref> 22.4 27.8 32.8 SynC <ref type="bibr">(Changpinyo et al. 2016)</ref> 16.2 18.0 19.8 DMaP <ref type="bibr">(Li et al. 2017)</ref> 6.44 -2.07 CAPD <ref type="bibr">(Rahman, Khan, and Porikli 2018) 43.70 -31.6</ref> Kernel ZSL <ref type="bibr">(Zhang and Koniusz 2018)</ref> 29.8 30.8 35.1 DEM <ref type="bibr">(Zhang, Xiang, and Gong 2017)</ref> 47.3 45.1 13.6 RELATION NET <ref type="bibr">(Sung et al. 2018)</ref> 46.7 45.3 47.0 LisGAN <ref type="bibr">(Li et al. 2019)</ref> 62.3 -51.6 GMN <ref type="bibr">(Sariyildiz and Cinbis 2019)</ref> 74.8 -65.0 QFSL <ref type="bibr">(Song et</ref>  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head><p>We have introduced the use of isoperimetric inequalities, known for centuries, into clustering in general, and zero-shot learning in particular. We use the isoperimetric loss to indirectly regularize a learned map from visual representations of data to their semantic embedding. Regularization is done by representing the domain of the map as a graph, the map as a graph signal, and regularizing the graph, obtaining another "denoised" graph where clustering is performed to reveal the unseen labels, once cluster-to-label association is performed. This regularization appears to be so effective as to take the simplest possible ZSL approach, where all maps are assumed linear, and improve it to beyond the current stateof-the-art for fully automatic ZSL approaches. Typical failure modes of our regularization and clustering algorithms are when the compactness values of the geodesic flows lie within a very wide range of different intervals corresponding to the bounded sets of the different input classes, which will result in incorrect estimation and detection of the geodesic flows and therefore ineffective regularization. Since our model is general, it could be used in conjunction with more sophisticated ZSL components, including those where the various maps are not linear, and learned jointly with regularization.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>'A misnomer, since one knows ahead of time what these labels are, for instance 3 = "sailboat", 4 = "car." However, one is not given images with those labels during training. So, while the labels are seen, image samples with those labels are not seen during training.</p></note>
		</body>
		</text>
</TEI>
