<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Representing Joint Hierarchies with Box Embeddings</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10188900</idno>
					<idno type="doi"></idno>
					<title level='j'>Automated Knowledge Base Construction</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Shib Sankar Dasgupta</author>
					<author>Dhruvesh Patel</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Learning representations for hierarchical and multi-relational knowledge has emerged as an active area of research. Box Embeddings [Vilnis et al., 2018, Li et al., 2019] represent concepts with hyperrectangles in n-dimensional space and have been shown to be capable of modeling tree-like structures efficiently by training on a large subset of the transitive closure of the WordNet hypernym graph. In this work, we evaluate the capability of box embeddings to learn the transitive closure of a tree-like hierarchical relation graph using far fewer edges from the transitive closure. Box embeddings are not restricted to tree-like structures, however, and we demonstrate this by modeling the WordNet meronym graph, where nodes may have multiple parents. We further propose a method for modeling multiple relations jointly in a single embedding space using box embeddings. In all cases, our proposed method outperforms or is at par with all other embedding methods.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Hierarchical relations, particularly hypernymy, are inherently present in natural language and useful in many common tasks. For example, in SNLI the sentence "A soccer game with multiple males playing" can be understood to entail "Some men are playing a sport" via the hypernym relation between "soccer" and "sport". Question answering and language modeling, in general, can benefit from understanding these relationships, as a word can often be replaced by its hypernym and still yield a valid sentence or provide supporting evidence to answer a question. Vector embedding methods, where an entity is associated with a vector in R^n, often use symmetric measures such as the dot product between these representations, which cannot capture inherently asymmetric relations such as hypernymy.</p><p>In order to solve this problem, <ref type="bibr">[Vendrov et al., 2016]</ref> introduced an asymmetric measure on vectors in the form of entailment cones in R^n_+. <ref type="bibr">[Vilnis et al., 2018</ref><ref type="bibr">, Li et al., 2019</ref>] generalized this notion to propose box embeddings, which are hyperrectangles in R^n. We build upon this work by proposing a new regularization loss for the box embedding model. This provides a weak scale on the embedding space, resulting in a significant performance improvement.</p><p>In this work, we explore the extent to which edges from the transitive closure are necessary to model large tree-like graphs, e.g., WordNet's hypernym graph. We further explore the meronymy relation from WordNet, which differs qualitatively from the hypernymy graph in that it is not as tree-like, being composed of many connected components where some nodes have multiple parents while others have no parents.
We not only achieve state-of-the-art performance with our box embedding based method over other baselines when 10, 25, and 50% of the transitive closure edges are supplied during training but, most importantly, we also outperform other methods by a significant margin when training is restricted to the transitive reduction only (0% transitive closure edges).</p><p>Finally, we introduce a method of modeling these two relations in the same space by introducing two representations for each entity. Our method is not specific to box embeddings, but rather can be thought of as a graph augmentation procedure which is general enough to allow for modeling with all existing baselines. In this multi-relational embedding setting, we also compare against knowledge base completion baselines and demonstrate that box embeddings can outperform TransE <ref type="bibr">[Bordes et al., 2013]</ref> and ComplEx <ref type="bibr">[Trouillon et al., 2016]</ref> in this setting as well*.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Using distributed representations to capture semantic relationships amongst words and concepts has a long history <ref type="bibr">[Mikolov et al., 2013</ref><ref type="bibr">, Pennington et al., 2014]</ref>. However, due to the use of a symmetric distance metric, these models are incapable of representing asymmetric relations such as hypernymy. Recent works that address this issue can be broadly categorized into two approaches: (1) using point representations combined with an asymmetric distance measure, and (2) using region-based representations with a probabilistic interpretation.</p><p>While <ref type="bibr">Nickel and Kiela [2017]</ref>, <ref type="bibr">Chamberlain et al. [2017]</ref> use points in n-dimensional hyperbolic space (the Poincar&#233; ball) to model tree-structured data, <ref type="bibr">Vendrov et al. [2016]</ref> use the asymmetric measure of the reverse product order on points in Euclidean space to perform the same task. Using the reverse product order amounts to associating with each entity a Euclidean cone with apex x, which is the set {y : y_i &#8805; x_i for all i &#8712; {1, . . . , n}}. <ref type="bibr">Ganea et al. [2018]</ref> extend the hyperbolic point embeddings to entailment cones in hyperbolic space.</p><p>Another line of work uses distributions over points in space to learn representations for hierarchical data <ref type="bibr">[Vilnis and</ref><ref type="bibr">McCallum, 2015, Muzellec and</ref><ref type="bibr">Cuturi, 2018]</ref>. <ref type="bibr">Lai and Hockenmaier [2017]</ref> provide a probabilistic interpretation of order embeddings <ref type="bibr">[Vendrov et al., 2016]</ref> by using the negative exponential measure or, equivalently, representing concepts as cones in [0, 1]^n. With this parameterization, the volume of a cone can be interpreted as the marginal (resp. joint) probability of the entity (resp. intersection) which it represents.
This allowed for training and evaluation using conditional probabilities, which map very naturally onto hypernym relations, e.g., P(mammal|dog) = 1. However, representing entities as probabilistic cones has a serious deficit: it cannot represent negative correlation, that is, the case when P(X|Y) &lt; P(X). Box embeddings, introduced in <ref type="bibr">Vilnis et al. [2018]</ref>, solve this issue by introducing another vector, effectively associating each entity with an n-dimensional hyperrectangle. Their method has been shown to effectively model hypernymy by training on a large subset of the transitive closure of the WordNet hypernym graph. In this work, we propose a regularization loss to further improve the performance of box embeddings.</p><p>We also present a modeling technique that extends the previously discussed methods to jointly model multiple hierarchical relations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>We first present box embeddings as they were introduced in <ref type="bibr">[Vilnis et al., 2018]</ref>, briefly touch on the "soft" volume calculation presented in <ref type="bibr">[Li et al., 2019]</ref>, and then discuss some learning adaptations present in our version of the model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Representation</head><p>The box-space <ref type="bibr">[Vilnis et al., 2018]</ref> is the set of all n-dimensional hyperrectangles. Formally, if B = {[a, b] : a, b &#8712; R, b &#8805; a} is the set of all closed intervals of R, then the n-dimensional box-space is B^n. Just as in vector embedding models, where each entity is represented by a vector in Euclidean space, in box-space each entity e is represented by a box &#945;_e &#8712; B^n as shown in Figure <ref type="figure">1</ref>. We parametrize a box using the minimum and maximum coordinates in each dimension, (&#945;_m, &#945;_M), where &#945;_m, &#945;_M &#8712; R^n and &#945;_{m,i} &#8804; &#945;_{M,i} for each i &#8712; {1, . . . , n}. Box embeddings provide a natural way to embed a partial order using containment. </p></div>
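To make the containment-based partial order concrete, here is a minimal NumPy sketch; the entity names and coordinates are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def make_box(minimum, maximum):
    """A box is a pair (min corner, max corner) with min <= max per dimension."""
    m, M = np.asarray(minimum, dtype=float), np.asarray(maximum, dtype=float)
    assert np.all(m <= M), "min coordinate must not exceed max in any dimension"
    return m, M

def contains(outer, inner):
    """The partial order: outer contains inner in every dimension."""
    (om, oM), (im, iM) = outer, inner
    return bool(np.all(om <= im) and np.all(iM <= oM))

person = make_box([0.0, 0.0], [0.8, 0.9])
man = make_box([0.1, 0.2], [0.5, 0.6])
print(contains(person, man))  # True: containment encodes Man IsA Person
```

Note that containment is asymmetric by construction, which is exactly the property the symmetric dot product lacks.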
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Probabilistic Interpretation</head><p>In addition to the natural containment operation, we can consider the volume of boxes. If we normalize the space to have volume 1 (by dividing by the smallest containing box, say) the boxes can be interpreted as a parameterization of a joint probability distribution over binary random variables, where the marginal probability of a given entity is given by the volume of its box. Intersections of boxes yield either another box or the empty set, and in either case we can compute the volume of the intersection as</p><p>Vol(&#945; &#8745; &#946;) = &#8719;_{i=1}^{n} max(0, min(&#945;_{M,i}, &#946;_{M,i}) &#8722; max(&#945;_{m,i}, &#946;_{m,i})). (1)</p><p>We can interpret this volume as the joint probability for concepts &#945; and &#946;, and furthermore calculate (for example)&#8224;</p><p>Pr(&#946;|&#945;) = Vol(&#945; &#8745; &#946;) / Vol(&#945;). (2)</p><p><ref type="bibr">Vilnis et al. [2018]</ref> use these conditional probabilities to model the hypernym graph of WordNet. For example, given an edge from Man to Person (indicating that a man is a person) we first convert this to a conditional probability between binary random variables, Pr(Person|Man) = 1, which is then trained using binary cross-entropy loss and stochastic gradient descent. Figure <ref type="figure">2</ref> shows the corresponding boxes capturing this edge. </p></div>
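The hard-box volume and conditional probability computations described above can be sketched as follows; the coordinates are illustrative:

```python
import numpy as np

def volume(box):
    """Hard volume: product of non-negative side lengths."""
    m, M = box
    return float(np.prod(np.maximum(0.0, M - m)))

def intersection(a, b):
    """Intersection of two boxes (possibly empty, giving zero volume)."""
    (am, aM), (bm, bM) = a, b
    return np.maximum(am, bm), np.minimum(aM, bM)

def cond_prob(b, a):
    """P(b | a) = Vol(a ∩ b) / Vol(a); no global normalization is needed."""
    return volume(intersection(a, b)) / volume(a)

man = (np.array([0.1, 0.2]), np.array([0.5, 0.6]))
person = (np.array([0.0, 0.0]), np.array([0.8, 0.9]))
print(cond_prob(person, man))  # 1.0, i.e. Pr(Person | Man) = 1
```

Because the Man box lies entirely inside the Person box, the intersection is the Man box itself and the conditional probability is exactly 1.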
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Softplus Volume</head><p>Box embeddings that should overlap may become disjoint, either due to initialization or during training, and the model as described so far has no training signal in this instance. <ref type="bibr">Vilnis et al. [2018]</ref> made use of a surrogate loss function to handle this case; however, <ref type="bibr">Li et al. [2019]</ref> introduced a new method which replaces the ReLU from (1) with the softplus function, softplus_t(x) = t log(1 + e^{x/t}), which smooths the loss landscape and results in easier training. The volume calculation under this approximation becomes</p><p>Vol(&#945; &#8745; &#946;) = &#8719;_{i=1}^{n} softplus_t(min(&#945;_{M,i}, &#946;_{M,i}) &#8722; max(&#945;_{m,i}, &#946;_{m,i})). (3)</p><p>We note that (3) is always positive, which means all boxes "intersect" one another under this volume calculation. The temperature parameter t can be tuned, and as t &#8594; 0 we recover the original box model. We will exclusively use this soft volume calculation when computing box volumes, and in our experiments we always select the temperature based on dev set performance.</p></div>
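The softened volume replaces the hard max(0, ·) side length with softplus_t; a small sketch (the temperature values here are illustrative, not tuned values from the paper):

```python
import numpy as np

def softplus(x, t):
    """softplus_t(x) = t * log(1 + exp(x / t))."""
    return t * np.log1p(np.exp(np.asarray(x, dtype=float) / t))

def soft_intersection_volume(a, b, t=0.2):
    """Soft volume: hard max(0, .) side lengths replaced by softplus_t,
    so the result is strictly positive even for disjoint boxes."""
    (am, aM), (bm, bM) = a, b
    sides = np.minimum(aM, bM) - np.maximum(am, bm)
    return float(np.prod(softplus(sides, t)))

man = (np.array([0.1, 0.2]), np.array([0.5, 0.6]))
far = (np.array([2.0, 2.0]), np.array([3.0, 3.0]))  # disjoint from man
print(soft_intersection_volume(man, far) > 0.0)     # True: gradient signal survives
print(soft_intersection_volume(man, man, t=0.01))   # close to the hard volume 0.16
```

The second print illustrates the limiting behaviour: as t shrinks, the soft volume of a box with itself approaches its hard volume.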
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Absence of explicit probability labels and use of regularization</head><p>Previous work on box embeddings <ref type="bibr">[Vilnis et al., 2018]</ref> supplied marginal probabilities as an additional training signal to the model. These marginal probabilities were calculated using the graph structure: leaves were given marginal probability equal to 1/#nodes, and parents were assigned marginal probability equal to the sum of their children plus 1/#nodes. How "tight" the parent boxes are to their children can be seen as a form of regularization. In general, setting marginal probabilities can help to set a scale for the overall model. For our model, however, we avoid pre-computing marginal probabilities and simply use the conditional probability targets with binary cross-entropy loss, augmented with an additional regularization loss which can be written as</p><p>L = &#8722;(1/N) &#8721;_{i=1}^{N} [y_i log p_i + (1 &#8722; y_i) log(1 &#8722; p_i)] + (&#955;/N_e) &#8721;_{e=1}^{N_e} I[Vol(&#945;_e) &gt; v] Vol(&#945;_e), (4)</p><p>N being the total number of examples (including the random negatives), N_e the total number of entities, y_i the label, p_i the conditional probability calculated using equation (2), I the indicator function, &#955; the regularization weight, and v the fixed volume threshold.</p><p>Without any global scale, i.e., the marginal probabilities, we get poor training results, so we introduce this regularization measure by penalizing the size of boxes when they become greater than a fixed volume. This provides a weak scale for the boxes in the embedding space and results in a significant performance improvement over unregularized training, without the need to precompute unary marginals.</p></div>
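The training objective described above can be sketched as follows; the volume threshold and regularization weight are our illustrative hyperparameter choices, not values from the paper:

```python
import numpy as np

def loss(y, p, box_volumes, vol_threshold=0.5, reg_weight=0.1):
    """Binary cross-entropy on conditional-probability targets, plus a
    regularizer penalizing entity boxes larger than a fixed volume."""
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    eps = 1e-12  # numerical guard for log
    bce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    v = np.asarray(box_volumes, dtype=float)
    # indicator-gated penalty: only boxes exceeding the threshold are penalized
    reg = reg_weight * np.mean(np.where(v > vol_threshold, v, 0.0))
    return float(bce + reg)

print(loss(y=[1, 0], p=[0.9, 0.2], box_volumes=[0.3, 0.9]))
```

Only boxes that grow past the threshold contribute to the penalty, which is what gives the space a weak overall scale without precomputed marginals.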
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Modeling joint hierarchy</head><p>Inter-relational semantics implicit in multi-relational knowledge graphs can be used to learn coherent representations for the entities present in the graph. In particular, the relations discussed in this work, i.e., the IsA and HasPart relations, should obey the following rules:</p><p>X IsA Y and Y HasPart Z implies X HasPart Z, (5)</p><p>X HasPart Y and Y IsA Z implies X HasPart Z. (6)</p><p>For instance, every Man IsA Person and every Person HasPart Eye. Hence, every Man HasPart Eye. Similarly, Eye IsA Organ and Person HasPart Eye implies Person HasPart Organ. Figure <ref type="figure">4</ref> shows the subgraph describing this example. The red edges and the dotted red edges are implied HasPart edges which might be missing in the original knowledge graph. We show that boxes can model these two relations in a single space by using two boxes per node while preserving the semantics described above.</p><p>For each node x we embed two boxes, one which represents "Things which are x" and one which represents "Things which have x as a part". This is depicted in Figure <ref type="figure">3</ref>, where the entity "Man" is represented by the green Man box and the red HasPart-Man box. The edges 3, 4, 5, and 7 in Figure <ref type="figure">4</ref> are then captured as containments between these two kinds of boxes.</p><p>This form of modeling multiple relations using augmentation is not limited to the box embedding model. The main idea is to create a new graph with twice as many nodes and edges from the IsA hierarchy, with additional PartOf edges so as to make the semantics coherent. Any model capable of modeling a DAG can be evaluated on its ability to model this graph.</p></div>
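The augmentation idea above can be sketched as a plain graph transformation. Here an edge (child, parent) means the child's box should be contained in the parent's box, and the twin node stands for "things which have x as a part"; the function and node naming are ours, not the paper's:

```python
def augment(isa_edges, haspart_edges):
    """Build a single containment graph encoding both relations."""
    twin = lambda n: ("haspart", n)
    edges = set()
    for child, parent in isa_edges:
        edges.add((child, parent))              # e.g. Man inside Person
        # having a child as a part implies having the parent as a part,
        # so the twin nodes inherit the IsA edges in the same direction
        edges.add((twin(child), twin(parent)))
    for whole, part in haspart_edges:
        # e.g. Person inside "things which have an eye as a part"
        edges.add((whole, twin(part)))
    return edges

g = augment(isa_edges={("man", "person"), ("eye", "organ")},
            haspart_edges={("person", "eye")})
# transitivity of containment now implies Man HasPart Eye
# via man ⊑ person ⊑ haspart(eye), and Person HasPart Organ
# via person ⊑ haspart(eye) ⊑ haspart(organ)
```

Any model of a DAG can then be trained on the edge set returned by this procedure, which is why the augmentation is not specific to boxes.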
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head><p>In order to demonstrate the efficacy of the box embeddings model on hierarchically structured data, we evaluate on the hypernymy and meronymy relations of WordNet. Following the recent works of <ref type="bibr">[Nickel and</ref><ref type="bibr">Kiela, 2017, Ganea et al., 2018]</ref>, we train on the transitive reduction, adding increasing amounts of edges from the transitive closure of those lexical relations. We pose the problem as a binary classification task on the unseen test edges, which include the remaining edges from the transitive closure along with a fixed set of random negatives for the respective relations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Datasets</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Hypernymy (IsA)</head><p>For hypernymy, we used the dataset from <ref type="bibr">[Ganea et al., 2018]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Meronymy (HasPart)</head><p>We created a meronymy dataset in the exact same way. Out of the 82,114 nodes in WordNet, only 11,235 have meronym relations associated with them. We note that this graph is not nearly as tree-like: there are 2,083 connected components, 2,860 nodes with no parents, and 1,013 nodes with multiple parents. Again, following <ref type="bibr">[Ganea et al., 2018]</ref>, we start with the transitive reduction and add 0%, 10%, 25%, and 50% of the transitive closure to the training data in order to observe the corresponding improvement in the performance of the model. Dev and test datasets were created in the same way as for hypernymy. Please refer to Table <ref type="table">1</ref> for the details of the dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Joint Hierarchy (IsA and HasPart)</head><p>For the joint hierarchy, we start with the transitive reduction of meronymy and hypernymy and create a new graph as described in Section 3.5. Specifically, we add the implied HasPart edges (shown as the red arrows in Figure <ref type="figure">4</ref>). The test data comprises a subset of the implied edges (shown as dotted red in Figure <ref type="figure">4</ref>) which are of the form</p><p>X_m HasPart Y_n, where X HasPart Y is a training edge, X_m is m IsA-steps below X, and Y_n is n IsA-steps above Y. (7)</p><p>The dev and test sets created in this manner have 94,807 and 94,806 positive edges respectively, which include all HasPart edges as described in ( <ref type="formula">7</ref>) for all m, n &#8712; {1, 2}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Baselines</head><p>We compared our model with the following recent strong baseline models:</p><p>1. Order Embeddings <ref type="bibr">[Vendrov et al., 2016]</ref>: As mentioned in the introduction, this method embeds entities as cones in R^d_+. Representationally, order embeddings are equivalent to box embeddings where the min coordinate is fixed to the origin; learning differs, however, in that the objective function optimizes an order-violation penalty.</p><p>(See <ref type="bibr">[Vendrov et al., 2016]</ref> for more details.)</p><p>2. Poincar&#233; Embeddings <ref type="bibr">[Nickel and Kiela, 2017]</ref>: In this work, the authors learn embeddings in an n-dimensional Poincar&#233; ball, which naturally captures tree-like structure via the constant negative curvature of hyperbolic geometry. However, this work uses a symmetric distance function, which is not ideal for asymmetric relational data. We use the same heuristics as <ref type="bibr">[Ganea et al., 2018]</ref> to mitigate this problem.</p><p>3. Hyperbolic Entailment Cones <ref type="bibr">[Ganea et al., 2018]</ref>: This work models hierarchical relations as partial orders defined using a family of nested geodesically convex cones in hyperbolic space.</p><p>4. TransE and ComplEx <ref type="bibr">[Bordes et al., 2013</ref><ref type="bibr">, Trouillon et al., 2016]</ref>: The joint hierarchy can also be considered as multi-relational data with two relations. We use the most popular translational distance-based and matrix factorization-based models, TransE and ComplEx respectively.</p><p>In order to show the effectiveness of the proposed regularization, we include the performance of the vanilla box embeddings <ref type="bibr">[Li et al., 2019]</ref> as well.</p></div>
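For reference, the order-violation penalty optimized by order embeddings can be sketched as follows; this is our paraphrase of the objective in Vendrov et al. [2016], with illustrative vectors:

```python
import numpy as np

def order_violation(u, v):
    """Penalty for an ordered pair (u, v) where u should entail v under the
    reversed product order; zero exactly when v_i <= u_i in every dimension."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.sum(np.maximum(0.0, v - u) ** 2))

print(order_violation([2.0, 3.0], [1.0, 1.0]))  # 0.0: the pair is correctly ordered
print(order_violation([1.0, 1.0], [2.0, 3.0]))  # 5.0: squared violation magnitude
```

The penalty is asymmetric, which is what lets a point embedding capture an asymmetric relation, while the fixed-apex cone shape is what makes order embeddings a special case of boxes.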
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Training details</head><p>In all our experiments, we keep the embedding dimension at 10 for all the baseline models, while for our model we use an embedding dimension of 5. This results in 10 parameters per node and ensures a fair comparison. For the task of learning the joint hierarchy with multi-relation prediction models like TransE and ComplEx, we keep their embedding dimension at 20 to account for the doubling of parameters due to the introduction of two embeddings per node in Order embeddings, Poincar&#233; embeddings, Hyperbolic entailment cones, and our model. The effect of increasing the embedding dimension is reported in Appendix B&#8225;. We use the validation set to find the best threshold for the classification and obtain the F1 score on the test set using this threshold. Our best performing models have a softbox temperature between 0.2 and 0.5. We obtain the best learning rate for every model using random search and performance on the validation set&#167;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>The F1 scores for the hypernymy and meronymy predictions are presented in Table <ref type="table">2</ref>. We note that our box embedding method outperforms all baselines by a large margin in the single-hierarchy settings. In particular, in the transitive reduction (0%) setting for hypernymy, we achieve a 40% relative improvement over the most competitive baseline (order embeddings). This demonstrates the capability of box embeddings to reconstruct the whole transitive closure graph given only the transitive reduction. We observe a similar trend when modeling meronymy. When transitive closure edges are added to the training data (10%, 25%, 50%), the gap between our method and the baselines closes; however, our method still remains superior.</p><p>In Table <ref type="table">3</ref>, we report the F1 scores for the test edges of the joint hierarchy, which are the composite edges created by the rules (7) mentioned in Section 4.1. We outperform the knowledge base completion baselines, TransE and ComplEx, in this setting as well. Lacking the inductive bias required to model such highly connected relations, the hyperbolic space methods struggle to train well. Order embeddings, on the other hand, slightly outperform our model, but the latter is essentially at par while also achieving superior performance on the task of modeling a single hierarchy. The high performance of order embeddings on this task can be attributed to the "local" nature of the evaluation (we traverse only 2 steps up and down the hypernym graph by keeping m, n &#8712; {1, 2} in eq. 7), which might not expose the representational limitations of order embeddings, something that is exposed when modeling the relations individually.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>&#8225;. The appendix is available at <ref type="url">https://github.com/iesl/Boxes_for_Joint_hierarchy_AKBC_2020/blob/master/supplementary_material/appendix.pdf</ref></p></note><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>&#167;. We use the Weights &amp; Biases package to manage our experiments <ref type="bibr">[Biewald, 2020]</ref>.</p></note>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Qualitative analysis</head><p>In order to visualize the effect of using "soft" boxes, we train a 2-dimensional box embedding model on the hypernym relation of WordNet. Due to the softening of the boxes, the model does not require the dog.n.01 box to be completely contained inside the domestic_animal.n.01 box to produce a high score for the edge between them. Figure <ref type="figure">5</ref> shows the visualization obtained by plotting the hard edges using the minimum and maximum coordinates of the softboxes. For the joint hierarchy model, we obtain similar visualizations, which are given in Appendix A&#8225;. In order to model a hierarchy, the embedding model has to accommodate the exponential growth in the number of nodes while moving down the hierarchy. The loss function for the box embedding model encourages the embedding of the head node to contain the embedding of the tail node. Hence, the number of boxes having low volume should be high. This intuition is confirmed by Figure <ref type="figure">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this paper we have explored the capability of box embeddings to model hierarchical data. We demonstrated that they provide superior performance compared to alternative methods while requiring fewer edges from the transitive closure. We also demonstrated that boxes are capable of learning data which is less tree-like, and introduced a method of embedding a joint hierarchy in a single space by augmenting the graph. There are several promising directions for future work. It would be of interest to expand the number of relations modeled in a single box space and to extend the method to model non-transitive relations. Additionally, the knowledge encoded in these embeddings can be exploited by using them as a representation layer in neural network models for various downstream tasks such as question answering and natural language inference.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>&#8224;. Note that when calculating conditional probabilities it is not necessary to normalize the volume of the space to size 1.</p></note>
		</body>
		</text>
</TEI>
