Taxonomies are fundamental to many real-world applications in various domains, serving as structural representations of knowledge. To deal with the increasing volume of new concepts that need to be organized as taxonomies, researchers turn to the automatic completion of an existing taxonomy with new concepts. In this paper, we propose TaxoEnrich, a new taxonomy completion framework, which effectively leverages both semantic features and structural information in the existing taxonomy and offers a better representation of candidate positions to boost the performance of taxonomy completion. Specifically, TaxoEnrich consists of four components: (1) a taxonomy-contextualized embedding, which incorporates both the semantic meanings of concepts and taxonomic relations based on powerful pretrained language models; (2) a taxonomy-aware sequential encoder, which learns candidate position representations by encoding the structural information of the taxonomy; (3) a query-aware sibling encoder, which adaptively aggregates candidate siblings to augment candidate position representations based on their importance to the query-position matching; and (4) a query-position matching model, which extends existing work with our new candidate position representations. Extensive experiments on four large real-world datasets from different domains show that TaxoEnrich achieves the best performance across all evaluation metrics and outperforms previous state-of-the-art methods by a large margin.
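The query-aware sibling encoder described above weights each candidate sibling by its importance to the query. A minimal sketch of that idea follows, assuming dot-product scores passed through a softmax; the abstract does not state the actual scoring function, so the function name `aggregate_siblings` and the scoring choice are illustrative assumptions, not the TaxoEnrich implementation.

```python
import math

def aggregate_siblings(query, siblings):
    """Return (aggregated_vector, weights) for a list of sibling embeddings.

    Assumption: importance is a softmax over query-sibling dot products.
    The abstract only says siblings are aggregated adaptively by importance.
    """
    if not siblings:
        return [0.0] * len(query), []
    # Score each sibling against the query embedding.
    scores = [sum(q * s for q, s in zip(query, sib)) for sib in siblings]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of sibling embeddings, one dimension at a time.
    aggregated = [
        sum(w * sib[i] for w, sib in zip(weights, siblings))
        for i in range(len(query))
    ]
    return aggregated, weights

if __name__ == "__main__":
    vec, weights = aggregate_siblings([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(vec, weights)
```

Siblings that align with the query receive larger weights, so they contribute more to the augmented position representation.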
Completing Taxonomies with Relation-Aware Mutual Attentions
Taxonomies serve many applications with a structural representation of knowledge. To incorporate emerging concepts into existing taxonomies, the task of taxonomy completion aims to find suitable positions for emerging query concepts. Previous work captured homogeneous token-level interactions inside a concatenation of the query concept term and definition using pre-trained language models. However, it ignored the token-level interactions between the term and definition of the query concepts and their related concepts. In this work, we propose to capture heterogeneous token-level interactions between the different textual components of concepts that have different types of relations. We design a relation-aware mutual attention module (RAMA) to learn such interactions for taxonomy completion. Experimental results demonstrate that our new taxonomy completion framework based on RAMA achieves state-of-the-art performance on six taxonomy datasets.
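To make the notion of a "position" concrete, the sketch below enumerates candidate positions in a small taxonomy. It assumes a simplified definition that the abstract does not spell out: a position is either an existing parent-child edge (the query would be inserted between the two) or a leaf slot under a node (written with `None` as the child). The function name `candidate_positions` and the example taxonomy are hypothetical.

```python
def candidate_positions(children):
    """List candidate (parent, child) positions for a query concept.

    `children` maps each node to its list of direct children.
    Assumption: a position is an existing edge (parent, child) or a
    leaf slot (parent, None); other formulations are possible.
    """
    # Collect every node, including those that only appear as children.
    nodes = set(children)
    for kids in children.values():
        nodes.update(kids)

    positions = []
    for parent in sorted(nodes):
        # The query could become a new leaf under this node.
        positions.append((parent, None))
        # Or it could be inserted on an existing edge.
        for child in children.get(parent, []):
            positions.append((parent, child))
    return positions

if __name__ == "__main__":
    taxonomy = {"science": ["biology", "physics"], "biology": ["genetics"]}
    for position in candidate_positions(taxonomy):
        print(position)
```

A completion model would then score the query concept against each such position and return the best-ranked ones.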
- PAR ID: 10433828
- Date Published:
- Journal Name: KDD
- ISSN: 2154-817X
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Taxonomies consist of machine-interpretable semantics and provide valuable knowledge for many web applications. For example, online retailers (e.g., Amazon and eBay) use taxonomies for product recommendation, and web search engines (e.g., Google and Bing) leverage taxonomies to enhance query understanding. Enormous efforts have been made on constructing taxonomies either manually or semi-automatically. However, with the fast-growing volume of web content, existing taxonomies will become outdated and fail to capture emerging knowledge. Therefore, in many applications, dynamic expansions of an existing taxonomy are in great demand. In this paper, we study how to expand an existing taxonomy by adding a set of new concepts. We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of ⟨query concept, anchor concept⟩ pairs from the existing taxonomy as training data. Using such self-supervision data, TaxoExpan learns a model to predict whether a query concept is the direct hyponym of an anchor concept. We develop two innovative techniques in TaxoExpan: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data. Extensive experiments on three large-scale datasets from different domains demonstrate both the effectiveness and the efficiency of TaxoExpan for taxonomy expansion.
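The self-supervision step, generating ⟨query concept, anchor concept⟩ pairs from the existing taxonomy, can be sketched as follows. This is an illustration under stated assumptions, not the TaxoExpan code: the taxonomy is treated as a tree given by a child-to-parent map, each existing edge yields one positive pair, and negatives are drawn by uniform random sampling of other nodes. The names `make_training_pairs` and `num_negatives` are hypothetical.

```python
import random

def make_training_pairs(parent_of, num_negatives=2, seed=0):
    """Build (query, anchor, label) triples from an existing taxonomy.

    `parent_of` maps each non-root concept to its parent (tree assumed).
    Label 1: the query is a direct hyponym of the anchor.
    Label 0: a randomly sampled anchor that is not the query's parent
             (uniform negative sampling is an assumption of this sketch).
    """
    rng = random.Random(seed)
    nodes = sorted(set(parent_of) | set(parent_of.values()))
    pairs = []
    for query, parent in sorted(parent_of.items()):
        # Positive pair taken directly from an existing edge.
        pairs.append((query, parent, 1))
        # Negative anchors: any node other than the query and its parent.
        pool = [n for n in nodes if n != query and n != parent]
        for anchor in rng.sample(pool, min(num_negatives, len(pool))):
            pairs.append((query, anchor, 0))
    return pairs

if __name__ == "__main__":
    taxonomy = {"dog": "mammal", "cat": "mammal", "mammal": "animal"}
    for triple in make_training_pairs(taxonomy):
        print(triple)
```

Because sampled negatives may occasionally be plausible anchors (for example, a grandparent), labels produced this way are noisy, which is the kind of label noise the noise-robust training objective is meant to tolerate.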
-
Automatic construction of a taxonomy supports many applications in e-commerce, web search, and question answering. Existing taxonomy expansion or completion methods assume that new concepts have been accurately extracted and their embedding vectors learned from the text corpus. However, one critical and fundamental challenge in fixing the incompleteness of taxonomies is the incompleteness of the extracted concepts, especially for those whose names have multiple words and consequently low frequency in the corpus. To resolve the limitations of extraction-based methods, we propose GenTaxo to enhance taxonomy completion by identifying positions in existing taxonomies that need new concepts and then generating appropriate concept names. Instead of relying on the corpus for concept embeddings, GenTaxo learns the contextual embeddings from their surrounding graph-based and language-based relational information, and leverages the corpus for pre-training a concept name generator. Experimental results demonstrate that GenTaxo improves the completeness of taxonomies over existing methods.
-
Taxonomies are of great value to many knowledge-rich applications. As manual taxonomy curation costs enormous human effort, automatic taxonomy construction is in great demand. However, most existing automatic taxonomy construction methods can only build hypernymy taxonomies wherein each edge is limited to expressing the “is-a” relation. Such a restriction limits their applicability to more diverse real-world tasks where parent-child edges may carry different relations. In this paper, we aim to construct a task-guided taxonomy from a domain-specific corpus, and allow users to input a “seed” taxonomy, serving as the task guidance. We propose an expansion-based taxonomy construction framework, namely HiExpan, which automatically generates a key term list from the corpus and iteratively grows the seed taxonomy. Specifically, HiExpan views all children under each taxonomy node as forming a coherent set and builds the taxonomy by recursively expanding all these sets. Furthermore, HiExpan incorporates a weakly-supervised relation extraction module to extract the initial children of a newly expanded node and adjusts the taxonomy tree by optimizing its global structure. Our experiments on three real datasets from different domains demonstrate the effectiveness of HiExpan for building task-guided taxonomies.
-
Traditional taxonomy provides a hierarchical organization of bacteria and archaea across taxonomic ranks from kingdom to subspecies. More recently, bacterial taxonomy has been more robustly quantified using comparisons of sequenced genomes, as in the Genome Taxonomy Database (GTDB), resolving down to genera and species. Such taxonomies have proven useful in many contexts, yet lack the flexibility and resolution of a more fine-grained approach. We apply our Life Identification Number (LIN) approach as a common, quantitative framework to tie existing (and future) bacterial taxonomies together, increase the resolution of genome-based discrimination of taxa, and extend taxonomic identification below the species level in a principled way. We utilize our existing concept of a LINgroup as an organizational concept for microorganisms that are closely related by overall genomic similarity, to help resolve some of the confusions and unforeseen negative effects of nomenclature changes of microbes due to genome-based reclassification. Our results obtained from experimentation demonstrate the value of LINs and LINgroups in mapping between taxonomies, translating between different nomenclatures, and integrating them into a single taxonomic framework.