- Award ID(s):
- 1617408
- PAR ID:
- 10175987
- Date Published:
- Journal Name:
- Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR '19)
- Page Range / eLocation ID:
- 249 to 252
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Semantic relationships, such as hyponym–hypernym, cause–effect, and meronym–holonym, between a pair of entities in a sentence are usually reflected through syntactic patterns. Automatic extraction of such patterns benefits several downstream tasks, including entity extraction, ontology building, and question answering. Unfortunately, automatic extraction of such patterns has not yet received much attention from NLP and information retrieval researchers. In this work, we propose an attention-based supervised deep learning model, ASPER, which extracts syntactic patterns between entities exhibiting a given semantic relation in the sentential context. We validate the performance of ASPER on three distinct semantic relations (hyponym–hypernym, cause–effect, and meronym–holonym) on six datasets. Experimental results show that for all these semantic relations, ASPER can automatically identify a collection of syntactic patterns reflecting the existence of such a relation between a pair of entities in a sentence. In comparison to existing methodologies for syntactic pattern extraction, ASPER's performance is substantially superior.
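The core idea, attention weights highlighting the tokens that signal a relation, can be sketched as follows. This is only an illustrative simplification under assumed hyperparameters, not the published ASPER architecture; the class name `AttentionPatternScorer` and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class AttentionPatternScorer(nn.Module):
    """Toy attention-based relation classifier; high-attention tokens
    serve as candidate syntactic-pattern tokens."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)       # token-level attention
        self.classify = nn.Linear(2 * hidden_dim, 2)   # relation present / absent

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))                  # (B, T, 2H)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=-1)   # (B, T)
        context = torch.bmm(weights.unsqueeze(1), h).squeeze(1)     # (B, 2H)
        return self.classify(context), weights
```

For a sentence predicted to contain, say, a hyponym–hypernym pair, the tokens receiving the highest attention weights (e.g., "such as") can be read off as the candidate syntactic pattern.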
-
There is an urgent need for ready access to published data for advances in materials design, and natural language processing (NLP) techniques offer a promising solution for extracting relevant information from scientific publications. In this paper, we present a domain-specific approach utilizing a Transformer-based model, T5, to automate the generation of sample lists in the field of polymer nanocomposites (PNCs). Leveraging large-scale corpora, we employ advanced NLP techniques including named entity recognition and relation extraction to accurately extract sample codes, compositions, group references, and properties from PNC papers. The T5 model demonstrates competitive performance in relation extraction using a TANL framework and an EM-style input sequence. Furthermore, we explore multi-task learning and joint-entity-relation extraction to enhance efficiency and address deployment concerns. Our proposed methodology, from corpora generation to model training, showcases the potential of structured knowledge extraction from publications in PNC research and beyond.
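As a rough illustration of the generative (TANL-style) formulation, the sketch below runs an off-the-shelf `t5-base` checkpoint through the Hugging Face `transformers` API. The task prefix, the example sentence, and the inline-annotation output format are assumptions for illustration; a checkpoint actually fine-tuned on annotated PNC text would be required to produce meaningful output.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Stand-in for a model fine-tuned on annotated polymer-nanocomposite text.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

sentence = ("Sample PNC-3 contains 5 wt% silica nanoparticles "
            "dispersed in an epoxy matrix.")
inputs = tokenizer("extract samples and relations: " + sentence,
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# A fine-tuned model would emit the sentence with inline annotations, e.g.
# "[ PNC-3 | sample ] contains [ silica nanoparticles | filler | sample = PNC-3 ] ..."
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```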
-
Open-domain keyphrase extraction (KPE) on the Web is a fundamental yet complex NLP task with a wide range of practical applications within the field of Information Retrieval. In contrast to other document types, web page designs are intended for easy navigation and information finding. Effective designs encode, within the layout and formatting, signals that point to where the important information can be found. In this work, we propose a modeling approach that leverages these multi-modal signals to aid in the KPE task. In particular, we leverage both lexical and visual features (e.g., size, font, position) at the micro-level to enable effective strategy induction, and meta-level features that describe pages at a macro-level to aid in strategy selection. Our evaluation demonstrates that a combination of effective strategy induction and strategy selection within this approach for the KPE task outperforms state-of-the-art models. A qualitative post-hoc analysis illustrates how these features function within the model.
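One simple way to realize the micro-level fusion described above is to concatenate layout ("visual") features with word embeddings before sequence tagging. The sketch below is a hypothetical minimal version, not the paper's model; the feature set, dimensions, and tag scheme are assumptions.

```python
import torch
import torch.nn as nn

class MultiModalKPTagger(nn.Module):
    """Tags tokens with B/I/O keyphrase labels from lexical + layout features."""
    def __init__(self, vocab_size, embed_dim=100, visual_dim=4,
                 hidden_dim=128, n_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim + visual_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.tagger = nn.Linear(2 * hidden_dim, n_tags)

    def forward(self, token_ids, visual_feats):
        # visual_feats: (B, T, visual_dim), e.g. [font_size, is_bold, x_pos, y_pos]
        x = torch.cat([self.embed(token_ids), visual_feats], dim=-1)
        h, _ = self.encoder(x)
        return self.tagger(h)   # per-token tag scores
```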
-
Related work has demonstrated the helpfulness of utilizing information about entities in text retrieval; here we explore the converse: utilizing information about text in entity retrieval. We model the relevance of Entity-Neighbor-Text (ENT) relations to derive a learning-to-rank-entities model. We focus on the task of retrieving (multiple) relevant entities in response to a topical information need such as "Zika fever". The ENT Rank model is designed to exploit semi-structured knowledge resources such as Wikipedia for entity retrieval. The ENT Rank model combines (1) established entity-relevance features with (2) information from neighboring entities (co-mentioned or mentioned-on-page), incorporated through (3) relevance scores of textual contexts computed with traditional retrieval models such as BM25 and RM3.
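The general shape of such a feature-based entity ranker can be illustrated with a pointwise learning-to-rank sketch: each candidate entity is represented by a vector concatenating entity-relevance features, aggregated neighbor-entity evidence, and retrieval scores (e.g., BM25/RM3) of its textual contexts, and a learned model scores the candidates. The feature values and the use of logistic regression below are illustrative assumptions, not the ENT Rank implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [entity_prior, neighbor_evidence, bm25_context, rm3_context]
X_train = np.array([
    [0.9, 0.7, 12.3, 0.08],   # relevant candidate for the query "Zika fever"
    [0.2, 0.1,  3.1, 0.01],   # non-relevant candidate
    [0.8, 0.5, 10.7, 0.06],
    [0.1, 0.2,  2.4, 0.02],
])
y_train = np.array([1, 0, 1, 0])

ranker = LogisticRegression().fit(X_train, y_train)

# Rank candidate entities by predicted relevance probability.
scores = ranker.predict_proba(X_train)[:, 1]
print(np.argsort(-scores))   # indices from most to least relevant
```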
-
Abstract Background Natural language processing (NLP) tasks in the health domain often deal with limited amounts of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at a scale that is unlikely to be available in the health domain. In this work, we aim to study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP.
Method We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and their variants that leveraged external knowledge. To obtain learning curves, we increased the number of training examples per disease from small to large and measured the classification performance in macro-averaged $$F_{1}$$ score.
Results On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features.
Conclusion As long as the number of training documents is not extremely small, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks like disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features should be considered.
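The evaluation protocol described above (train on increasing amounts of data, score with macro-averaged $$F_{1}$$) can be sketched as follows. The TF-IDF + logistic-regression pipeline is only a stand-in for the paper's bag-of-words baselines, and the function name and inputs are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def learning_curve(train_texts, train_labels, test_texts, test_labels, sizes):
    """Return macro-F1 on a fixed test set as the training size grows."""
    scores = []
    for n in sizes:
        model = make_pipeline(TfidfVectorizer(),
                              LogisticRegression(max_iter=1000))
        model.fit(train_texts[:n], train_labels[:n])
        preds = model.predict(test_texts)
        # Macro-averaging weights every disease class equally, so rare
        # diseases count as much as common ones.
        scores.append(f1_score(test_labels, preds, average="macro"))
    return scores
```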