

Title: SearchIE: A Retrieval Approach for Information Extraction
We address the problem of entity extraction from very few examples and approach it with information retrieval. Existing extraction approaches consider millions of features extracted from a large number of training cases, which are generally generated by distant supervision using entities in a knowledge base; a model is then learned and entities are extracted. However, with extremely limited data, a ranked list of relevant entities can be used to obtain user feedback and thereby gather more training data. As Information Retrieval (IR) is a natural choice for generating ranked lists, we explore its effectiveness in this limited-data setting. To this end, we propose SearchIE, a hybrid IR and NLP approach that indexes documents represented with handcrafted NLP features. At query time, SearchIE samples query terms from a logistic regression model trained on the extremely limited data. We show that SearchIE outperforms state-of-the-art NLP models at finding civilians killed by US police officers, even with a single civilian name as the example.
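As an illustration of the query-construction idea described in the abstract, the following is a minimal sketch (not the authors' implementation): a logistic regression model is fit on a handful of labeled documents, and query terms are then sampled in proportion to their positive model weights before being issued against the index. The tiny document set, the bag-of-words features standing in for the paper's handcrafted NLP features, and the exact sampling scheme are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, tiny training set: documents containing the seed entity are positive.
docs = [
    "officer shot civilian john doe",
    "city council approves new budget",
    "police fatally shot a man during a traffic stop",
    "weather forecast calls for sunshine",
]
labels = [1, 0, 1, 0]

vec = CountVectorizer()                      # stands in for the handcrafted NLP features
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Sample query terms with probability proportional to their positive model weights.
terms = vec.get_feature_names_out()
weights = np.clip(clf.coef_[0], 0.0, None)
probs = weights / weights.sum()
rng = np.random.default_rng(0)
query_terms = rng.choice(terms, size=3, replace=False, p=probs)

print("sampled query:", " ".join(query_terms))
# The sampled terms would then be issued as a query against the document index.
```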
Award ID(s):
1617408
PAR ID:
10175987
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the International Conference on the Theory of Information Retrieval (ICTIR '19)
Page Range / eLocation ID:
249 to 252
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Semantic relationships between a pair of entities in a sentence, such as hyponym–hypernym, cause–effect, and meronym–holonym, are usually reflected through syntactic patterns. Automatic extraction of such patterns benefits several downstream tasks, including entity extraction, ontology building, and question answering. Unfortunately, this extraction task has not yet received much attention from NLP and information retrieval researchers. In this work, we propose an attention-based supervised deep learning model, ASPER, which extracts syntactic patterns between entities exhibiting a given semantic relation in the sentential context. We validate the performance of ASPER on three distinct semantic relations (hyponym–hypernym, cause–effect, and meronym–holonym) across six datasets. Experimental results show that for all these semantic relations, ASPER can automatically identify a collection of syntactic patterns reflecting the existence of such a relation between a pair of entities in a sentence. In comparison to existing methodologies for syntactic pattern extraction, ASPER's performance is substantially superior.
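For intuition about what a syntactic pattern reflecting a semantic relation looks like, here is a deliberately simple sketch using one hand-written lexico-syntactic rule (a classic Hearst pattern for hyponym–hypernym pairs). ASPER learns such patterns with an attention-based neural model rather than with rules like this; the sentence and regular expression below are illustrative only.

```python
import re

# One hand-written lexico-syntactic pattern: "<hypernym> such as <hyponym>".
HEARST = re.compile(r"(\w+) such as (\w+)")

sentence = "Mammals such as dolphins are highly social."
match = HEARST.search(sentence)
if match:
    hypernym, hyponym = match.group(1), match.group(2)
    print(f"hyponym-hypernym pair: ({hyponym!r}, {hypernym!r})")
# ASPER's goal is to discover such patterns automatically, in sentential context,
# rather than relying on a fixed list of hand-written rules.
```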
  2. There is an urgent need for ready access to published data for advances in materials design, and natural language processing (NLP) techniques offer a promising solution for extracting relevant information from scientific publications. In this paper, we present a domain-specific approach utilizing a Transformer-based model, T5, to automate the generation of sample lists in the field of polymer nanocomposites (PNCs). Leveraging large-scale corpora, we employ advanced NLP techniques including named entity recognition and relation extraction to accurately extract sample codes, compositions, group references, and properties from PNC papers. The T5 model demonstrates competitive performance in relation extraction using a TANL framework and an EM-style input sequence. Furthermore, we explore multi-task learning and joint-entity-relation extraction to enhance efficiency and address deployment concerns. Our proposed methodology, from corpora generation to model training, showcases the potential of structured knowledge extraction from publications in PNC research and beyond. 
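As a rough illustration of the TANL-style input/output formatting mentioned above, the sketch below builds an "augmented natural language" target string in which entity spans are bracketed with their types and relations. The markup, entity types, and example sentence are assumptions made for illustration; the paper's exact schema may differ.

```python
def tanl_style_target(sentence, entities, relations):
    """Bracket each entity span with its type and its relations to other spans.

    entities:  list of (span, entity_type)
    relations: list of (head_span, relation, tail_span)
    """
    out = sentence
    for span, etype in entities:
        rels = [f"{rel} = {tail}" for head, rel, tail in relations if head == span]
        markup = " | ".join([etype] + rels)
        out = out.replace(span, f"[ {span} | {markup} ]", 1)
    return out

sentence = "The epoxy matrix was filled with 5 wt% silica nanoparticles."
entities = [("epoxy", "MATRIX"), ("silica nanoparticles", "FILLER"), ("5 wt%", "COMPOSITION")]
relations = [("silica nanoparticles", "amount", "5 wt%")]

print("input :", "extract samples: " + sentence)
print("target:", tanl_style_target(sentence, entities, relations))
# target: The [ epoxy | MATRIX ] matrix was filled with [ 5 wt% | COMPOSITION ]
#         [ silica nanoparticles | FILLER | amount = 5 wt% ].
```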
  3. This work introduces TrialSieve, a novel framework for biomedical information extraction that enhances clinical meta-analysis and drug repurposing. By extending traditional PICO (Patient, Intervention, Comparison, Outcome) methodologies, TrialSieve incorporates hierarchical, treatment-group-based graphs, enabling more comprehensive and quantitative comparisons of clinical outcomes. TrialSieve was used to annotate 1609 PubMed abstracts, yielding 170,557 annotations and 52,638 final spans across 20 unique annotation categories that capture a diverse range of biomedical entities relevant to systematic reviews and meta-analyses. The performance (accuracy, precision, recall, F1-score) of four natural language processing (NLP) models (BioLinkBERT, BioBERT, KRISSBERT, PubMedBERT) and the large language model (LLM) GPT-4o was evaluated on the human-annotated TrialSieve dataset. BioLinkBERT had the best accuracy (0.875) and recall (0.679) for biomedical entity labeling, whereas PubMedBERT had the best precision (0.614) and F1-score (0.639). Error analysis showed that NLP models trained on noisy, human-annotated data can match or, in most cases, surpass human performance. This finding highlights the feasibility of fully automating biomedical information extraction, even when relying on imperfectly annotated datasets. An annotator user study (n = 39) revealed significant (p < 0.05) gains in efficiency and human annotation accuracy with the unique TrialSieve tree-based annotation approach. In summary, TrialSieve provides a foundation for improving automated biomedical information extraction for front-end clinical research.
  4.
    Open-domain keyphrase extraction (KPE) on the Web is a fundamental yet complex NLP task with a wide range of practical applications within the field of Information Retrieval. In contrast to other document types, web page designs are intended for easy navigation and information finding. Effective designs encode, within their layout and formatting, signals that point to where the important information can be found. In this work, we propose a modeling approach that leverages these multi-modal signals to aid the KPE task. In particular, we leverage both lexical and visual features (e.g., size, font, position) at the micro level to enable effective strategy induction, and meta-level features that describe pages at a macro level to aid strategy selection. Our evaluation demonstrates that combining effective strategy induction and strategy selection within this approach outperforms state-of-the-art models on the KPE task. A qualitative post-hoc analysis illustrates how these features function within the model.
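Below is a minimal sketch of the kind of multi-modal candidate representation the abstract describes, combining lexical signals with visual/layout signals such as font size and on-page position. The feature set, weights, and scoring function are hypothetical; the paper induces and selects extraction strategies from such signals rather than hand-setting them.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    phrase: str
    tf: int          # lexical: term frequency on the page
    font_px: float   # visual: rendered font size
    y_pos: float     # visual: vertical position (0.0 = top of page, 1.0 = bottom)
    bold: bool       # visual: bold formatting

def score(c: Candidate) -> float:
    # Hand-set weights for illustration only; a learned model would combine these signals.
    return 1.0 * c.tf + 0.2 * c.font_px - 2.0 * c.y_pos + 1.5 * float(c.bold)

candidates = [
    Candidate("open-domain keyphrase extraction", tf=3, font_px=24.0, y_pos=0.05, bold=True),
    Candidate("privacy policy", tf=1, font_px=10.0, y_pos=0.95, bold=False),
]
for c in sorted(candidates, key=score, reverse=True):
    print(f"{score(c):5.2f}  {c.phrase}")
```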
  5. Abstract. Background: Natural language processing (NLP) tasks in the health domain often deal with limited amounts of labeled data due to high annotation costs and naturally rare observations. To compensate for the lack of training data, health NLP researchers often have to leverage knowledge and resources external to the task at hand. Recently, pretrained large-scale language models such as the Bidirectional Encoder Representations from Transformers (BERT) have proven to be a powerful way of learning rich linguistic knowledge from massive unlabeled text and transferring that knowledge to downstream tasks. However, previous downstream tasks often used training data at a scale that is unlikely to be available in the health domain. In this work, we study whether BERT can still benefit downstream tasks when training data are relatively small in the context of health NLP. Method: We conducted a learning curve analysis to study the behavior of BERT and baseline models as training data size increases. We observed the classification performance of these models on two disease diagnosis data sets, where some diseases are naturally rare and have very limited observations (fewer than 2 out of 10,000). The baselines included commonly used text classification models such as sparse and dense bag-of-words models, long short-term memory networks, and variants of these that leveraged external knowledge. To obtain learning curves, we increased the number of training examples per disease from small to large and measured classification performance in macro-averaged F1 score. Results: On the task of classifying all diseases, the learning curves of BERT were consistently above all baselines, significantly outperforming them across the spectrum of training data sizes. But under extreme situations where only one or two training documents per disease were available, BERT was outperformed by linear classifiers with carefully engineered bag-of-words features. Conclusion: As long as the number of training documents is not extremely small, fine-tuning a pretrained BERT model is a highly effective approach to health NLP tasks such as disease classification. However, in extreme cases where each class has only one or two training documents and no more will be available, simple linear models using bag-of-words features should be considered.
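A minimal sketch of the learning-curve protocol described in the Method section above: sample an increasing number of training documents per class, train a classifier, and record macro-averaged F1 on a fixed test set. The TF-IDF plus logistic regression classifier is only a stand-in for the BERT and bag-of-words models compared in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def learning_curve(train_texts, train_labels, test_texts, test_labels,
                   sizes=(1, 2, 5, 10, 50, 100), seed=0):
    """Macro-F1 as a function of the number of training documents per class."""
    rng = np.random.default_rng(seed)
    train_labels = np.asarray(train_labels)
    scores = {}
    for n in sizes:
        # Sample up to n training documents for each class.
        idx = []
        for c in np.unique(train_labels):
            pool = np.flatnonzero(train_labels == c)
            idx.extend(rng.choice(pool, size=min(n, pool.size), replace=False))
        # A TF-IDF + logistic regression classifier stands in for the paper's models.
        vec = TfidfVectorizer()
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform([train_texts[i] for i in idx]), train_labels[idx])
        pred = clf.predict(vec.transform(test_texts))
        scores[n] = f1_score(test_labels, pred, average="macro")
    return scores
```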