

Title: GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models
The field of relation extraction (RE) is experiencing a notable shift towards generative relation extraction (GRE), leveraging the capabilities of large language models (LLMs). However, we discovered that traditional RE metrics such as precision and recall fall short in evaluating GRE methods. This shortfall arises because these metrics rely on exact matching with human-annotated reference relations, while GRE methods often produce diverse and semantically accurate relations that differ from the references. To fill this gap, we introduce GenRES for a multidimensional assessment of the topic similarity, uniqueness, granularity, factualness, and completeness of GRE results. With GenRES, we empirically identified that (1) precision/recall fails to justify the performance of GRE methods; (2) human-annotated reference relations can be incomplete; (3) prompting LLMs with a fixed set of relations or entities can cause hallucinations. Next, we conducted a human evaluation of GRE methods, which shows that GenRES is consistent with human preferences for RE quality. Last, we made a comprehensive evaluation of fourteen leading LLMs using GenRES across document-, bag-, and sentence-level RE datasets to set the benchmark for future research in GRE.
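
A rough illustration of how two of these dimensions could be scored is sketched below: topic similarity approximated as embedding cosine similarity between the source text and the verbalized triples, and uniqueness as the fraction of non-duplicate triples. This is a minimal sketch under assumed scoring choices, not the released GenRES implementation; the sentence-transformers model name and the example triples are illustrative only.

# Minimal sketch of two GenRES-style dimensions (not the official implementation).
# Assumes the sentence-transformers package; the model name is an illustrative choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def topic_similarity(source_text, triples):
    """Cosine similarity between the source document and the verbalized triples."""
    triple_text = " ".join(f"{h} {r} {t}" for h, r, t in triples)
    embeddings = model.encode([source_text, triple_text], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

def uniqueness(triples):
    """Fraction of extracted triples that are not exact duplicates."""
    return len(set(triples)) / len(triples) if triples else 0.0

triples = [("Marie Curie", "born_in", "Warsaw"),
           ("Marie Curie", "awarded", "Nobel Prize in Physics")]
source = "Marie Curie was born in Warsaw and won the Nobel Prize in Physics."
print(topic_similarity(source, triples), uniqueness(triples))
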
Award ID(s):
1956151
PAR ID:
10541808
Author(s) / Creator(s):
; ; ; ;
Editor(s):
Duh, Kevin; Gómez-Adorno, Helena; Bethard, Steven
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Edition / Version:
1
Page Range / eLocation ID:
2820 to 2837
Subject(s) / Keyword(s):
RE Evaluation Generative Relation Extraction Large Language Models
Format(s):
Medium: X
Location:
Mexico City, Mexico
Sponsoring Org:
National Science Foundation
More Like this
  1. Proc. 2023 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (Ed.)
    Automated relation extraction without extensive human-annotated data is a crucial yet challenging task in text mining. Existing studies typically use lexical patterns to label a small set of high-precision relation triples and then employ distributional methods to enhance detection recall. This precision-first approach works well for common relation types but struggles with unconventional and infrequent ones. In this work, we propose a recall-first approach that first leverages high-recall patterns (e.g., a per:siblings relation normally requires both the head and tail entities to be of the person type) to provide initial candidate relation triples with weak labels and then clusters these candidate relation triples in a latent spherical space to extract high-quality weak supervision. Specifically, we present a novel framework, RCLUS, where each relation triple is represented by its head/tail entity type and the shortest dependency path between the entity mentions. RCLUS first applies high-recall patterns to narrow down each relation type’s candidate space. Then, it embeds candidate relation triples in a latent space and conducts spherical clustering to further filter out noisy candidates and identify high-quality weakly-labeled triples. Finally, RCLUS leverages the above-obtained triples to prompt-tune a pre-trained language model and utilizes it for improved extraction coverage. We conduct extensive experiments on three public datasets and demonstrate that RCLUS outperforms the weakly-supervised baselines by a large margin and achieves generally better performance than fully-supervised methods in low-resource settings.
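
The spherical-clustering step described above can be approximated with a short sketch: triple embeddings are L2-normalized onto the unit sphere, clustered with k-means, and candidates far from their cluster centroid are discarded as noisy. This is an illustration of the general technique, not the released RCLUS code; the random embeddings, cluster count, and cosine threshold are placeholders.

# Illustrative sketch of spherical clustering over candidate triple embeddings
# (placeholder data, not the RCLUS implementation).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Stand-in embeddings for candidate triples (entity types + dependency path).
triple_embeddings = rng.normal(size=(500, 64))

unit_embeddings = normalize(triple_embeddings)  # project onto the unit sphere
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(unit_embeddings)

# Keep only candidates close to their cluster centroid as high-quality weak labels.
centroids = normalize(kmeans.cluster_centers_)
cosine_to_centroid = np.sum(unit_embeddings * centroids[kmeans.labels_], axis=1)
high_quality = np.where(cosine_to_centroid > 0.8)[0]
print(f"kept {len(high_quality)} of {len(unit_embeddings)} candidate triples")
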
  2. Abstract

    There has been a growing effort to replace manual extraction of data from research papers with automated data extraction based on natural language processing, language models, and recently, large language models (LLMs). Although these methods enable efficient extraction of data from large sets of research papers, they require a significant amount of up-front effort, expertise, and coding. In this work, we propose a method that can fully automate highly accurate data extraction with minimal initial effort and background knowledge, using an advanced conversational LLM. The method consists of a set of engineered prompts applied to a conversational LLM that identify sentences containing data, extract that data, and assure the data’s correctness through a series of follow-up questions. These follow-up questions largely overcome known issues with LLMs providing factually inaccurate responses. The method can be applied with any conversational LLM and yields very high-quality data extraction. In tests on materials data, we find precision and recall both close to 90% from the best conversational LLMs, such as GPT-4. We demonstrate that this exceptional performance is enabled by the information retention of a conversational model combined with purposeful redundancy and the introduction of uncertainty through follow-up prompts. These results suggest that approaches of this kind, due to their simplicity, transferability, and accuracy, are likely to become powerful tools for data extraction in the near future. Finally, databases for critical cooling rates of metallic glasses and yield strengths of high entropy alloys are developed using the method.

     
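A schematic of the prompt-chain idea described above (a screening prompt, an extraction prompt, and a purposefully redundant follow-up verification question) might look like the sketch below. The ask() helper is a hypothetical stand-in for any conversational-LLM client, and the prompts and property name are illustrative rather than the authors' exact wording.

# Schematic sketch of conversational extraction with follow-up verification.
def ask(messages):
    """Hypothetical stand-in for a chat-style LLM client; returns the reply text."""
    raise NotImplementedError("plug in a conversational LLM client here")

def extract_with_followups(sentence, prop="critical cooling rate"):
    history = [{"role": "user",
                "content": f"Does this sentence report a value of {prop}? "
                           f"Answer Yes or No.\n\n{sentence}"}]
    reply = ask(history)
    history.append({"role": "assistant", "content": reply})
    if reply.strip().lower().startswith("no"):
        return None

    history.append({"role": "user",
                    "content": "Extract the material, the value, and the unit "
                               "as 'material; value; unit'."})
    extraction = ask(history)
    history.append({"role": "assistant", "content": extraction})

    # Purposefully redundant follow-up: letting the model express uncertainty
    # helps filter factually inaccurate extractions.
    history.append({"role": "user",
                    "content": "Is that value stated verbatim in the sentence? "
                               "Answer Yes or No."})
    return extraction if ask(history).strip().lower().startswith("yes") else None
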
  3. Training emotion recognition models has relied heavily on human-annotated data, which presents diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT-4, in automating or assisting emotion annotation. We compare GPT-4 with supervised models and/or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate GPT-4's performance, and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over human annotations across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filter to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.
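The evaluation concern raised above (agreement with aggregated labels versus agreement with individual annotators) can be made concrete with a small sketch; the emotion labels below are invented for illustration and are not data from the paper.

# Agreement with the majority vote vs. agreement with each individual annotator.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

human_labels = [                      # three annotators, five utterances (made up)
    ["joy", "anger", "sad", "joy", "neutral"],
    ["joy", "neutral", "sad", "joy", "anger"],
    ["neutral", "anger", "sad", "neutral", "neutral"],
]
gpt4_labels = ["joy", "anger", "sad", "joy", "anger"]

majority = [Counter(col).most_common(1)[0][0] for col in zip(*human_labels)]
print("kappa vs. majority vote:", cohen_kappa_score(gpt4_labels, majority))
for i, annotator in enumerate(human_labels):
    print(f"kappa vs. annotator {i}:", cohen_kappa_score(gpt4_labels, annotator))
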
  4. Accurate and comprehensive material databases extracted from research papers are crucial for materials science and engineering, but their development requires significant human effort. With large language models (LLMs) transforming the way humans interact with text, LLMs provide an opportunity to revolutionize data extraction. In this study, we demonstrate a simple and efficient method for extracting materials data from full-text research papers leveraging the capabilities of LLMs combined with human supervision. This approach is particularly suitable for mid-sized databases and requires minimal to no coding or prior knowledge about the extracted property. It offers high recall and nearly perfect precision in the resulting database. The method is easily adaptable to new and superior language models, ensuring continued utility. We show this by evaluating and comparing its performance on GPT-3 and GPT-3.5/4 (which underlie ChatGPT), as well as free alternatives such as BART and DeBERTaV3. We provide a detailed analysis of the method’s performance in extracting sentences containing bulk modulus data, achieving up to 90% precision at 96% recall, depending on the amount of human effort involved. We further demonstrate the method’s broader effectiveness by developing a database of critical cooling rates for metallic glasses over twice the size of previous human-curated databases.
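A rough sketch of the sentence-screening step with one of the free models mentioned above (zero-shot classification with BART) is given below; the model checkpoint, candidate labels, and example sentences are assumptions for illustration, not the authors' exact pipeline.

# Illustrative zero-shot screening of sentences for bulk modulus data.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentences = [
    "The bulk modulus of MgO was measured to be 160 GPa.",
    "Samples were annealed at 800 C for two hours.",
]
for s in sentences:
    result = classifier(s, candidate_labels=["contains bulk modulus value",
                                             "no property value"])
    if result["labels"][0] == "contains bulk modulus value":
        print("keep for extraction:", s)
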
  5. Abstract
Objective

    SNOMED CT provides a standardized terminology for clinical concepts, allowing cohort queries over heterogeneous clinical data including Electronic Health Records (EHRs). While it is intuitive that missing and inaccurate subtype (or is-a) relations in SNOMED CT reduce the recall and precision of cohort queries, the extent of these impacts has not been formally assessed. This study fills this gap by developing quantitative metrics to measure these impacts and performing statistical analysis on their significance.

    Material and Methods

    We used the Optum de-identified COVID-19 Electronic Health Record dataset. We defined micro-averaged and macro-averaged recall and precision metrics to assess the impact of missing and inaccurate is-a relations on cohort queries (a simplified illustration of these metrics is sketched after this abstract). Both practical and simulated analyses were performed. Practical analyses involved 407 missing and 48 inaccurate is-a relations confirmed by domain experts, with statistical testing using Wilcoxon signed-rank tests. Simulated analyses used two random sets of 400 is-a relations to simulate missing and inaccurate is-a relations.

    Results

    Wilcoxon signed-rank tests from both practical and simulated analyses (P-values < .001) showed that missing is-a relations significantly reduced the micro- and macro-averaged recall, and inaccurate is-a relations significantly reduced the micro- and macro-averaged precision.

    Discussion

    The introduced impact metrics can assist SNOMED CT maintainers in prioritizing critical hierarchical defects for quality enhancement. These metrics are generally applicable for assessing the quality impact of a terminology’s subtype hierarchy on its cohort query applications.

    Conclusion

    Our results indicate a significant impact of missing and inaccurate is-a relations in SNOMED CT on the recall and precision of cohort queries. Our work highlights the importance of high-quality terminology hierarchy for cohort queries over EHR data and provides valuable insights for prioritizing quality improvements of SNOMED CT's hierarchy.
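
As a simplified illustration of the micro- versus macro-averaged recall described in the Material and Methods above (a sketch with invented patient sets, not the study's data or exact formulas), each cohort query's retrieved patients are compared against the patients the query would retrieve under a corrected is-a hierarchy:

# Micro- vs. macro-averaged recall of cohort queries (invented example sets).
def micro_macro_recall(queries):
    """queries: list of (retrieved_patient_set, expected_patient_set) pairs."""
    per_query = [len(r & e) / len(e) for r, e in queries if e]
    macro = sum(per_query) / len(per_query)
    micro = (sum(len(r & e) for r, e in queries)
             / sum(len(e) for r, e in queries))
    return micro, macro

queries = [
    ({1, 2, 3}, {1, 2, 3, 4}),         # query misses patient 4 (missing is-a relation)
    ({10, 11}, {10, 11, 12, 13, 14}),  # larger cohort, larger loss
]
micro, macro = micro_macro_recall(queries)
print(f"micro-averaged recall: {micro:.2f}, macro-averaged recall: {macro:.2f}")
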

     