Toward Reliable Biodiversity Information Extraction From Large Language Models

Elliott, Michael J; Fortes, José A B

doi:10.1109/e-Science62913.2024.10678666

In this paper, we develop a method for extracting information from Large Language Models (LLMs) with associated confidence estimates. We propose that effective confidence models may be designed using a large number of uncertainty measures (i.e., variables that are only weakly predictive of, but positively correlated with, information correctness) as inputs. We trained a confidence model that uses 20 handcrafted uncertainty measures to predict GPT-4’s ability to reproduce species occurrence data from iDigBio. When we consider only the occurrence claims placed in the top 30% of confidence estimates, prediction accuracy increases from 57% to 88% for species absence predictions and from 77% to 86% for species presence predictions. Using the same confidence model, we used GPT-4 to extract new data that extrapolates beyond the occurrence records in iDigBio and used the results to visualize geographic distributions for four individual species. More generally, this represents a novel use case for LLMs in generating credible pseudo data for applications in which high-quality curated data are unavailable or inaccessible.
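The filtering step described in the abstract (keeping only the claims whose confidence estimate falls in the top 30%) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `accuracy_at_top_fraction`, the toy confidence scores, and the toy correctness labels are all hypothetical and invented for illustration. The sketch assumes a confidence score and a ground-truth correctness flag are already available for each claim.

```python
import math


def accuracy_at_top_fraction(confidences, correct, fraction=0.3):
    """Accuracy over the `fraction` of claims with the highest confidence.

    Hypothetical helper: `confidences` are per-claim scores from some
    confidence model, `correct` are booleans saying whether each claim
    matched the reference data.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one claim")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Rank claims from most to least confident.
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)

    # Number of claims to keep; the small tolerance guards against
    # floating-point noise pushing an exact product just above an integer.
    keep = max(1, math.ceil(fraction * len(ranked) - 1e-9))
    kept = ranked[:keep]

    return sum(1 for _, ok in kept if ok) / keep


# Toy data, invented for illustration only (not results from the paper).
toy_confidences = [0.95, 0.90, 0.85, 0.60, 0.55, 0.50, 0.40, 0.35, 0.20, 0.10]
toy_correct = [True, True, True, False, True, False, True, False, False, True]

overall = accuracy_at_top_fraction(toy_confidences, toy_correct, fraction=1.0)
selective = accuracy_at_top_fraction(toy_confidences, toy_correct, fraction=0.3)
print(f"all claims: {overall:.2f}, top 30% by confidence: {selective:.2f}")
```

In this toy case the accuracy over the retained claims is higher than over all claims because the invented scores happen to rank correct claims first. How large the gain is in practice depends on how well the confidence model separates correct from incorrect claims, and it comes at the cost of discarding the remaining 70% of claims.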

Award ID(s): 2027654

PAR ID: 10549098

Author(s) / Creator(s): Elliott, Michael J; Fortes, José A B

Publisher / Repository: IEEE

Date Published: 2024-09-16

ISBN: 979-8-3503-6561-0

Page Range / eLocation ID: 1 to 10

Subject(s) / Keyword(s): Uncertainty; Accuracy; Large language models; Measurement uncertainty; Predictive models; Information retrieval; Data models

Format(s): Medium: X

Location: Osaka, Japan

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1109/e-Science62913.2024.10678666
