Application of Large Language Models in Chemistry Reaction Data Extraction and Cleaning

Huang, Xiaobao; Surve, Mihir; Liu, Yuhan; Luo, Tengfei; Wiest, Olaf; Zhang, Xiangliang; Chawla, Nitesh V

doi:10.1145/3627673.3679874

Chemical reaction data has existed, and still largely exists, in unstructured forms. Curating such information into datasets suitable for tasks such as yield and reaction outcome prediction is impractical to do manually and cannot be automated through programmatic means alone. Large language models (LLMs) have emerged as potent tools with remarkable capabilities in processing textual information, and could therefore be extremely useful in automating this process. To address the challenge of unstructured data, we manually curated a dataset of structured chemical reaction data to fine-tune and evaluate LLMs. We propose a paradigm that leverages prompt-tuning, fine-tuning techniques, and a verifier to check the extracted information. We evaluate the capabilities of various LLMs, including LLAMA-2 and GPT models with different parameter counts, on the data extraction task. Our results show that prompt tuning of GPT-4 yields the best accuracy and evaluation results. Fine-tuning LLAMA-2 models with hundreds of samples, however, does improve their ability to organize scientific material according to user-defined schemas. This workflow shows an adaptable approach for chemical reaction data extraction but also highlights the challenges associated with nuance in chemical information. Our code is open-sourced on GitHub.
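The abstract mentions a verifier that checks extracted information against a user-defined schema but does not describe how it works. The following is only a minimal sketch of what such a check could look like, not the authors' implementation. The schema, its field names (`reactants`, `products`, `solvent`, `yield_percent`), and the function `verify_record` are all hypothetical and chosen for illustration.

```python
from typing import Any

# Hypothetical schema (field name -> expected Python type); not taken from the paper.
REACTION_SCHEMA: dict[str, type] = {
    "reactants": list,
    "products": list,
    "solvent": str,
    "yield_percent": float,
}


def verify_record(
    record: dict[str, Any],
    schema: dict[str, type] = REACTION_SCHEMA,
) -> list[str]:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems: list[str] = []

    # Every schema field must be present and have the expected type.
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")

    # Fields outside the schema are flagged rather than silently accepted.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")

    # Simple domain check (an assumption): a yield is a percentage in [0, 100].
    reported_yield = record.get("yield_percent")
    if isinstance(reported_yield, float) and not 0.0 <= reported_yield <= 100.0:
        problems.append("yield_percent outside 0-100")

    return problems
```

In a pipeline of this kind, a record with a non-empty problem list could be sent back to the model for re-extraction or set aside for manual review; which of these the authors do is not stated here.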

Award ID(s): 2202693

PAR ID: 10558020

Author(s) / Creator(s): Huang, Xiaobao; Surve, Mihir; Liu, Yuhan; Luo, Tengfei; Wiest, Olaf; Zhang, Xiangliang; Chawla, Nitesh V

Publisher / Repository: ACM

Date Published: 2024-10-21

ISBN: 9798400704369

Page Range / eLocation ID: 3797 to 3801

Format(s): Medium: X

Location: Boise, ID, USA

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1145/3627673.3679874