It is widely recognized that the ability to exploit Natural Language Processing (NLP) text mining strategies has the potential to increase productivity and innovation in the sciences by orders of magnitude, by enabling scientists to pull information from research articles in disciplines such as genomics and biomedicine. These methods enable scientists to rapidly identify publications relevant to their own research and to make discoveries by scouring hundreds of research papers for associations and connections (such as between drugs and side effects, or genes and disease pathways) that humans reading each paper individually might not notice. The goal of our work is to enable rapid development of workflows for mining scientific publications and, crucially, to provide the means to adapt tools and workflows to data from specific disciplines (domain adaptation). Our work on this project is ongoing; as a starting point, we have addressed the needs of mining biomedical publications, and a Galaxy instance of the LAPPS Grid tailored to this task is currently maintained in the JetStream cloud environment (https://jetstream.lappsgrid.org).
Mining Biomedical Publications With The LAPPS Grid
It is widely recognized that the ability to exploit Natural Language Processing (NLP) text mining strategies has the potential to increase productivity and innovation in the sciences by orders of magnitude, by enabling scientists to pull information from research articles in disciplines such as genomics and biomedicine. The Language Applications (LAPPS) Grid is an infrastructure for rapid development of NLP applications that provides an ideal platform for mining the scientific literature. Its Galaxy interface and the interoperability among its tools together provide an intuitive, easy-to-use platform in which users can experiment with and exploit NLP tools and resources without needing to determine which are suited to a particular task and without significant computing expertise. The LAPPS Grid has collaborated with the developers of PubAnnotation to integrate the services and resources provided by each, greatly enhancing the user’s ability to annotate scientific publications and share the results. This poster/demo shows how the LAPPS Grid can facilitate mining scientific publications, including identification and extraction of relevant entities, relations, and events; iterative manual correction and evaluation of automatically produced annotations; and customization of supporting resources to accommodate specific domains.
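As a concrete illustration of the kind of entity-extraction step such a mining workflow chains together, the following is a minimal sketch using spaCy with the scispaCy biomedical model. The model name and example sentence are assumptions of this sketch; it approximates one step of such a workflow and is not the actual LAPPS Grid or PubAnnotation service.

```python
# Minimal sketch of one step in a publication-mining workflow: biomedical
# named-entity extraction over an abstract, approximated here with spaCy and
# scispaCy rather than the actual LAPPS Grid services.
import spacy  # pip install spacy scispacy, plus the en_core_sci_sm model

# "en_core_sci_sm" is scispaCy's small biomedical model (an assumption of
# this sketch; any NER model could be slotted into the workflow instead).
nlp = spacy.load("en_core_sci_sm")

abstract = (
    "Mutations in BRCA1 are associated with increased risk of breast cancer, "
    "and PARP inhibitors such as olaparib target this pathway."
)

doc = nlp(abstract)
for ent in doc.ents:
    # Each entity can be exported as a stand-off annotation (character
    # offsets plus a label), the general shape PubAnnotation works with.
    print(ent.start_char, ent.end_char, ent.label_, ent.text)
```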
- Award ID(s): 1811123
- PAR ID: 10096174
- Date Published:
- Journal Name: Proceedings of the Eleventh International Conference on Language Resources and Evaluation
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Karin Verspoor, Kevin Bretonnel (Eds.): In a recent project, the Language Applications (LAPPS) Grid was augmented to support the mining of scientific publications. The results of that effort have now been repurposed to focus on Covid-19 literature, including modification of the LAPPS Grid “AskMe” query and retrieval engine. We describe the AskMe system and discuss its functionality as compared to other query engines available for searching Covid-related publications.
-
The LAPPS-CLARIN project is creating a “trust network” between the Language Applications (LAPPS) Grid and the WebLicht workflow engine hosted by the CLARIN-D Center in Tübingen. The project also includes integration of NLP services available from the LINDAT/CLARIN Center in Prague. The goal is to allow users on one side of the bridge to gain appropriately authenticated access to the other and to enable seamless communication among tools and resources in both frameworks. The resulting “meta-framework” provides users across the globe with access to an unprecedented array of language processing facilities that cover multiple languages, tasks, and applications, all of which are fully interoperable.
-
Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, on domain-specific and knowledge-intensive tasks LLMs suffer from hallucinations, knowledge cut-offs, and a lack of knowledge attribution. Additionally, fine-tuning LLMs' intrinsic knowledge for highly specific domains is an expensive and time-consuming process. Retrieval-augmented generation (RAG) has recently emerged as a method for optimizing LLM responses by referencing them against a predetermined ontology, and it has been shown that using a Knowledge Graph (KG) ontology for RAG improves QA accuracy by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, instead using NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chatbots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel at highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents, and the framework is designed to generalize to any specific or specialized domain. We demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection. (A minimal retrieval sketch for this kind of RAG loop appears after this list.)
-
There is an urgent need for ready access to published data for advances in materials design, and natural language processing (NLP) techniques offer a promising solution for extracting relevant information from scientific publications. In this paper, we present a domain-specific approach utilizing a Transformer-based model, T5, to automate the generation of sample lists in the field of polymer nanocomposites (PNCs). Leveraging large-scale corpora, we employ advanced NLP techniques including named entity recognition and relation extraction to accurately extract sample codes, compositions, group references, and properties from PNC papers. The T5 model demonstrates competitive performance in relation extraction using a TANL framework and an EM-style input sequence. Furthermore, we explore multi-task learning and joint entity-relation extraction to enhance efficiency and address deployment concerns. Our proposed methodology, from corpora generation to model training, showcases the potential of structured knowledge extraction from publications in PNC research and beyond. (A minimal sketch of this seq2seq framing appears at the end of this list.)
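For the retrieval-augmented generation framework described in the SMART-SLIC entry above, the following is a minimal sketch of the vector-store half of a RAG loop. The embedding model, documents, and prompt wording are illustrative assumptions of this sketch, not components of the framework itself.

```python
# Minimal sketch of vector-store retrieval for RAG: embed a small domain
# corpus, retrieve the passages nearest to a question, and assemble a prompt
# that lets the LLM cite its sources. All names below are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

documents = [
    "Report A: The malware family uses DNS tunneling for data exfiltration.",
    "Report B: Beaconing intervals in netflow data indicate C2 activity.",
    "Report C: Tensor factorization groups reports into latent topics.",
]
doc_vecs = model.encode(documents, normalize_embeddings=True)

question = "How does the malware family exfiltrate data?"
q_vec = model.encode([question], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity is a plain dot product.
scores = doc_vecs @ q_vec
top_k = np.argsort(scores)[::-1][:2]

context = "\n".join(documents[i] for i in top_k)
prompt = (
    "Answer the question using only the context below and cite the report.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(prompt)  # this prompt would then be passed to the LLM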
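For the T5-based extraction described in the last entry, the sketch below shows only the sequence-to-sequence framing of TANL-style entity/relation extraction. The checkpoint name, task prefix, and example sentence are placeholders, and a model fine-tuned on PNC sample lists is assumed; an off-the-shelf T5 will not emit meaningful annotations.

```python
# Sketch of the sequence-to-sequence framing behind TANL-style extraction
# with T5: the model reads a sentence and generates the same sentence
# augmented with entity/relation markup. Checkpoint and prefix are placeholders.
from transformers import AutoTokenizer, T5ForConditionalGeneration  # pip install transformers sentencepiece

checkpoint = "t5-small"  # placeholder; a model fine-tuned on PNC papers is assumed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

sentence = (
    "Sample PNC-3 contains 5 wt% silica nanoparticles dispersed in an epoxy matrix."
)
inputs = tokenizer("extract samples and compositions: " + sentence,
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)

# With a suitably fine-tuned checkpoint, the decoded string would carry
# inline markup such as "[ PNC-3 | sample ] contains [ 5 wt% silica ... ]".
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```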