NSF PAR Search | NSF Public Access Repository

Taxonomy-Driven Knowledge Graph Construction for Domain-Specific Scientific Applications

https://doi.org/10.18653/v1/2025.findings-acl.223

Pan, Huitong; Zhang, Qi; Adamu, Mustapha; Dragut, Eduard; Latecki, Longin Jan (January 2025, Association for Computational Linguistics)

Full Text Available
ClimateIE: A Dataset for Climate Science Information Extraction

https://doi.org/10.18653/v1/2025.climatenlp-1.6

Pan, Huitong; Adamu, Mustapha; Zhang, Qi; Dragut, Eduard; Latecki, Longin Jan (January 2025, Association for Computational Linguistics)

Full Text Available
FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding

Pan, Huitong; Zhang, Qi; Caragea, Cornelia; Dragut, Eduard; Latecki, Longin J (August 2024, European Conference on Artificial Intelligence (ECAI))

Flowcharts are graphical tools for representing complex concepts in concise visual representations. This paper introduces the FlowLearn dataset, a resource tailored to enhance the understanding of flowcharts. FlowLearn contains complex scientific flowcharts and simulated flowcharts. The scientific subset contains 3,858 flowcharts sourced from scientific literature and the simulated subset contains 10,000 flowcharts created using a customizable script. The dataset is enriched with annotations for visual components, OCR, Mermaid code representation, and VQA question-answer pairs. Despite the proven capabilities of Large Vision-Language Models (LVLMs) in various visual understanding tasks, their effectiveness in decoding flowcharts, a crucial element of scientific communication, has yet to be thoroughly investigated. The FlowLearn test set is crafted to assess the performance of LVLMs in flowchart comprehension. Our study thoroughly evaluates state-of-the-art LVLMs, identifying existing limitations and establishing a foundation for future enhancements in this relatively underexplored domain. For instance, in tasks involving simulated flowcharts, GPT-4V achieved the highest accuracy (58%) in counting the number of nodes, while Claude recorded the highest accuracy (83%) in OCR tasks. Notably, no single model excels in all tasks within the FlowLearn framework, highlighting significant opportunities for further development.
Full Text Available
SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions

Pan, Huitong; Zhang, Qi; Caragea, Cornelia; Dragut, Eduard; Latecki, Longin (May 2024, COLING)

We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus’s scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus’s utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.
Full Text Available
SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents

https://doi.org/10.18653/v1/2024.emnlp-main.726

Zhang, Qi; Chen, Zhijia; Pan, Huitong; Caragea, Cornelia; Latecki, Longin Jan; Dragut, Eduard (January 2024, Association for Computational Linguistics)

Full Text Available
