NSF PAR Search | NSF Public Access Repository

DMDD: A Large-Scale Dataset for Dataset Mentions Detection

https://doi.org/10.1162/tacl_a_00592

Pan, Huitong; Zhang, Qi; Dragut, Eduard; Caragea, Cornelia; Latecki, Longin Jan (September 2023, Transactions of the Association for Computational Linguistics)

Abstract: The recognition of dataset names is a critical task for automatic information extraction in scientific literature, enabling researchers to understand and identify research opportunities. However, existing corpora for dataset mention detection are limited in size and naming diversity. In this paper, we introduce the Dataset Mentions Detection Dataset (DMDD), the largest publicly available corpus for this task. DMDD consists of the DMDD main corpus, comprising 31,219 scientific articles with over 449,000 dataset mentions weakly annotated in the format of in-text spans, and an evaluation set, which comprises 450 scientific articles manually annotated for evaluation purposes. We use DMDD to establish baseline performance for dataset mention detection and linking. By analyzing the performance of various models on DMDD, we are able to identify open problems in dataset mention detection. We invite the community to use our dataset as a challenge to develop novel dataset mention detection models.
Full Text Available

Querying Climate Knowledge: Semantic Retrieval for Scientific Discovery

Adamu, Mustapha; Zhang, Qi; Pan, Huitong; Latecki, Longin; Dragut, Eduard (September 2025, the MANILA workshop series at SIGIR)

Free, publicly-accessible full text available September 9, 2026

Taxonomy-Driven Knowledge Graph Construction for Domain-Specific Scientific Applications

https://doi.org/10.18653/v1/2025.findings-acl.223

Pan, Huitong; Zhang, Qi; Adamu, Mustapha; Dragut, Eduard; Latecki, Longin Jan (January 2025, Association for Computational Linguistics)

Full Text Available

DynClean: Training Dynamics-based Label Cleaning for Distantly-Supervised Named Entity Recognition

https://doi.org/10.18653/v1/2025.findings-naacl.137

Zhang, Qi; Pan, Huitong; Chen, Zhijia; Latecki, Longin Jan; Caragea, Cornelia; Dragut, Eduard (January 2025, Association for Computational Linguistics)

Full Text Available

ClimateIE: A Dataset for Climate Science Information Extraction

https://doi.org/10.18653/v1/2025.climatenlp-1.6

Pan, Huitong; Adamu, Mustapha; Zhang, Qi; Dragut, Eduard; Latecki, Longin Jan (January 2025, Association for Computational Linguistics)

Full Text Available

SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents

Zhang, Q; Chen, Z; Pan, H; Caragea, C; Latecki, L J; Dragut, E (November 2024, Empirical Methods in Natural Language Processing (EMNLP))

Full Text Available

FlowLearn: Evaluating Large Vision-Language Models on Flowchart Understanding

Pan, H; Zhang, Q; Caragea, C; Dragut, E; Latecki, L J (October 2024, European Conference on Artificial Intelligence (ECAI))

Full Text Available

SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions

Pan, H; Zhang, Q; Caragea, C; Dragut, E; Latecki, L J (May 2024, Joint International Conference on Computational Linguistics, Language Resources and Evaluation)

Full Text Available

SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents

https://doi.org/10.18653/v1/2024.emnlp-main.726

Zhang, Qi; Chen, Zhijia; Pan, Huitong; Caragea, Cornelia; Latecki, Longin Jan; Dragut, Eduard (January 2024, Association for Computational Linguistics)

Full Text Available
