Scholarly big data quality assessment: a case study of document linking and conflation with S2ORC

Wu, Jian; Hiltabrand, Ryan; Soós, Dominik; Giles, C. Lee

doi:10.1145/3558100.3563850

Citation Details

Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
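
The abstract names locality-sensitive hashing as the best-performing near-duplicate detection method but does not describe the implementation. As a rough illustration only, the following Python sketch shows one common form of the technique, MinHash signatures with LSH banding, applied to paper titles. It is not the authors' code: the function names, the word-shingle size, and the `num_perm` and `bands` values are all assumptions chosen for the example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word k-grams (shingles) of a lowercased string."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Seeded 64-bit hash of one shingle (stands in for a random permutation)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_perm=64):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def candidate_pairs(signatures, bands=32):
    """Split signatures into bands; records that agree on any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records: "a" and "b" differ only in letter case, "c" is unrelated.
records = {
    "a": "Scholarly big data quality assessment with S2ORC",
    "b": "scholarly big data quality assessment with s2orc",
    "c": "Deep residual learning for image recognition tasks",
}
signatures = {doc_id: minhash_signature(shingles(t)) for doc_id, t in records.items()}
pairs = candidate_pairs(signatures)
```

In this sketch, only records that fall into the same bucket for at least one band are compared further, which is what avoids an all-pairs comparison on a large corpus. More bands with fewer rows each make the filter more permissive; fewer bands with more rows make it stricter.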

Award ID(s): 1823288

PAR ID: 10473655

Author(s) / Creator(s): Wu, Jian; Hiltabrand, Ryan; Soós, Dominik; Giles, C. Lee

Publisher / Repository: ACM

Date Published: 2022-09-20

Journal Name: ACM Symposium on Document Engineering (DocEng 2022)

ISBN: 9781450395441

Page Range / eLocation ID: 1 to 4

Format(s): Medium: X

Location: San Jose, California

Sponsoring Org: National Science Foundation

Conference Paper:
https://doi.org/10.1145/3558100.3563850