

Search for: All records

Award ID contains: 1823288

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. The advancement of web programming techniques, such as Ajax and jQuery, and of datastores, such as Apache Solr and Elasticsearch, has made it much easier to deploy small- to medium-scale web-based search engines. However, developing a sustainable search engine that supports scholarly big data services is still challenging, often because of limited human resources and financial support. Such scenarios are typical in academic settings or small businesses. Here, we showcase how four key design decisions were made by trading off competing factors such as performance, cost, and efficiency when developing the Next Generation CiteSeerX (NGX), the successor of CiteSeerX, a pioneering digital library search engine that has been serving academic communities for more than two decades. This work extends our previous work in Wu et al. (2021) and discusses design considerations for infrastructure, web applications, indexing, and document filtering. These design considerations can be generalized to other web-based search engines of a similar scale that are deployed in small-business or academic settings with limited resources.
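    As a rough illustration of the kind of datastore-backed deployment the abstract mentions, the Python sketch below indexes and queries a handful of paper records with Elasticsearch. The index name, field names, and documents are invented for the example, the keyword-argument style assumes a recent (8.x) elasticsearch-py client, and this is not the actual NGX indexing pipeline.

```python
# Illustrative sketch only: index a few paper records in Elasticsearch and run a
# match query. Index/field names and documents are invented; assumes a local
# Elasticsearch node and a recent (8.x) `elasticsearch` Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

papers = [
    {"title": "Next Generation CiteSeerX design notes",
     "abstract": "Design trade-offs for a scholarly search engine.", "year": 2022},
    {"title": "Crawling the scholarly web",
     "abstract": "Focused crawling and document filtering.", "year": 2021},
]

for i, doc in enumerate(papers):
    es.index(index="papers", id=str(i), document=doc)   # rely on dynamic mapping
es.indices.refresh(index="papers")                      # make documents searchable now

resp = es.search(index="papers", query={"match": {"abstract": "search engine"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```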
  2. Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets with more than 130 million scholarly paper records. S2ORC contains a significant portion of automatically generated metadata. The metadata quality could impact downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linking to six major databases, but the linking quality varies depending on subject domains. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high performance (F1=0.960) and a much reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation.
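    To make the near-duplicate detection idea concrete, here is a small, self-contained Python sketch of MinHash signatures with banded locality-sensitive hashing, the general technique the abstract identifies as most effective. The shingle size, signature length, and band layout are arbitrary illustrative choices, not the parameters used in the study.

```python
# Minimal MinHash + banded LSH sketch for near-duplicate candidate detection.
# Illustrative parameters only; not the configuration used in the paper.
import hashlib
from collections import defaultdict

NUM_HASHES = 64          # length of each MinHash signature
BANDS, ROWS = 16, 4      # 16 bands x 4 rows = 64; candidates collide in >= 1 band

def shingles(text, k=5):
    """Character k-shingles of normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(sh):
    """One min value per seeded hash function, over all shingles."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in sh))
    return sig

def lsh_candidates(docs):
    """Return pairs of doc ids whose signatures collide in at least one band."""
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(shingles(text))
        for b in range(BANDS):
            band = tuple(sig[b * ROWS:(b + 1) * ROWS])
            buckets[(b, band)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

docs = {
    "a": "Assessing the quality of document links in a scholarly corpus.",
    "b": "Assessing the quality of document links in a scholarly corpus!",
    "c": "A completely different abstract about neural networks.",
}
print(lsh_candidates(docs))   # likely {('a', 'b')}
```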
  3. Presentation slides describing the content of scientific and technical papers are an efficient and effective way to present that work. However, manually generating presentation slides is labor intensive. We propose a method to automatically generate slides for scientific papers based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods including SummaRuNNer by a significant margin in terms of ROUGE score.
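    The window-based scoring idea can be sketched as follows: each sentence is scored by its similarity to a local window of neighboring sentences (importance) and its dissimilarity to earlier sentences (novelty) using TF-IDF vectors. This toy Python example only illustrates the general idea; it is not the paper's SummaRuNNer-based labeling model, and the window size and weights are arbitrary assumptions.

```python
# Toy sketch: score sentences by importance within a local window plus novelty
# against earlier sentences. Illustrative only; not the paper's neural model.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def score_sentences(sentences, window=3, alpha=0.7):
    vecs = TfidfVectorizer().fit_transform(sentences)
    scores = []
    for i in range(len(sentences)):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        centroid = np.asarray(vecs[lo:hi].mean(axis=0))          # window centroid
        importance = cosine_similarity(vecs[i], centroid)[0, 0]  # closeness to the window
        # novelty: penalize overlap with sentences that came before
        novelty = 1.0 if i == 0 else 1.0 - cosine_similarity(vecs[i], vecs[:i]).max()
        scores.append(alpha * importance + (1 - alpha) * novelty)
    return scores

sents = [
    "We propose a method to automatically generate slides for scientific papers.",
    "Slides are generated from a corpus of paper-slide pairs.",
    "The conference venue was very pleasant.",
]
print(score_sentences(sents))   # higher score = better slide candidate
```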
  4. Subject categories of scholarly papers generally refer to the knowledge domain(s) to which the papers belong, examples being computer science or physics. Subject category classification is a prerequisite for bibliometric studies, organizing scientific publications for domain knowledge extraction, and facilitating faceted searches for digital library search engines. Unfortunately, many academic papers do not have such information as part of their metadata. Most existing methods for solving this task focus on unsupervised learning that often relies on citation networks. However, a complete list of papers citing the current paper may not be readily available. In particular, new papers that have few or no citations cannot be classified using such methods. Here, we propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts. The network is trained using nine million abstracts from Web of Science (WoS). We also use the WoS schema that covers 104 subject categories. The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer. We compare our model against baselines by varying the architecture and text representation. Our best model achieves a micro-F1 measure of 0.76, with F1 of individual subject categories ranging from 0.50 to 0.95. The results showed the importance of retraining word embedding models to maximize the vocabulary overlap and the effectiveness of the attention mechanism. The combination of word vectors with TFIDF outperforms character- and sentence-level embedding models. We discuss imbalanced samples and overlapping categories and suggest possible strategies for mitigation. We also determine the subject category distribution in CiteSeerX by classifying a random sample of one million academic papers.
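    The described architecture (stacked bi-directional recurrent layers followed by an attention layer and a classifier over 104 categories) can be sketched schematically in PyTorch as below. The vocabulary size, embedding and hidden dimensions, and the choice of GRU cells are illustrative assumptions; this is not the trained DANN model.

```python
# Schematic sketch of a bi-directional RNN + attention text classifier over 104
# subject categories. Dimensions and cell type are illustrative assumptions.
import torch
import torch.nn as nn

class AttentiveClassifier(nn.Module):
    def __init__(self, vocab_size=50_000, emb_dim=300, hidden=128, num_classes=104):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # two stacked bi-directional recurrent layers
        self.rnn = nn.GRU(emb_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.att = nn.Linear(2 * hidden, 1)           # per-token attention scores
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                     # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))        # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.att(h), dim=1)   # (batch, seq_len, 1)
        context = (weights * h).sum(dim=1)            # attention-weighted sum
        return self.out(context)                      # (batch, num_classes) logits

model = AttentiveClassifier()
dummy = torch.randint(1, 50_000, (2, 40))             # two fake abstracts, 40 tokens each
print(model(dummy).shape)                              # torch.Size([2, 104])
```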
  5. The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors’ homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency values. Our evaluation shows that update frequency values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimations of updates at a fraction of resources compared to the baseline models. Based on this, we demonstrate the utility of archived data to optimize the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
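    A minimal sketch of the estimation step, under the common simplifying assumption that page changes follow a Poisson process: the maximum-likelihood estimate of the mean update rate is then the number of detected updates divided by the observed time span. The snapshot data below are invented, and changes between snapshots are necessarily undercounted, so this illustrates the idea rather than the estimator evaluated in the paper.

```python
# Illustrative sketch: estimate a page's mean update frequency from archived
# snapshots, assuming a Poisson change process. Snapshot dates/digests are
# invented; updates that occur between snapshots are undercounted.
from datetime import date

# (snapshot date, digest of archived content), e.g. from Internet Archive copies
snapshots = [
    (date(2020, 1, 1),  "d41d8"),
    (date(2020, 3, 1),  "d41d8"),   # unchanged since previous snapshot
    (date(2020, 6, 1),  "a93f2"),   # changed
    (date(2020, 12, 1), "7bc01"),   # changed
]

updates = sum(prev[1] != cur[1] for prev, cur in zip(snapshots, snapshots[1:]))
span_days = (snapshots[-1][0] - snapshots[0][0]).days
rate_per_day = updates / span_days        # ML estimate of the Poisson rate
print(f"{updates} detected updates over {span_days} days "
      f"-> ~{rate_per_day:.4f} updates/day")
```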
  6. Acknowledgements are ubiquitous in scholarly papers. Existing acknowledgement entity recognition methods assume all named entities are acknowledged. Here, we examine the nuances between acknowledged and named entities by analyzing sentence structure. We develop an acknowledgement extraction system, ACKEXTRACT, based on open-source text mining software and evaluate our method using manually labeled data. ACKEXTRACT uses the PDF of a scholarly paper as input and outputs acknowledgement entities. Results show an overall performance of F1 = 0.92. We built a supplementary database by linking CORD-19 papers with acknowledgement entities extracted by ACKEXTRACT, including persons and organizations, and find that only up to 50–60% of named entities are actually acknowledged. We further analyze chronological trends of acknowledgement entities in CORD-19 papers. All code and labeled data are publicly available at https://github.com/lamps-lab/ackextract.
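    The full ACKEXTRACT pipeline (PDF in, acknowledgement entities out) is available at the repository above; as a rough illustration of its final step, the sketch below runs an off-the-shelf named-entity recognizer over sentences from an acknowledgements section and keeps person and organization mentions. spaCy is used here only as a convenient open-source NER, the input text is invented, and PDF text extraction is omitted.

```python
# Toy illustration: NER over acknowledgement sentences, keeping PERSON/ORG
# mentions. Not the actual ACKEXTRACT toolchain; PDF extraction is omitted.
import spacy

nlp = spacy.load("en_core_web_sm")     # small English pipeline with NER

ack_text = (
    "We thank Jane Doe for helpful discussions. "
    "This work was partially supported by the National Science Foundation."
)

entities = []
doc = nlp(ack_text)
for sent in doc.sents:                 # sentence-level pass over the section
    for ent in sent.ents:
        if ent.label_ in {"PERSON", "ORG"}:
            entities.append((ent.text, ent.label_))

print(entities)   # e.g. [('Jane Doe', 'PERSON'), ('the National Science Foundation', 'ORG')]
```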
  7. Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers who search for literature on specific subject areas. CiteSeerX is an example of such a digital library search engine that provides access to more than 10 million academic documents and has nearly one million users and three million hits per day. Artificial Intelligence (AI) technologies are used in many components of CiteSeerX, including Web crawling, document ingestion, and metadata extraction. CiteSeerX also uses an unsupervised algorithm called noun phrase chunking (NP-Chunking) to extract keyphrases out of documents. However, NP-Chunking often extracts many unimportant noun phrases. In this paper, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in CiteSeerX for extracting high-quality keyphrases. To perform user evaluations on the keyphrases predicted by different models, we integrate a voting interface into CiteSeerX. We describe the development and deployment of the keyphrase extraction models and their maintenance requirements.
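    For contrast with the supervised models, an unsupervised NP-chunking baseline of the kind the abstract describes can be sketched as below: extract noun phrases and rank them by frequency. spaCy's noun_chunks is used here as a stand-in chunker and the sample text is invented; this is not CiteSeerX's actual NP-Chunking component.

```python
# Sketch of an unsupervised NP-chunking keyphrase baseline: collect noun
# phrases and rank by frequency. Illustrative stand-in, not CiteSeerX's code.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def keyphrases(text, top_k=5):
    doc = nlp(text)
    phrases = Counter()
    for chunk in doc.noun_chunks:
        # drop leading determiners and normalize case
        words = [t.text.lower() for t in chunk if t.pos_ != "DET"]
        if words:
            phrases[" ".join(words)] += 1
    return [p for p, _ in phrases.most_common(top_k)]

sample = ("Digital library search engines use metadata extraction and keyphrase "
          "extraction to index scholarly documents. Keyphrase extraction quality "
          "affects search and recommendation in digital library search engines.")
print(keyphrases(sample))
```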