Title: Scholarly Very Large Data: Challenges for Digital Libraries
The volume of scholarly data has been growing exponentially over the last 50 years. The total number of open access documents is estimated to reach 35 million by 2022. The total amount of data to be handled, including crawled documents, production repository, metadata, extracted content, and their replications, can be as high as 350 TB. Academic digital library search engines face significant challenges in maintaining sustainable services. We discuss these challenges and propose feasible solutions to key modules in the digital library architecture, including the document storage, data extraction, database, and index. We use CiteSeerX as a case study.
Award ID(s):
1823288
PAR ID:
10173814
Author(s) / Creator(s):
Date Published:
Journal Name:
Large Scale Networking (LSN) Workshop on Huge Data: A Computing, Networking and Distributed Systems Perspective
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The volume of scholarly data has been growing exponentially over the last 50 years. The total number of open access documents is estimated to reach 35 million by 2022. The total amount of data to be handled, including crawled documents, production repository, metadata, extracted content, and their replications, can be as high as 350 TB. Academic digital library search engines face significant challenges in maintaining sustainable services. We discuss these challenges and propose feasible solutions to key modules in the digital library architecture, including the document storage, data extraction, database, and index. We use CiteSeerX as a case study.
  2. The volume of scholarly data has been growing exponentially over the last 50 years. The total number of open access documents is estimated to reach 35 million by 2022. The total amount of data to be handled, including crawled documents, production repository, metadata, extracted content, and their replications, can be as high as 350 TB. Academic digital library search engines face significant challenges in maintaining sustainable services. We discuss these challenges and propose feasible solutions to key modules in the digital library architecture, including the document storage, data extraction, database, and index. We use CiteSeerX as a case study.
  3. Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers who search for literature on specific subject areas. CiteSeerX is an example of such a digital library search engine that provides access to more than 10 million academic documents and has nearly one million users and three million hits per day. Artificial Intelligence (AI) technologies are used in many components of CiteSeerX including Web crawling, document ingestion, and metadata extraction. CiteSeerX also uses an unsupervised algorithm called noun phrase chunking (NP-Chunking) to extract keyphrases out of documents. However, often NP-Chunking extracts many unimportant noun phrases. In this paper, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in CiteSeerX for extracting high quality keyphrases. To perform user evaluations on the keyphrases predicted by different models, we integrate a voting interface into CiteSeerX. We show the development and deployment of the keyphrase extraction models and the maintenance requirements. 
  4. Background: There is increased interest in using artificial intelligence (AI) to provide participation-focused pediatric re/habilitation. Existing reviews on the use of AI in participation-focused pediatric re/habilitation focus on interventions and do not screen articles based on their definition of participation. AI-based assessments may help reduce provider burden and can support operationalization of the construct under investigation. To extend knowledge of the landscape on AI use in participation-focused pediatric re/habilitation, a scoping review on AI-based participation-focused assessments is needed. Objective: To understand how the construct of participation is captured and operationalized in pediatric re/habilitation using AI. Methods: We conducted a scoping review of literature published in PubMed, PsycInfo, ERIC, CINAHL, IEEE Xplore, ACM Digital Library, ProQuest Dissertation and Theses, ACL Anthology, AAAI Digital Library, and Google Scholar. Documents were screened by 2–3 independent researchers following a systematic procedure and using the following inclusion criteria: (1) focuses on capturing participation using AI; (2) includes data on children and/or youth with a congenital or acquired disability; and (3) published in English. Data from included studies were extracted [e.g., demographics, type(s) of AI used], summarized, and sorted into categories of participation-related constructs. Results: Twenty-one out of 3,406 documents were included. Included assessment approaches mainly captured participation through annotated observations (n = 20; 95%), were administered in person (n = 17; 81%), and applied machine learning (n = 20; 95%) and computer vision (n = 13; 62%). None integrated the child or youth perspective and only one included the caregiver perspective. All assessment approaches captured behavioral involvement, and none captured emotional or cognitive involvement or attendance. Additionally, 24% (n = 5) of the assessment approaches captured participation-related constructs like activity competencies and 57% (n = 12) captured aspects not included in contemporary frameworks of participation. Conclusions: Main gaps for future research include lack of: (1) research reporting on common demographic factors and including samples representing the population of children and youth with a congenital or acquired disability; (2) AI-based participation assessment approaches integrating the child or youth perspective; (3) remotely administered AI-based assessment approaches capturing both child or youth attendance and involvement; and (4) AI-based assessment approaches aligning with contemporary definitions of participation.
  5. Chemical Safety Data Sheets (SDS) are the primary method by which chemical manufacturers communicate the ingredients and hazards of their products to the public. These SDSs are used for a wide variety of purposes ranging from environmental calculations to occupational health assessments to emergency response measures. Although a few companies have provided direct digital data transfer platforms using XML or equivalent schemata, the vast majority of chemical ingredient and hazard communication to product users still occurs through the use of millions of PDF documents that are largely loaded through manual data entry into downstream user databases. This research focuses on the reverse engineering of SDS document types to adapt to various layouts and the harnessing of meta-algorithmic and neural network approaches to provide a means of moving industrial institutions towards a digital universal SDS processing methodology. The complexities of SDS documents including the lack of format standardization, text and image combinations, and multi-lingual translation needs, combined, limit the accuracy and precision of optical character recognition tools. The approach in this document is to translate entire SDSs from thousands of chemical vendors, each with distinct formatting, to machine-encoded text with a high degree of accuracy and precision. Then the system will "read" and assess these documents as a human would; that is, ensuring that the documents are compliant, determining whether chemical formulations have changed, ensuring reported values are within expected thresholds, and comparing them to similar products for more environmentally friendly alternatives.