Title: Design Considerations for a Sustainable Scholarly Big Data Service
The advancement of web programming techniques, such as Ajax and jQuery, and datastores, such as Apache Solr and Elasticsearch, has made it much easier to deploy small- to medium-scale web-based search engines. However, developing a sustainable search engine that supports scholarly big data services is still challenging, often because of limited human resources and financial support. Such scenarios are typical in academic settings or small businesses. Here, we showcase how four key design decisions were made by trading off competing factors such as performance, cost, and efficiency when developing the Next Generation CiteSeerX (NGX), the successor of CiteSeerX, a pioneering digital library search engine that has been serving academic communities for more than two decades. This work extends our previous work in Wu et al. (2021) and discusses design considerations of infrastructure, web applications, indexing, and document filtering. These design considerations can be generalized to other web-based search engines of a similar scale that are deployed in small business or academic settings with limited resources.
Award ID(s):
1823288
PAR ID:
10473652
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Forum for Information Retrieval Evaluation. (FIRE 2022)
ISBN:
9798400700231
Page Range / eLocation ID:
83 to 87
Format(s):
Medium: X
Location:
Kolkata India
Sponsoring Org:
National Science Foundation
More Like this
  1. Scholarly digital libraries provide access to scientific publications and comprise useful resources for researchers who search for literature on specific subject areas. CiteSeerX is an example of such a digital library search engine that provides access to more than 10 million academic documents and has nearly one million users and three million hits per day. Artificial Intelligence (AI) technologies are used in many components of CiteSeerX, including Web crawling, document ingestion, and metadata extraction. CiteSeerX also uses an unsupervised algorithm called noun phrase chunking (NP-Chunking) to extract keyphrases from documents. However, NP-Chunking often extracts many unimportant noun phrases. In this paper, we investigate and contrast three supervised keyphrase extraction models to explore their deployment in CiteSeerX for extracting high-quality keyphrases. To perform user evaluations on the keyphrases predicted by different models, we integrate a voting interface into CiteSeerX. We show the development and deployment of the keyphrase extraction models and the maintenance requirements.
  2. We overview CiteSeerX, the pioneering digital library search engine that has been serving academic communities for more than 20 years (first released in 1998), from three perspectives. The system perspective summarizes its architecture evolution in three phases over the past 20 years. The data perspective describes how CiteSeerX has created searchable scholarly big datasets and made them freely available for multiple purposes. In order to be scalable and effective, AI technologies are employed in all essential modules. To effectively train these models, a sufficient amount of data has been labeled, which can then be reused for training future models. Finally, we discuss the future of CiteSeerX. Our ongoing work is to make CiteSeerX more sustainable. To this end, we are working to ingest all open access scholarly papers, estimated to be 30-40 million. Part of the plan is to discover dataset mentions and metadata in scholarly articles and make them more accessible via search interfaces. Users will have more opportunities to explore and trace datasets that can be reused and discover other datasets for new research projects. We summarize what was learned to make a similar system more sustainable and useful.
  3. We introduce a re-ranking model that augments the functionality of standard search engines to aid classroom search activities for children (ages 6–11). This model extends the known listwise learning-to-rank framework by balancing risk and reward. Doing so enables the model to prioritize Web resources of high educational alignment, appropriateness, and adequate readability by analyzing the URLs, snippets, and page titles of Web resources retrieved by a mainstream search engine. Experimental results demonstrate the value of considering multiple perspectives inherent to the classroom when designing algorithms that can better support children's information discovery. 
  4. We report the preliminary work on cleansing and classifying a scholarly big dataset containing 10+ million academic documents released by CiteSeerX. We design novel approaches to match paper entities in CiteSeerX to reference datasets, including DBLP, Web of Science, and Medline, resulting in 4.2M unique matches, whose metadata can be cleansed. We also investigate traditional machine learning and neural network methods to classify abstracts into 6 subject categories. The classification results reveal that the current CiteSeerX dataset is highly multidisciplinary, containing papers well beyond computer and information sciences. 
  5. Agapito, G. (Ed.)
    The portable document format (PDF) is currently one of the most popular formats for sharing biomedical information offline. Recently, HTML-based formats for web-first biomedical information sharing have gained popularity. However, machine-interpretable information is required by literature search engines, such as Google Scholar, to index articles in a context-aware manner for accurate biomedical literature searches. The lack of technological infrastructure to add machine-interpretable metadata to expanding biomedical information, on the other hand, renders it unreachable to search engines. Therefore, we developed a portable technical infrastructure (goSemantically) and packaged it as a Google Docs add-on. The “goSemantically” add-on assists authors in adding machine-interpretable metadata at the terminology and document structural levels while authoring biomedical content. The “goSemantically” add-on leverages the NCBO Bioportal resources and introduces a mechanism to annotate biomedical information with relevant machine-interpretable metadata (semantic vocabularies). The “goSemantically” add-on also acquires schema.org meta tags designed for search engine optimization and tailored to accommodate biomedical information. Thus, individual authors can conveniently author and publish biomedical content in a truly decentralized fashion. Users can also export and host content with relevant machine-interpretable metadata (semantic vocabularies) in interoperable formats such as HTML and JSON-LD. To experience the described features, run this code with Google Doc