The advancement of web programming techniques, such as Ajax and jQuery, and datastores, such as Apache Solr and Elasticsearch, has made it much easier to deploy small- to medium-scale web-based search engines. However, developing a sustainable search engine that supports scholarly big data services is still challenging, often because of limited human resources and financial support. Such scenarios are typical in academic settings or small businesses. Here, we showcase how four key design decisions were made by trading off competing factors such as performance, cost, and efficiency when developing the Next Generation CiteSeerX (NGX), the successor of CiteSeerX, a pioneering digital library search engine that has been serving academic communities for more than two decades. This work extends our previous work in Wu et al. (2021) and discusses design considerations of infrastructure, web applications, indexing, and document filtering. These design considerations can be generalized to other web-based search engines of a similar scale that are deployed in small business or academic settings with limited resources.
Multi-Perspective Learning to Rank to Support Children's Information Seeking in the Classroom
We introduce a re-ranking model that augments the functionality of standard search engines to aid classroom search activities for children (ages 6–11). This model extends the known listwise learning-to-rank framework by balancing risk and reward. Doing so enables the model to prioritize Web resources of high educational alignment, appropriateness, and adequate readability by analyzing the URLs, snippets, and page titles of Web resources retrieved by a mainstream search engine. Experimental results demonstrate the value of considering multiple perspectives inherent to the classroom when designing algorithms that can better support children's information discovery.
- Award ID(s):
- 1763649
- PAR ID:
- 10513305
- Publisher / Repository:
- 2023 IEEE International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT) (IEEE)
- Date Published:
- Page Range / eLocation ID:
- 311–317
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Understanding the past, present, and changing behavior of the climate requires close collaboration of a large number of researchers from many scientific domains. At present, the necessary interdisciplinary collaboration is greatly limited by the difficulties in discovering, sharing, and integrating climatic data due to the tremendously increasing data size. This paper discusses the methods and techniques for solving the inter-related problems encountered when transmitting, processing, and serving metadata for heterogeneous Earth System Observation and Modeling (ESOM) data. A cyberinfrastructure-based solution is proposed to enable effective cataloging and two-step search on big climatic datasets by leveraging state-of-the-art web service technologies and crawling the existing data centers. To validate its feasibility, the big dataset served by UCAR THREDDS Data Server (TDS), which provides Petabyte-level ESOM data and updates hundreds of terabytes of data every day, is used as the case study dataset. A complete workflow is designed to analyze the metadata structure in TDS and create an index for data parameters. A simplified registration model which defines constant information, delimits secondary information, and exploits spatial and temporal coherence in metadata is constructed. The model derives a sampling strategy for a high-performance concurrent web crawler bot which is used to mirror the essential metadata of the big data archive without overwhelming network and computing resources. The metadata model, crawler, and standard-compliant catalog service form an incremental search cyberinfrastructure, allowing scientists to search the big climatic datasets in near real-time. 
The proposed approach has been tested on UCAR TDS, and the results prove that it achieves its design goal by boosting the crawling speed by at least 10 times and reducing the redundant metadata from 1.85 gigabytes to 2.2 megabytes, which is a significant breakthrough toward making most of today's non-searchable climate data servers searchable.
-
Hagen, Matthias and (Ed.) Readability is a core component of information retrieval (IR) tools, as the complexity of a resource directly affects its relevance: a resource is only of use if the user can comprehend it. Even so, the link between readability and IR is often overlooked. As a step towards advancing knowledge on the influence of readability on IR, we focus on Web search for children. We explore how traditional formulas, which are simple, efficient, and portable, fare when applied to estimating the readability of Web resources for children written in English. We then present a formula well-suited for readability estimation of child-friendly Web resources. Lastly, we empirically show that readability can sway children’s information access. Outcomes from this work reveal that: (i) for Web resources targeting children, a simple formula suffices as long as it considers contemporary terminology and audience requirements, and (ii) instead of turning to Flesch-Kincaid, a popular formula, the use of the “right” formula can shape Web search tools to best serve children. The work we present herein builds on three pillars: Audience, Application, and Expertise. It serves as a blueprint to place readability estimation methods that best apply to and inform IR applications serving varied audiences.
-
Online classes are typically conducted by using video conferencing software such as Zoom, Microsoft Teams, and Google Meet. Research has identified drawbacks of online learning, such as “Zoom fatigue”, characterized by distractions and lack of engagement. This study presents the CUNY Affective and Responsive Virtual Environment (CARVE) Hub, a novel virtual reality hub that uses a facial emotion classification model to generate emojis for affective and informal responsive interaction in a 3D virtual classroom setting. A web-based machine learning model is employed for facial emotion classification, enabling students to communicate four basic emotions live through automated web camera capture in a virtual classroom without activating their cameras. The experiment is conducted in undergraduate classes on both Zoom and CARVE, and the results of a survey indicate that students have a positive perception of interactions in the proposed virtual classroom compared with Zoom. Correlations between automated emojis and interactions are also observed. This study discusses potential explanations for the improved interactions, including a decrease in pressure on students when they are not showing their faces. In addition, video panels in traditional remote classrooms may be useful for communication but not for interaction. Students favor features in virtual reality, such as spatial audio and the ability to move around, with collaboration being identified as the most helpful feature.
-
The COVID-19 pandemic has stimulated important changes in online information access as digital engagement became necessary to meet the demand for health, economic, and educational resources. Our analysis of 55 billion everyday web search interactions during the pandemic across 25,150 US ZIP codes reveals that the extent to which different communities of internet users enlist digital resources varies based on socioeconomic and environmental factors. For example, we find that ZIP codes with lower income intensified their access to health information to a smaller extent than ZIP codes with higher income. We show that ZIP codes with higher proportions of Black or Hispanic residents intensified their access to unemployment resources to a greater extent, while revealing patterns of unemployment site visits unseen by the claims data. Such differences frame important questions on the relationship between differential information search behaviors and the downstream real-world implications on more and less advantaged populations.