Title: Modeling Updates of Scholarly Webpages Using Archived Data
The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors’ homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA) and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method estimates updates more accurately than the baseline models at a fraction of the resources. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and we uncover important challenges that inspire future research directions.
Award ID(s): 1823288
PAR ID: 10271904
Author(s) / Creator(s):
Date Published:
Journal Name: 2020 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID: 1868-1877
Format(s): Medium: X
Sponsoring Org: National Science Foundation
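The abstract above names the estimation step only at a high level: fit the mean update frequency λ by maximum likelihood from archived snapshots. As a concrete illustration, here is a minimal sketch of the standard formulation in the crawl-scheduling literature, assuming page updates follow a Poisson process with rate λ and that we only observe whether the page changed between consecutive snapshots. The interval data, units, and optimizer bounds are illustrative assumptions, not the paper's exact setup.

```python
import math
from scipy.optimize import minimize_scalar

def estimate_lambda(intervals, changed):
    """MLE of the mean update frequency (lambda), assuming a Poisson update
    process observed only through change/no-change flags between consecutive
    archived snapshots (a sketch, not the paper's exact estimator)."""
    def neg_log_likelihood(lam):
        ll = 0.0
        for tau, c in zip(intervals, changed):
            if c:
                # P(>= 1 update during a gap of length tau) = 1 - e^(-lambda * tau)
                ll += math.log(1.0 - math.exp(-lam * tau) + 1e-12)
            else:
                # P(no update during the gap) = e^(-lambda * tau)
                ll -= lam * tau
        return -ll
    # Bound lambda away from zero; units follow whatever the intervals use (e.g., days).
    return minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), method="bounded").x

# Hypothetical snapshot gaps in days, with a change detected in three of the four gaps.
print(estimate_lambda([7, 14, 3, 30], [True, False, True, True]))
```

The `1e-12` guard keeps the log finite when the optimizer probes very small λ on intervals where a change was observed.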
More Like this
  1. While many online resources teach basic web development, few are designed to help novices learn the CSS concepts and design patterns experts use to implement complex visual features. Professional webpages embed these design patterns and could serve as rich learning materials, but their stylesheets are complex and difficult for novices to understand. This paper presents Ply, a CSS inspection tool that helps novices use their visual intuition to make sense of professional webpages. We introduce a new visual relevance testing technique to identify properties that have visual effects on the page, which Ply uses to hide visually irrelevant code and surface unintuitive relationships between properties. In user studies, Ply helped novice developers replicate complex web features 50% faster than those using Chrome Developer Tools, and allowed novices to recognize and explain unfamiliar concepts. These results show that visual inspection tools can support learning from complex professional webpages, even for novice developers.
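The "visual relevance testing" idea can be approximated outside of Ply: clear a single CSS property and check whether the rendered pixels change. The sketch below uses Selenium and Pillow for that check; the selector/property interface and the inline-style toggle are assumptions of this illustration, not Ply's actual mechanism, which analyzes author stylesheets.

```python
from io import BytesIO

from PIL import Image, ImageChops
from selenium import webdriver

def is_visually_relevant(driver, selector, prop):
    """Report whether clearing one CSS property on one element changes the
    rendered pixels -- a rough stand-in for visual relevance testing."""
    before = Image.open(BytesIO(driver.get_screenshot_as_png()))
    # Temporarily clear the property via an inline override (hypothetical mechanism).
    driver.execute_script(
        "document.querySelector(arguments[0]).style"
        ".setProperty(arguments[1], 'unset', 'important');",
        selector, prop)
    after = Image.open(BytesIO(driver.get_screenshot_as_png()))
    # Undo the override so later tests see the original page.
    driver.execute_script(
        "document.querySelector(arguments[0]).style.removeProperty(arguments[1]);",
        selector, prop)
    return ImageChops.difference(before, after).getbbox() is not None

# Usage sketch: check whether 'position' visibly matters for a hypothetical header.
driver = webdriver.Chrome()
driver.get("https://example.com")
print(is_visually_relevant(driver, "header", "position"))
driver.quit()
```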
  2. By repeatedly crawling and saving web pages over time, web archives (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the widespread presence of JavaScript on the web. Some archives store petabytes of JavaScript code, and yet many pages render incorrectly when users load them. Other archives which store the end-state of page loads (e.g., screen captures) break post-load interactions implemented in JavaScript. To address these problems, we present Jawa, a new design for web archives which significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to enabling Jawa’s use at scale are our observations on a) the forms of non-determinism which impair the execution of JavaScript on archived pages, and b) the ways in which JavaScript’s execution fundamentally differs between live web pages and their archived copies. On a corpus of 1 million archived pages, Jawa reduces overall storage needs by 41%, when compared to the techniques currently used by the Internet Archive. 
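One concrete form of the non-determinism the Jawa abstract points to is JavaScript reading the clock or the random number generator, both of which differ between crawl time and replay time. Purely as an illustration (not Jawa's actual design), an archive rewriter could pin those sources before any page script runs:

```python
# Toy illustration: pin two common sources of JavaScript non-determinism so an
# archived page replays more like it behaved at crawl time.
PIN_SCRIPT = """<script>
(function () {
  var crawlTime = %d;                         // epoch ms recorded at archive time
  Date.now = function () { return crawlTime; };
  Math.random = function () { return 0.5; };  // fixed, obviously lossy stand-in
})();
</script>"""

def pin_nondeterminism(html, crawl_time_ms):
    """Insert the pinning script at the top of <head>, before any page
    script can observe the live clock or PRNG."""
    return html.replace("<head>", "<head>" + PIN_SCRIPT % crawl_time_ms, 1)

# Usage sketch:
archived = "<html><head><title>t</title></head><body></body></html>"
print(pin_nondeterminism(archived, 1609459200000))
```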
  3. Data-parallel frameworks have become essential for training machine learning models. The classic Bulk Synchronous Parallel (BSP) model updates the model parameters at pre-defined synchronization barriers. However, when one worker computes significantly slower than the others, waiting for it wastes the computing resources of the rest. In this paper, we propose a novel proactive data-parallel (PDP) framework in which the parameter server initiates updates of the model parameters: an update can be performed at any time, without pre-defined update points. PDP not only initiates updates but also determines when to perform them, and this global decision on update frequency accelerates training. We further propose asynchronous PDP to reduce the idle time caused by synchronizing parameter updates, and we theoretically prove its convergence. We implement a distributed PDP framework and evaluate it with several popular machine learning algorithms, including Multilayer Perceptron, Convolutional Neural Network, K-means, and Gaussian Mixture Model. Our evaluation shows that PDP can achieve up to 20X speedup over the BSP model and scale to large clusters.
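To make the "server-initiated update" idea concrete, here is a toy sketch of a parameter server that applies an update on its own schedule once enough gradients have arrived, instead of waiting at a BSP barrier. The threshold policy, learning rate, and gradient averaging are assumptions of this sketch, not the PDP paper's actual policy.

```python
import threading
import numpy as np

class ProactiveParameterServer:
    """Toy sketch of the proactive idea: the server, not a synchronization
    barrier, decides when pending worker gradients are folded into the model."""

    def __init__(self, dim, lr=0.01, min_grads=2):
        self.params = np.zeros(dim)
        self.pending = []            # gradients pushed since the last update
        self.lr = lr
        self.min_grads = min_grads   # how many arrivals trigger an update (assumed rule)
        self.lock = threading.Lock()

    def push_gradient(self, grad):
        """Workers call this at any time; nobody blocks at a barrier."""
        with self.lock:
            self.pending.append(np.asarray(grad))

    def maybe_update(self):
        """Called on the server's own schedule: apply an SGD step as soon as
        enough gradients have arrived, even if some workers are still busy."""
        with self.lock:
            if len(self.pending) >= self.min_grads:
                self.params -= self.lr * np.mean(self.pending, axis=0)
                self.pending.clear()

# Usage sketch: two fast workers are enough to trigger an update; a straggler
# that pushes later simply contributes to the next one.
server = ProactiveParameterServer(dim=3)
server.push_gradient([0.1, 0.0, -0.2])
server.push_gradient([0.3, -0.1, 0.0])
server.maybe_update()
print(server.params)
```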
  4. Academic mobility has accelerated, in part due to recent civil rights movements and higher levels of social mobility. This trend increases the threat of brain drain from Historically Black Colleges and Universities (HBCUs), which already face significant logistical challenges despite broad success in the advancement of Black professionals. We aim to examine this threat from a Science of Science perspective by collecting diachronic data for a large-scale longitudinal analysis of HBCU faculty’s academic mobility. Our study uses Memento, manual collection, and web scraping to aggregate historical identifiers (URI-Ms) of webpages from 35 HBCUs across multiple web archives. We are thus able to extend the use of “canonicalization” to associate past versions of webpages that resided at different URIs with their current URI, allowing for a more accurate view of the pages over time. In this paper, we define and execute a novel data collection method that is essential for our examination of HBCU human capital changes and supports a movement toward a more equitable academic workforce.
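The Memento-based collection step can be illustrated with the Internet Archive's public TimeMap endpoint, which lists the archived copies (mementos) of a URL in link format. The parsing below is a minimal single-archive sketch; the study aggregates TimeMaps from multiple archives and adds manual collection and scraping on top of this.

```python
import requests

def fetch_mementos(url):
    """Fetch the Internet Archive's TimeMap (Memento protocol) for a URL and
    return (datetime, archived URI) pairs."""
    timemap_uri = f"https://web.archive.org/web/timemap/link/{url}"
    resp = requests.get(timemap_uri, timeout=30)
    resp.raise_for_status()
    mementos = []
    for line in resp.text.splitlines():
        # Memento entries carry rel="memento" (or "first/last memento")
        # plus a datetime attribute; other link relations have neither.
        if "memento" in line and 'datetime="' in line:
            uri = line.split(";")[0].strip().strip("<>,")
            dt = line.split('datetime="')[1].split('"')[0]
            mementos.append((dt, uri))
    return mementos

# Usage sketch with a hypothetical homepage URL:
for dt, uri in fetch_mementos("http://example.com/~scholar/")[:5]:
    print(dt, uri)
```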
  5. Navigating webpages with screen readers is a challenge even with recent improvements in screen reader technologies and the increased adoption of web standards for accessibility, namely ARIA. ARIA landmarks, an important aspect of ARIA, let screen reader users access different sections of a webpage quickly by enabling them to skip over blocks of irrelevant or redundant content. However, these landmarks are used sporadically and inconsistently by web developers and are entirely absent from many web pages. Therefore, we propose SaIL, a scalable approach that automatically detects the important sections of a web page and then injects ARIA landmarks into the corresponding HTML markup to facilitate quick access to these sections. The central concept underlying SaIL is visual saliency, which is determined using a state-of-the-art deep learning model trained on gaze-tracking data collected from sighted users in the context of web browsing. We present the findings of a pilot study that demonstrated the potential of SaIL in reducing both the time and effort spent navigating webpages with screen readers.
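The injection half of SaIL's pipeline is straightforward to picture. Assuming a saliency model has already flagged certain sections (the selectors below are hypothetical stand-ins for its predictions), the markup change amounts to adding ARIA landmark attributes:

```python
from bs4 import BeautifulSoup

def inject_landmarks(html, salient_selectors):
    """Tag sections flagged as visually salient with ARIA landmark attributes
    so screen reader users can jump straight to them. The selectors stand in
    for a saliency model's predictions; the model itself is out of scope here."""
    soup = BeautifulSoup(html, "html.parser")
    for i, selector in enumerate(salient_selectors, start=1):
        for node in soup.select(selector):
            node["role"] = "region"                        # generic ARIA landmark
            node["aria-label"] = f"Important section {i}"  # label the reader announces
    return str(soup)

# Usage sketch with hypothetical predictions from the saliency model:
page = "<body><div id='news'>Top stories...</div><div id='ads'>...</div></body>"
print(inject_landmarks(page, ["#news"]))
```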