Web pages today commonly include large amounts of JavaScript code in order to offer users a dynamic experience. These scripts often make pages slow to load, partly due to a fundamental inefficiency in how browsers process JavaScript content: browsers make it easy for web developers to reason about page state by serially executing all scripts on any frame in a page, but as a result, fail to leverage the multiple CPU cores that are readily available even on low-end phones. In this paper, we show how to address this inefficiency without requiring pages to be rewritten or browsers to be modified. The key to our solution, Horcrux, is to account for the non-determinism intrinsic to web page loads and the constraints placed by the browser’s API for parallelism. Horcrux-compliant web servers perform offline analysis of all the JavaScript code on any frame they serve to conservatively identify, for every JavaScript function, the union of the page state that the function could access across all loads of that page. Horcrux’s JavaScript scheduler then uses this information to judiciously parallelize JavaScript execution on the client-side so that the end-state is identical to that of a serial execution, while minimizing coordination and offloading overheads. Across a wide range of pages, phones, and mobile networks covering web workloads in both developed and emerging regions, Horcrux reduces median browser computation delays by 31-44% and page load times by 18-37%. 
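As a rough illustration of the scheduling idea described in the abstract above (not Horcrux's actual implementation), the sketch below assumes each JavaScript function comes annotated with the page state it may read or write, as produced by offline analysis; two functions may run in the same parallel batch only if neither writes state the other touches. All names here (`StateSummary`, `conflicts`, `schedule`) are hypothetical.

```typescript
// Hypothetical per-function summary produced by offline analysis:
// the union of page state a function may read or write across loads.
interface StateSummary {
  id: string;
  reads: Set<string>;
  writes: Set<string>;
}

// Two functions conflict if either writes state the other reads or writes.
function conflicts(a: StateSummary, b: StateSummary): boolean {
  const touches = (s: Set<string>, t: Set<string>) =>
    [...s].some((x) => t.has(x));
  return (
    touches(a.writes, b.reads) ||
    touches(a.writes, b.writes) ||
    touches(b.writes, a.reads)
  );
}

// Greedy scheduler: walk functions in their original (serial) order and
// group consecutive non-conflicting ones into batches that could be
// offloaded to parallel workers; batch order preserves serial end-state.
function schedule(fns: StateSummary[]): StateSummary[][] {
  const batches: StateSummary[][] = [];
  for (const fn of fns) {
    const last = batches[batches.length - 1];
    if (last && last.every((other) => !conflicts(fn, other))) {
      last.push(fn);      // safe to run alongside the current batch
    } else {
      batches.push([fn]); // a conflict forces a new batch
    }
  }
  return batches;
}
```

A real scheduler would additionally respect the browser's constraint that DOM access happens only on the main thread and handle dynamically generated code; this sketch shows only the conflict check.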
Toward Better Efficiency vs. Fidelity Tradeoffs in Web Archives

This content will become publicly available on October 28, 2026.
            Operators of web archives have two options for how to crawl pages from the web. Browser-based dynamic crawlers capture all of the resources on every page, but incur high compute overheads. Static browserless crawlers are more lightweight, but miss page resources which are fetched only when scripts are executed. In this paper, we make the case that a web archive does not have to make a binary choice between dynamic or static crawling. Instead, by using a browser for a carefully chosen small subset of crawls, an archive can significantly improve its ability to serve statically crawled pages with high fidelity. First, we show how to reuse crawled resources, both across pages and across multiple crawls of the same page over time. Second, by leveraging a dynamic crawl of a page, we show that subsequent static crawls of the page can be augmented to fetch resources without executing the scripts which request them. We estimate that, as long as 8.9% of page crawls use a browser, an archive can serve roughly 99% of the remaining statically crawled pages without any loss in fidelity, up from 55% without our techniques. 
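To make the second technique above concrete, here is a minimal, hypothetical sketch of how an archive might record the script-fetched URLs observed during one browser-based crawl of a page and replay those fetches during later static crawls, without executing any scripts. The manifest format and function names are assumptions for illustration, not the paper's implementation, and the code assumes a runtime with a global `fetch`.

```typescript
// URLs fetched by scripts during a single browser-based (dynamic) crawl,
// keyed by page URL. A real archive would persist this alongside the crawl data.
type DynamicManifest = Map<string, string[]>;

// Plain static fetch of a page's HTML (stub: a real crawler would also parse
// the HTML for statically referenced resources such as images and stylesheets).
async function staticCrawl(pageUrl: string): Promise<Map<string, ArrayBuffer>> {
  const body = await (await fetch(pageUrl)).arrayBuffer();
  return new Map([[pageUrl, body]]);
}

// Augmented static crawl: also fetch the script-requested resources that a
// previous dynamic crawl of the same page observed, skipping script execution.
async function augmentedStaticCrawl(
  pageUrl: string,
  manifest: DynamicManifest
): Promise<Map<string, ArrayBuffer>> {
  const archive = await staticCrawl(pageUrl);
  for (const url of manifest.get(pageUrl) ?? []) {
    try {
      archive.set(url, await (await fetch(url)).arrayBuffer());
    } catch {
      // The resource may have disappeared; a real crawler would log and retry.
    }
  }
  return archive;
}
```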
- Award ID(s): 2403432
- PAR ID: 10641471
- Publisher / Repository: ACM SIGCOMM Internet Measurement Conference
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors' homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA) and estimate when their actual updates occurred. Next, we apply maximum likelihood estimation to derive their mean update frequency values (a simplified sketch of this estimation appears after this list). Our evaluation shows that update-frequency values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimates of updates at a fraction of the resources required by the baseline models. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and we uncover important challenges that inspire future research directions.
- By repeatedly crawling and saving web pages over time, web archives (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the widespread presence of JavaScript on the web. Some archives store petabytes of JavaScript code, and yet many pages render incorrectly when users load them. Other archives which store the end-state of page loads (e.g., screen captures) break post-load interactions implemented in JavaScript. To address these problems, we present Jawa, a new design for web archives which significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to enabling Jawa's use at scale are our observations on a) the forms of non-determinism which impair the execution of JavaScript on archived pages, and b) the ways in which JavaScript's execution fundamentally differs between live web pages and their archived copies. On a corpus of 1 million archived pages, Jawa reduces overall storage needs by 41%, when compared to the techniques currently used by the Internet Archive.
- Automated verification can ensure that a web page satisfies accessibility, usability, and design properties regardless of the end user's device, preferences, and assistive technologies. However, state-of-the-art verification tools for layout properties do not scale to large pages because they rely on whole-page analyses and must reason about the entire page using the complex semantics of the browser layout algorithm. This paper introduces and formalizes modular layout proofs. A modular layout proof splits a monolithic verification problem into smaller verification problems, one for each component of a web page. Each component specification can use rely/guarantee-style preconditions to make it verifiable independently of the rest of the page and to enable reuse across multiple pages. Modular layout proofs scale verification to pages an order of magnitude larger than those supported by previous approaches. We prototyped these techniques in a new proof assistant, Troika. In Troika, a proof author partitions a page into components and writes specifications for them. Troika then verifies the specifications and uses them to verify whole-page properties. Troika also enables the proof author to verify different component specifications with different verification tools, leveraging the strengths of each. In a case study, we use Troika to verify a large web page and demonstrate a speed-up of 13--1469x over existing tools, taking verification time from hours to seconds. We develop a systematic approach to writing Troika proofs and demonstrate it on 8 proofs of properties from prior work to show that modular layout proofs are short, easy to write, and provide benefits over existing tools.
- Mobile web browsing remains slow despite many efforts to accelerate page loads. Like others, we find that client-side computation (in particular, JavaScript execution) is a key culprit. Prior solutions to mitigate computation overheads, however, suffer from security, privacy, and deployability issues, hindering their adoption. To sidestep these issues, we propose a browser-based solution in which every client reuses identical computations from its prior page loads. Our analysis across roughly 230 pages reveals that, even on a modern smartphone, such an approach could reduce client-side computation by a median of 49% on pages which are most in need of such optimizations.
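The last item above proposes reusing identical computations across a client's own page loads. As a loose, hypothetical illustration of that idea (not the paper's mechanism), one can think of it as memoizing a computation's result under a stable key and persisting the cache across loads, for example in browser storage:

```typescript
// Hypothetical cache of prior results, keyed by a stable identifier for the
// computation and its inputs; a real system would persist it across loads.
const resultCache = new Map<string, unknown>();

function reuseOrCompute<T>(key: string, compute: () => T): T {
  if (resultCache.has(key)) {
    return resultCache.get(key) as T; // identical computation: reuse the result
  }
  const value = compute();            // first load: do the work once
  resultCache.set(key, value);
  return value;
}

// Example: an expensive, deterministic computation reused on later loads.
const widths = reuseOrCompute("computeColumnWidths:v1", () =>
  Array.from({ length: 1000 }, (_, i) => Math.sqrt(i))
);
```

The first related item above estimates each page's mean update frequency from archived snapshots via maximum likelihood. As a hedged sketch only (assuming page changes follow a Poisson process and that every change between consecutive snapshots is detected, which real estimators cannot assume), the estimate reduces to observed changes divided by observation time:

```typescript
// Hedged illustration: under the simplifying assumptions above, the maximum
// likelihood estimate of the mean update rate is
//   rate = (number of observed changes) / (total observation time).
// Real estimators must handle the fact that snapshots only reveal whether a
// page changed at least once between observations, which this sketch ignores.
function estimateUpdateRate(
  snapshotTimes: number[], // snapshot timestamps in days, ascending
  changed: boolean[]       // changed[i]: page differed between snapshots i and i+1
): number {
  const observedChanges = changed.filter(Boolean).length;
  const observationDays =
    snapshotTimes[snapshotTimes.length - 1] - snapshotTimes[0];
  return observedChanges / observationDays; // changes per day
}
```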