Web pages today commonly include large amounts of JavaScript code in order to offer users a dynamic experience. These scripts often make pages slow to load, partly due to a fundamental inefficiency in how browsers process JavaScript content: browsers make it easy for web developers to reason about page state by serially executing all scripts on any frame in a page, but as a result, fail to leverage the multiple CPU cores that are readily available even on low-end phones. In this paper, we show how to address this inefficiency without requiring pages to be rewritten or browsers to be modified. The key to our solution, Horcrux, is to account for the non-determinism intrinsic to web page loads and the constraints placed by the browser’s API for parallelism. Horcrux-compliant web servers perform offline analysis of all the JavaScript code on any frame they serve to conservatively identify, for every JavaScript function, the union of the page state that the function could access across all loads of that page. Horcrux’s JavaScript scheduler then uses this information to judiciously parallelize JavaScript execution on the client-side so that the end-state is identical to that of a serial execution, while minimizing coordination and offloading overheads. Across a wide range of pages, phones, and mobile networks covering web workloads in both developed and emerging regions, Horcrux reduces median browser computation delays by 31-44% and page load times by 18-37%. 
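As a rough illustration of the scheduling idea described in the abstract above (not Horcrux's actual implementation), the sketch below assumes each JavaScript function comes annotated with the page state it may read or write, as produced by offline analysis; two functions may run in the same parallel batch only if neither writes state the other touches. All names here (`StateSummary`, `conflicts`, `schedule`) are hypothetical.

```typescript
// Hypothetical per-function summary produced by offline analysis:
// the union of page state a function may read or write across loads.
interface StateSummary {
  id: string;
  reads: Set<string>;
  writes: Set<string>;
}

// Two functions conflict if either writes state the other reads or writes.
function conflicts(a: StateSummary, b: StateSummary): boolean {
  const touches = (s: Set<string>, t: Set<string>) =>
    [...s].some((x) => t.has(x));
  return (
    touches(a.writes, b.reads) ||
    touches(a.writes, b.writes) ||
    touches(b.writes, a.reads)
  );
}

// Greedy scheduler: walk functions in their original (serial) order and
// group consecutive non-conflicting ones into batches that could be
// offloaded to parallel workers; batch order preserves serial end-state.
function schedule(fns: StateSummary[]): StateSummary[][] {
  const batches: StateSummary[][] = [];
  for (const fn of fns) {
    const last = batches[batches.length - 1];
    if (last && last.every((other) => !conflicts(fn, other))) {
      last.push(fn);      // safe to run alongside the current batch
    } else {
      batches.push([fn]); // a conflict forces a new batch
    }
  }
  return batches;
}
```

A real scheduler would additionally respect the browser's constraint that DOM access happens only on the main thread and handle dynamically generated code; this sketch shows only the conflict check.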
Toward Better Efficiency vs. Fidelity Tradeoffs in Web Archives

This content will become publicly available on October 28, 2026.
            Operators of web archives have two options for how to crawl pages from the web. Browser-based dynamic crawlers capture all of the resources on every page, but incur high compute overheads. Static browserless crawlers are more lightweight, but miss page resources which are fetched only when scripts are executed. In this paper, we make the case that a web archive does not have to make a binary choice between dynamic or static crawling. Instead, by using a browser for a carefully chosen small subset of crawls, an archive can significantly improve its ability to serve statically crawled pages with high fidelity. First, we show how to reuse crawled resources, both across pages and across multiple crawls of the same page over time. Second, by leveraging a dynamic crawl of a page, we show that subsequent static crawls of the page can be augmented to fetch resources without executing the scripts which request them. We estimate that, as long as 8.9% of page crawls use a browser, an archive can serve roughly 99% of the remaining statically crawled pages without any loss in fidelity, up from 55% without our techniques. 
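To make the second technique above concrete, here is a minimal, hypothetical sketch of how an archive might record the script-fetched URLs observed during one browser-based crawl of a page and replay those fetches during later static crawls, without executing any scripts. The manifest format and function names are assumptions for illustration, not the paper's implementation, and the code assumes a runtime with a global `fetch`.

```typescript
// URLs fetched by scripts during a single browser-based (dynamic) crawl,
// keyed by page URL. A real archive would persist this alongside the crawl data.
type DynamicManifest = Map<string, string[]>;

// Plain static fetch of a page's HTML (stub: a real crawler would also parse
// the HTML for statically referenced resources such as images and stylesheets).
async function staticCrawl(pageUrl: string): Promise<Map<string, ArrayBuffer>> {
  const body = await (await fetch(pageUrl)).arrayBuffer();
  return new Map([[pageUrl, body]]);
}

// Augmented static crawl: also fetch the script-requested resources that a
// previous dynamic crawl of the same page observed, skipping script execution.
async function augmentedStaticCrawl(
  pageUrl: string,
  manifest: DynamicManifest
): Promise<Map<string, ArrayBuffer>> {
  const archive = await staticCrawl(pageUrl);
  for (const url of manifest.get(pageUrl) ?? []) {
    try {
      archive.set(url, await (await fetch(url)).arrayBuffer());
    } catch {
      // The resource may have disappeared; a real crawler would log and retry.
    }
  }
  return archive;
}
```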
- Award ID(s): 2403432
- PAR ID: 10641471
- Publisher / Repository: ACM SIGCOMM Internet Measurement Conference
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors' homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA) and estimate when their actual updates occurred. Next, we apply maximum likelihood estimation to derive their mean update frequency values (a simplified sketch of this estimation appears after this list). Our evaluation shows that update-frequency values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimates of updates at a fraction of the resources required by the baseline models. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and we uncover important challenges that inspire future research directions.
- By repeatedly crawling and saving web pages over time, web archives (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the widespread presence of JavaScript on the web. Some archives store petabytes of JavaScript code, and yet many pages render incorrectly when users load them. Other archives which store the end-state of page loads (e.g., screen captures) break post-load interactions implemented in JavaScript. To address these problems, we present Jawa, a new design for web archives which significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to enabling Jawa's use at scale are our observations on a) the forms of non-determinism which impair the execution of JavaScript on archived pages, and b) the ways in which JavaScript's execution fundamentally differs between live web pages and their archived copies. On a corpus of 1 million archived pages, Jawa reduces overall storage needs by 41%, when compared to the techniques currently used by the Internet Archive.
- Automated verification can ensure that a web page satisfies accessibility, usability, and design properties regardless of the end user's device, preferences, and assistive technologies. However, state-of-the-art verification tools for layout properties do not scale to large pages because they rely on whole-page analyses and must reason about the entire page using the complex semantics of the browser layout algorithm. This paper introduces and formalizes modular layout proofs. A modular layout proof splits a monolithic verification problem into smaller verification problems, one for each component of a web page. Each component specification can use rely/guarantee-style preconditions to make it verifiable independently of the rest of the page and to enable reuse across multiple pages. Modular layout proofs scale verification to pages an order of magnitude larger than those supported by previous approaches. We prototyped these techniques in a new proof assistant, Troika. In Troika, a proof author partitions a page into components and writes specifications for them. Troika then verifies the specifications and uses them to verify whole-page properties. Troika also enables the proof author to verify different component specifications with different verification tools, leveraging the strengths of each. In a case study, we use Troika to verify a large web page and demonstrate a speed-up of 13--1469x over existing tools, taking verification time from hours to seconds. We develop a systematic approach to writing Troika proofs and demonstrate it on 8 proofs of properties from prior work to show that modular layout proofs are short, easy to write, and provide benefits over existing tools.
- Mobile web browsing remains slow despite many efforts to accelerate page loads. Like others, we find that client-side computation (in particular, JavaScript execution) is a key culprit. Prior solutions to mitigate computation overheads, however, suffer from security, privacy, and deployability issues, hindering their adoption. To sidestep these issues, we propose a browser-based solution in which every client reuses identical computations from its prior page loads. Our analysis across roughly 230 pages reveals that, even on a modern smartphone, such an approach could reduce client-side computation by a median of 49% on pages which are most in need of such optimizations.
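The last item above proposes reusing identical computations across a client's own page loads. As a loose, hypothetical illustration of that idea (not the paper's mechanism), one can think of it as memoizing a computation's result under a stable key and persisting the cache across loads, for example in browser storage:

```typescript
// Hypothetical cache of prior results, keyed by a stable identifier for the
// computation and its inputs; a real system would persist it across loads.
const resultCache = new Map<string, unknown>();

function reuseOrCompute<T>(key: string, compute: () => T): T {
  if (resultCache.has(key)) {
    return resultCache.get(key) as T; // identical computation: reuse the result
  }
  const value = compute();            // first load: do the work once
  resultCache.set(key, value);
  return value;
}

// Example: an expensive, deterministic computation reused on later loads.
const widths = reuseOrCompute("computeColumnWidths:v1", () =>
  Array.from({ length: 1000 }, (_, i) => Math.sqrt(i))
);
```

The first related item above estimates each page's mean update frequency from archived snapshots via maximum likelihood. As a hedged sketch only (assuming page changes follow a Poisson process and that every change between consecutive snapshots is detected, which real estimators cannot assume), the estimate reduces to observed changes divided by observation time:

```typescript
// Hedged illustration: under the simplifying assumptions above, the maximum
// likelihood estimate of the mean update rate is
//   rate = (number of observed changes) / (total observation time).
// Real estimators must handle the fact that snapshots only reveal whether a
// page changed at least once between observations, which this sketch ignores.
function estimateUpdateRate(
  snapshotTimes: number[], // snapshot timestamps in days, ascending
  changed: boolean[]       // changed[i]: page differed between snapshots i and i+1
): number {
  const observedChanges = changed.filter(Boolean).length;
  const observationDays =
    snapshotTimes[snapshotTimes.length - 1] - snapshotTimes[0];
  return observedChanges / observationDays; // changes per day
}
```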