Title: On Landing and Internal Web Pages: The Strange Case of Jekyll and Hyde in Web Performance Measurement
There is a rich body of literature on measuring and optimizing nearly every aspect of the web, including characterizing the structure and content of web pages, devising new techniques to load pages quickly, and evaluating such techniques. Virtually all of this prior work used a single page, namely the landing page (i.e., root document, "/"), of each web site as the representative of all pages on that site. In this paper, we characterize the differences between landing and internal (i.e., non-root) pages of 1000 web sites to demonstrate that the structure and content of internal pages differ substantially from those of landing pages, as well as from one another. We review more than a hundred studies published at top-tier networking conferences between 2015 and 2019, and highlight how, in light of these differences, the insights and claims of nearly two-thirds of the relevant studies would need to be revised for them to apply to internal pages. Going forward, we urge the networking community to include internal pages for measuring and optimizing the web. This recommendation, however, poses a non-trivial challenge: How do we select a set of representative internal web pages from a web site? To address the challenge, we have developed Hispar, a "top list" of 100,000 pages updated weekly comprising both the landing pages and internal pages of around 2000 web sites. We make Hispar and the tools to recreate or customize it publicly available.
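The selection step a list like Hispar must perform, picking internal pages for a site, can be sketched as follows. This is an illustrative stand-in, not Hispar's actual methodology: `internal_pages`, the link limit, and the sample HTML are hypothetical, and a real crawler would fetch the landing page over the network rather than take its HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_pages(site, landing_html, limit=3):
    """Return up to `limit` distinct same-site internal links found on the landing page."""
    parser = LinkExtractor()
    parser.feed(landing_html)
    host = urlparse(site).netloc
    seen, pages = set(), []
    for href in parser.links:
        url = urljoin(site, href)
        parsed = urlparse(url)
        # Keep only links on the same host that are not the root document.
        if parsed.netloc == host and parsed.path not in ("", "/") and url not in seen:
            seen.add(url)
            pages.append(url)
            if len(pages) >= limit:
                break
    return pages

html = '<a href="/news/today">News</a> <a href="https://other.example/x">Ext</a> <a href="/about">About</a>'
print(internal_pages("https://example.com", html))
# → ['https://example.com/news/today', 'https://example.com/about']
```

A production crawler would also need to deduplicate URL variants, respect robots.txt, and sample rather than take the first links in document order.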
Award ID(s): 1763742
NSF-PAR ID: 10318921
Journal Name: IMC '20: Proceedings of the ACM Internet Measurement Conference
Sponsoring Org: National Science Foundation
More Like This
  1. Web pages today commonly include large amounts of JavaScript code in order to offer users a dynamic experience. These scripts often make pages slow to load, partly due to a fundamental inefficiency in how browsers process JavaScript content: browsers make it easy for web developers to reason about page state by serially executing all scripts on any frame in a page, but as a result, fail to leverage the multiple CPU cores that are readily available even on low-end phones. In this paper, we show how to address this inefficiency without requiring pages to be rewritten or browsers to be modified. The key to our solution, Horcrux, is to account for the non-determinism intrinsic to web page loads and the constraints placed by the browser’s API for parallelism. Horcrux-compliant web servers perform offline analysis of all the JavaScript code on any frame they serve to conservatively identify, for every JavaScript function, the union of the page state that the function could access across all loads of that page. Horcrux’s JavaScript scheduler then uses this information to judiciously parallelize JavaScript execution on the client-side so that the end-state is identical to that of a serial execution, while minimizing coordination and offloading overheads. Across a wide range of pages, phones, and mobile networks covering web workloads in both developed and emerging regions, Horcrux reduces median browser computation delays by 31-44% and page load times by 18-37%. 
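The serial-equivalence idea behind this kind of parallelization can be illustrated with a toy scheduler: given functions in program order, each annotated with a (conservatively overestimated) set of page state it may touch, batch together adjacent functions whose state sets are disjoint. Members of a batch could then run concurrently while batches run in order, so the end state matches a serial execution. This is a sketch of the general idea under simplified assumptions, not Horcrux's actual algorithm; `schedule_batches` and the state labels are hypothetical.

```python
def schedule_batches(funcs):
    """Group functions (in program order) into batches of pairwise
    non-conflicting functions; batches execute in order."""
    batches, current, footprint = [], [], set()
    for name, state in funcs:
        if footprint & state:          # conflicts with current batch: start a new one
            batches.append(current)
            current, footprint = [], set()
        current.append(name)           # disjoint from everything in the batch
        footprint |= state
    if current:
        batches.append(current)
    return batches

funcs = [("render", {"dom.header"}), ("ads", {"dom.sidebar"}),
         ("analytics", {"dom.header", "cookies"}), ("widget", {"dom.footer"})]
print(schedule_batches(funcs))
# → [['render', 'ads'], ['analytics', 'widget']]
```

Because each function is added to a batch only when its state set is disjoint from the batch's accumulated footprint, functions within a batch are pairwise conflict-free; the conservatism of the offline state analysis is what makes this safe despite non-determinism in page loads.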
  2. Advertisements have become commonplace on modern websites. While ads are typically designed for visual consumption, it is unclear how they affect blind users who interact with the ads using a screen reader. Existing research studies on non-visual web interaction predominantly focus on general web browsing; the specific impact of extraneous ad content on blind users’ experience remains largely unexplored. To fill this gap, we conducted an interview study with 18 blind participants; we found that blind users are often deceived by ads that contextually blend in with the surrounding web page content. While ad blockers can address this problem via a blanket filtering operation, many websites are increasingly denying access if an ad blocker is active. Moreover, ad blockers often do not filter out internal ads injected by the websites themselves. Therefore, we devised an algorithm to automatically identify contextually deceptive ads on a web page. Specifically, we built a detection model that leverages a multi-modal combination of handcrafted and automatically extracted features to determine if a particular ad is contextually deceptive. Evaluations of the model on a representative test dataset and ‘in-the-wild’ random websites yielded F1 scores of 0.86 and 0.88, respectively. 
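The shape of such a detection model can be sketched as a weighted combination of handcrafted features. This is a hypothetical simplification: the feature names, weights, and threshold below are invented for illustration, and the paper's actual model also uses automatically extracted (e.g., visual and textual) features with a learned classifier rather than fixed weights.

```python
def feature_vector(ad):
    """Hypothetical handcrafted signals that an ad blends into the page."""
    return [
        1.0 if ad.get("matches_page_font") else 0.0,   # styled like surrounding content
        1.0 if ad.get("lacks_ad_label") else 0.0,      # no visible "Ad"/"Sponsored" label
        ad.get("text_overlap_with_page", 0.0),         # lexical similarity in [0, 1]
    ]

def is_deceptive(ad, weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Score an ad and flag it as contextually deceptive above a threshold."""
    score = sum(w * f for w, f in zip(weights, feature_vector(ad)))
    return score >= threshold

ad = {"matches_page_font": True, "lacks_ad_label": True, "text_overlap_with_page": 0.8}
print(is_deceptive(ad))  # → True
```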
  3. By repeatedly crawling and saving web pages over time, web archives (such as the Internet Archive) enable users to visit historical versions of any page. In this paper, we point out that existing web archives are not well designed to cope with the widespread presence of JavaScript on the web. Some archives store petabytes of JavaScript code, and yet many pages render incorrectly when users load them. Other archives which store the end-state of page loads (e.g., screen captures) break post-load interactions implemented in JavaScript. To address these problems, we present Jawa, a new design for web archives which significantly reduces the storage necessary to save modern web pages while also improving the fidelity with which archived pages are served. Key to enabling Jawa’s use at scale are our observations on a) the forms of non-determinism which impair the execution of JavaScript on archived pages, and b) the ways in which JavaScript’s execution fundamentally differs between live web pages and their archived copies. On a corpus of 1 million archived pages, Jawa reduces overall storage needs by 41%, when compared to the techniques currently used by the Internet Archive. 
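One of the observations above, that certain JavaScript behaves non-deterministically when replayed from an archive, can be illustrated with a crude token-based flagger. This is only a sketch of the concern, not Jawa's actual analysis; the token list and function name are hypothetical, and a real system would analyze the code rather than match substrings.

```python
# Tokens whose presence suggests a script may behave differently when
# replayed from an archive than it did on the live page (illustrative list).
NONDETERMINISTIC_TOKENS = ("Math.random", "Date.now", "new Date(",
                           "XMLHttpRequest", "fetch(")

def may_diverge_on_replay(script_source):
    """Flag scripts that reference common sources of non-determinism."""
    return any(tok in script_source for tok in NONDETERMINISTIC_TOKENS)

print(may_diverge_on_replay("var id = Math.random();"))  # → True
print(may_diverge_on_replay("document.title = 'hi';"))   # → False
```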
  4. Summary

    Companies often employ internationalization (i18n) frameworks to provide translated text and localized media content on their websites in order to communicate effectively with a global audience. However, the varying lengths of text in different languages can cause undesired distortions in a web page's layout. Such distortions, called Internationalization Presentation Failures (IPFs), can harm the aesthetics or usability of a website. Most existing automated techniques for assisting with IPF repair either produce fixes that significantly reduce the legibility and attractiveness of the page, or are limited to detecting IPFs, leaving the actual repair as a labour-intensive manual task. To address this problem, we propose a search-based technique for automatically repairing IPFs in web applications while preserving a legible and attractive page. In an empirical evaluation, our approach successfully resolved 94% of the detected IPFs across 46 real-world web pages. In a user study, participants rated the visual quality of our fixes significantly higher than that of the unfixed versions, and considered the repairs generated by our approach notably more legible and visually appealing than those generated by existing techniques.

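The flavor of a search-based repair can be sketched with a toy hill-descent over one CSS property: shrink the font size until the translated text fits its container, preferring the largest (most legible) size that fits. Everything here is hypothetical, including the function names and the crude glyph-width model; the paper's actual technique searches over richer layout changes and optimizes for legibility and attractiveness together.

```python
def overflow(width_px, text, char_px):
    """How far the rendered text exceeds the container (toy width model)."""
    return max(0, len(text) * char_px - width_px)

def repair_font_size(text, width_px, start_px=16, min_px=10):
    """Return the largest font size (down to a legibility floor) at which
    the text fits -- a toy stand-in for search-based IPF repair."""
    for size in range(start_px, min_px - 1, -1):
        char_px = size * 0.6           # rough average glyph width
        if overflow(width_px, text, char_px) == 0:
            return size
    return min_px                      # floor: never shrink below legibility

print(repair_font_size("Internationalisierung", 160))  # → 12
```

The legibility floor (`min_px`) reflects the paper's emphasis: a repair that merely eliminates the overflow by shrinking text arbitrarily would defeat the purpose.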
  5. Web records are structured data on a Web page: records retrieved from an underlying database and rendered according to templates. Mining data records on the Web enables integrating data from multiple Web sites to provide value-added services. Most existing work on Web record extraction makes two key assumptions: (1) records are retrieved from databases with uniform schemas, and (2) records are displayed in a linear structure on a Web page. These assumptions no longer hold on the modern Web. A Web page may present records of diverse entity types with different schemas, and may organize records hierarchically, in nested structures, to show richer relationships among them. In this paper, we revisit these assumptions and modify them to reflect pages on the modern Web. Based on the reformulated assumptions, we introduce the concept of an invariant in Web data records and propose Miria (Mining record invariant), a bottom-up, recursive approach that constructs Web records from the invariants. The proposed approach is both effective and efficient, consistently outperforming state-of-the-art Web record extraction methods on modern Web pages.
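The intuition behind template-based record mining can be sketched in miniature: among the children of a listing container, the repeated structural "shape" is the record template, and nodes matching it are the records. This is a toy illustration, not Miria's actual bottom-up, recursive algorithm; the node representation and function names are hypothetical.

```python
from collections import Counter

def record_signature(node):
    """A node's shallow shape: its tag plus the tags of its children.
    Nodes are represented as (tag, [child, ...]) tuples for this sketch."""
    tag, children = node
    return (tag, tuple(child[0] for child in children))

def extract_records(siblings):
    """Treat the most frequent sibling signature as the record invariant
    and return the nodes that match it."""
    sigs = [record_signature(n) for n in siblings]
    invariant, _ = Counter(sigs).most_common(1)[0]
    return [n for n, s in zip(siblings, sigs) if s == invariant]

listing = [
    ("div", [("img", []), ("span", [])]),   # product card
    ("div", [("img", []), ("span", [])]),   # product card
    ("div", [("a", [])]),                   # "see more" link, not a record
    ("div", [("img", []), ("span", [])]),   # product card
]
print(len(extract_records(listing)))  # → 3
```

Handling the modern Web's nested, mixed-schema listings, as the paper does, requires applying this kind of invariant detection recursively rather than over one flat sibling list.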