Title: Quantifying the Systematic Bias in the Accessibility and Inaccessibility of Web Scraping Content From URL-Logged Web-Browsing Digital Trace Data
Social scientists and computer scientists increasingly collect observational digital trace data and analyze these data post hoc to understand the content people are exposed to online. However, these content collection efforts may be systematically biased when the entirety of the data cannot be captured retroactively. We call this often unstated assumption the problematic assumption of accessibility. To examine the extent to which this assumption may be problematic, we identify 107k hard news and misinformation web pages visited by a representative panel of 1,238 American adults and record the degree to which the web pages individuals visited were accessible (successful web scrapes) or inaccessible (unsuccessful scrapes). While we find that the collected URLs are largely accessible, with unrestricted content, there are systematic biases in which URLs are restricted, return an error, or are inaccessible. For example, conservative misinformation URLs are more likely to be inaccessible than other types of misinformation. We suggest how social scientists should capture and report digital trace and web scraping data.
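The abstract distinguishes scrape attempts that are accessible, restricted, erroring, or inaccessible. A minimal sketch of how such a classification could be recorded is below; the category names, HTTP status mapping, and `blocked` flag are illustrative assumptions for exposition, not the authors' actual coding scheme.

```python
from enum import Enum

class ScrapeOutcome(Enum):
    ACCESSIBLE = "accessible"      # content retrieved without restriction
    RESTRICTED = "restricted"      # paywall, login wall, or bot block
    ERROR = "error"                # server returned an error status
    INACCESSIBLE = "inaccessible"  # no response: DNS failure, timeout, dead host

def classify(status_code, blocked=False):
    """Map an HTTP status (or the absence of one) to a coarse outcome.

    status_code: int HTTP status, or None when no response was received.
    blocked: True when the page loaded but content was gated (heuristic).
    """
    if status_code is None:
        return ScrapeOutcome.INACCESSIBLE
    if blocked or status_code in (401, 403, 451):
        return ScrapeOutcome.RESTRICTED
    if status_code >= 400:
        return ScrapeOutcome.ERROR
    return ScrapeOutcome.ACCESSIBLE
```

Logging one such outcome per visited URL at scrape time is what makes the systematic biases described above measurable afterward.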
Award ID(s):
2120098
PAR ID:
10527610
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Social Science Computer Review
Date Published:
Journal Name:
Social Science Computer Review
ISSN:
0894-4393
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Programming-by-demonstration (PBD) makes it possible to create web scraping macros without writing code. However, it can still be challenging for users to understand the exact scraping behavior that is inferred and to verify that the scraped data is correct, especially when scraping occurs across multiple pages. We present ScrapeViz, a new PBD tool for authoring and visualizing hierarchical web scraping macros. ScrapeViz’s key novelty is in providing a visual representation of web scraping macros: the sequences of pages visited, generalized scraping behavior across similar pages, and data provenance. We conducted a lab study with 12 participants comparing ScrapeViz to the existing web scraping tool Rousillon and saw that participants found ScrapeViz helpful for understanding high-level scraping behavior, tracing the source of scraped data, identifying anomalies, and validating macros while authoring.
  2. Legal jurisdictions around the world require organisations to post privacy policies on their websites. However, in spite of laws such as GDPR and CCPA reinforcing this requirement, organisations sometimes do not comply, and a variety of semi-compliant failure modes exist. To investigate the landscape of web privacy policies, we crawl the privacy policies from 7 million organisation websites with the goal of identifying when policies are unavailable. We conduct a large-scale investigation of the availability of privacy policies and identify potential reasons for unavailability such as dead links, documents with empty content, documents that consist solely of placeholder text, and documents unavailable in the specific languages offered by their respective websites. We estimate the frequencies of these failure modes and the overall unavailability of privacy policies on the web, and find that privacy policy URLs are available on only 34% of websites. Further, 1.37% of these URLs are broken links and 1.23% of the valid links lead to pages without a policy. In addition, to enable investigation of privacy policies at scale, we use the capture-recapture technique to estimate the total number of English language privacy policies on the web and the distribution of these documents across top level domains and sectors of commerce. We estimate the lower bound on the number of English language privacy policies to be around 3 million. Finally, we release the CoLIPPs Corpus containing around 600k policies and their metadata consisting of policy URL, length, readability, sector of commerce, and policy crawl date.
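The capture-recapture estimate mentioned above is, in its simplest form, the Lincoln-Petersen estimator: if two independent samples of sizes n1 and n2 share m items, the total population is estimated as n1 * n2 / m. A small sketch with made-up numbers (not the paper's actual crawl sizes) follows.

```python
def lincoln_petersen(n1, n2, m):
    """Estimate total population size from two independent samples.

    n1: size of the first sample, n2: size of the second sample,
    m: number of items observed in both samples (the recaptures).
    """
    if m == 0:
        raise ValueError("no overlap between samples; estimate is undefined")
    return n1 * n2 / m

# Illustrative only: two crawls finding 600,000 and 500,000 policies
# with 100,000 policies in common would imply roughly 3 million total.
estimate = lincoln_petersen(600_000, 500_000, 100_000)
```

The key assumption is that the two crawls sample policies independently; violations (e.g., both crawls favoring popular domains) bias the estimate downward, which is one reason the abstract frames its 3 million figure as a lower bound.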
  3. There is a rich body of literature on measuring and optimizing nearly every aspect of the web, including characterizing the structure and content of web pages, devising new techniques to load pages quickly, and evaluating such techniques. Virtually all of this prior work used a single page, namely the landing page (i.e., root document, "/"), of each web site as the representative of all pages on that site. In this paper, we characterize the differences between landing and internal (i.e., non-root) pages of 1000 web sites to demonstrate that the structure and content of internal pages differ substantially from those of landing pages, as well as from one another. We review more than a hundred studies published at top-tier networking conferences between 2015 and 2019, and highlight how, in light of these differences, the insights and claims of nearly two-thirds of the relevant studies would need to be revised for them to apply to internal pages. Going forward, we urge the networking community to include internal pages for measuring and optimizing the web. This recommendation, however, poses a non-trivial challenge: How do we select a set of representative internal web pages from a web site? To address the challenge, we have developed Hispar, a "top list" of 100,000 pages updated weekly comprising both the landing pages and internal pages of around 2000 web sites. We make Hispar and the tools to recreate or customize it publicly available. 
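Selecting internal pages, as the abstract notes, first requires discovering them. One simple way, sketched here with Python's standard library, is to extract same-host, non-root links from a site's landing page HTML; Hispar's actual selection methodology may differ, and this only illustrates the discovery step.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class InternalLinkCollector(HTMLParser):
    """Collect links on the same host as the base URL, excluding the root page."""

    def __init__(self, base_url):
        super().__init__()
        self.base = base_url
        self.host = urlparse(base_url).netloc
        self.internal = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base, href)
        parsed = urlparse(absolute)
        # Keep non-root pages served from the same host.
        if parsed.netloc == self.host and parsed.path not in ("", "/"):
            self.internal.add(absolute)

# Example with an inline HTML snippet (no network access needed):
html = '<a href="/about">About</a> <a href="https://other.example/x">ext</a> <a href="/">home</a>'
collector = InternalLinkCollector("https://example.com/")
collector.feed(html)
```

Here `collector.internal` retains only the same-host, non-root link; external links and the landing page itself are filtered out.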
  4. Navigating unfamiliar websites is challenging for users with visual impairments. Although many websites offer visual cues to facilitate access to pages/features that most websites are expected to have (e.g., log in at the top right), such visual shortcuts are not accessible to users with visual impairments. Moreover, although such pages serve the same functionality across websites (e.g., to log in, to sign up), the location, wording, and navigation path of links to these pages vary from one website to another. Such inconsistencies are challenging for users with visual impairments, especially for users of screen readers, who often need to linearly listen to the content of pages to figure out how to access certain website features. To study how to improve access to main website features, we iteratively designed and tested a command-based approach for main features of websites via a browser extension powered by machine learning and human input. The browser extension gives users a way to access high-level website features (e.g., log in, find stores, contact) via keyboard commands. We tested the browser extension in a lab setting with 15 Internet users, including 9 users with visual impairments and 6 without. Our study showed that commands for main website features can greatly improve the experience of users with visual impairments. People without visual impairments also found command-based access helpful when visiting unfamiliar, cluttered, or infrequently visited websites, suggesting that this approach can support users with visual impairments while also benefiting other user groups (i.e., universal design). Our study reveals concerns about the handling of unsupported commands and the availability and trustworthiness of human input.
We discuss how websites, browsers, and assistive technologies could incorporate a command-based paradigm to enhance web accessibility and provide more consistency on the web to benefit users with varied abilities when navigating unfamiliar or complex websites. 
  5. By mid-2023, the international GEOTRACES program had released three intermediate data products (IDP2014, IDP2017, and IDP2021), and in July 2023, an update of the latest product, IDP2021v2, was issued. All IDPs consist of two parts: (1) a compilation of digital data for large numbers of trace elements and isotopes (TEIs), and (2) the eGEOTRACES Electronic Atlas containing almost 1,500 pre-created section plots and 269 animated three-dimensional scenes that can be browsed via an interactive web interface. GEOTRACES IDPs are used extensively and have proven to be rich resources for research, education, and outreach. Here, we demonstrate how these resources can be used efficiently and effectively via online services. Data browsing, analysis, and visualization occur in the user’s web browser, with the IDP data remaining on a dedicated server. Users simply visit specific resource URLs to access eGEOTRACES visuals and the GEOTRACES digital data directly. We first demonstrate how to navigate the eGEOTRACES Electronic Atlas to view TEI sections and three-dimensional animations. We then focus on two research use cases and provide detailed hands-on instructions for creating publication-ready figures related to the marine Zn cycle.