Social scientists and computer scientists are increasingly using observational digital trace data and analyzing these data post hoc to understand the content people are exposed to online. However, these content collection efforts may be systematically biased when the entirety of the data cannot be captured retroactively. We call this often unstated assumption, that content remains collectable after the fact, the problematic assumption of accessibility. To examine the extent to which this assumption may be problematic, we identify 107k hard news and misinformation web pages visited by a representative panel of 1,238 American adults and record whether each web page individuals visited was accessible (a successful web scrape) or inaccessible (an unsuccessful scrape). While the URLs collected are largely accessible with unrestricted content, we find systematic biases in which URLs are restricted, return an error, or are inaccessible. For example, conservative misinformation URLs are more likely to be inaccessible than other misinformation URLs. We suggest how social scientists should capture and report digital trace and web scraping data.
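As a rough illustration of the kind of post hoc accessibility audit described above (a sketch under assumptions, not the authors' actual pipeline), the snippet below fetches each collected URL and buckets the response into the categories the abstract mentions: accessible, restricted, error, or inaccessible. The status-code buckets and function names are assumptions.

```typescript
// Hypothetical accessibility audit; the bucketing rules are assumptions,
// not the study's actual coding scheme.
type Access = "accessible" | "restricted" | "error" | "inaccessible";

async function classifyUrl(url: string): Promise<Access> {
  try {
    const res = await fetch(url, { redirect: "follow" });
    if (res.status === 401 || res.status === 403 || res.status === 451) {
      return "restricted"; // login walls, blocked scrapers, legal takedowns
    }
    if (!res.ok) {
      return "error"; // 404/410 pages, 5xx server failures
    }
    return "accessible";
  } catch {
    return "inaccessible"; // DNS failure, timeout, dead domain
  }
}

// Tally how much of a URL list can still be scraped retroactively.
async function audit(urls: string[]): Promise<Record<Access, number>> {
  const counts: Record<Access, number> = {
    accessible: 0,
    restricted: 0,
    error: 0,
    inaccessible: 0,
  };
  for (const url of urls) {
    counts[await classifyUrl(url)] += 1;
  }
  return counts;
}
```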
ScrapeViz: Hierarchical Representations for Web Scraping Macros
Programming-by-demonstration (PBD) makes it possible to create web scraping macros without writing code. However, it can still be challenging for users to understand the exact scraping behavior that is inferred and to verify that the scraped data is correct, especially when scraping occurs across multiple pages. We present ScrapeViz, a new PBD tool for authoring and visualizing hierarchical web scraping macros. ScrapeViz’s key novelty is in providing a visual representation of web scraping macros: the sequences of pages visited, generalized scraping behavior across similar pages, and data provenance. We conducted a lab study with 12 participants comparing ScrapeViz to the existing web scraping tool Rousillon and saw that participants found ScrapeViz helpful for understanding high-level scraping behavior, tracing the source of scraped data, identifying anomalies, and validating macros while authoring.
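One way to picture the hierarchical macros ScrapeViz visualizes is as a tree of page visits paired with a generalized rule per level of similar pages; the interfaces below are an illustrative sketch under that assumption, not ScrapeViz's actual data model.

```typescript
// Illustrative sketch of a hierarchical scraping macro and its trace;
// all field names are assumptions for exposition.
interface ScrapedValue {
  value: string;
  sourceUrl: string; // provenance: which page the value came from
  selector: string;  // provenance: where on that page it was found
}

interface PageVisit {
  url: string;
  extracted: ScrapedValue[];
  children: PageVisit[]; // pages reached by following links scraped here
}

interface ScrapingMacro {
  seedUrl: string;
  // One generalized rule per level of similar pages, e.g. a listing page
  // and the detail pages linked from each of its rows.
  levels: { pagePattern: string; selectors: string[] }[];
  trace: PageVisit; // concrete hierarchy of visits, the basis for visualization
}
```

Keeping per-value provenance in the trace is what would let a visualization trace any cell of scraped data back to the page and element it came from.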
- Award ID(s): 2007857
- PAR ID: 10570317
- Publisher / Repository: IEEE
- Date Published:
- ISBN: 979-8-3503-6613-6
- Page Range / eLocation ID: 300 to 305
- Format(s): Medium: X
- Location: Liverpool, United Kingdom
- Sponsoring Org: National Science Foundation
More Like this
- Academic mobility has accelerated in part due to recent civil rights movements and higher levels of social mobility. This trend increases the threat of brain drain from Historically Black Colleges and Universities (HBCUs), which already face significant logistical challenges despite broad success in the advancement of Black professionals. We aim to examine this threat from a Science of Science perspective by collecting diachronic data for a large-scale longitudinal analysis of HBCU faculty’s academic mobility. Our study uses Memento, manual collection, and web scraping to aggregate historical identifiers (URI-Ms) of webpages from 35 HBCUs across multiple web archives. We are thus able to extend the use of “canonicalization” to associate past versions of webpages that resided at different URIs with their current URI, allowing for a more accurate view of the pages over time. In this paper, we define and execute a novel data collection method which is essential for our examination of HBCU human capital changes and supports a movement towards a more equitable academic workforce. (A minimal sketch of the Memento collection step appears after this list.)
- Internet companies routinely follow users around the web, building profiles for ad targeting based on inferred attributes. Prior work has shown that these practices, generally, are creepy—but what does that mean? To help answer this question, we substantially revised an open-source browser extension built to observe a user's browsing behavior and present them with a tracker's perspective of that behavior. Our updated extension models possible interest inferences far more accurately, integrates data scraped from the user's Google ad dashboard, and summarizes ads the user was shown. Most critically, it introduces ten novel visualizations that show implications of the collected data, both the mundane (e.g., total number of ads you've been served) and the provocative (e.g., your interest in reproductive health, a potentially sensitive topic). We use our extension as a design probe in a week-long field study with 200 participants. We find that users do perceive online tracking as creepy—but that the meaning of creepiness is far from universal. Participants felt differently about creepiness even when their data presented similar visualizations, and even when responding to the most potentially provocative visualizations—in no case did more than 66% of participants agree that any one visualization was creepy.
- Tools that enable end-users to customize websites typically use a two-stage workflow: first, users extract data into a structured form; second, they use that extracted data to augment the original website in some way. This two-stage workflow poses a usability barrier because it requires users to make upfront decisions about what data to extract, rather than allowing them to incrementally extract data as they augment it. In this paper, we present a new, unified interaction model for web customization that encompasses both extraction and augmentation. The key idea is to provide users with a spreadsheet-like formula language that can be used for both data extraction and augmentation. We also provide a programming-by-demonstration (PBD) interface that allows users to create data extraction formulas by clicking on elements in the website. This model allows users to naturally and iteratively move between extraction and augmentation. To illustrate our unified interaction model, we have implemented a tool called Joker, which is an extension of Wildcard, a prior web customization system. Through case studies, we show that Joker can be used to customize many real-world websites. We also present a formative user study with five participants, which showed that people with a wide range of technical backgrounds can use Joker to customize websites, and also revealed some interesting limitations of our approach.
- Websites are malleable: users can run code in the browser to customize them. However, this malleability is typically only accessible to programmers with knowledge of HTML and JavaScript. Previously, we developed a tool called Wildcard which empowers end-users to customize websites through a spreadsheet-like table interface without doing traditional programming. However, there is a limit to end-user agency with Wildcard, because programmers need to first create site-specific adapters mapping website data to the table interface. This means that end-users can only customize a website if a programmer has written an adapter for it, and cannot extend or repair existing adapters. In this paper, we extend Wildcard with a new system for end-user web scraping for customization. It enables end-users to create, extend, and repair adapters by performing concrete demonstrations of how the website user interface maps to a data table. We describe three design principles that guided our system's development and are applicable to other end-user web scraping and customization systems: (a) users should be able to scrape data and use it in a single, unified environment, (b) users should be able to extend and repair the programs that scrape data via demonstration, and (c) users should receive live feedback during their demonstrations. We have successfully used our system to create, extend, and repair adapters by demonstration on a variety of websites, and we provide example usage scenarios that showcase each of our design principles. Our ultimate goal is to empower end-users to customize websites in the course of their daily use in an intuitive and flexible way, thus making the web more malleable for all of its users.
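To make the adapter idea in the last item concrete, here is a minimal, hypothetical sketch (not Wildcard's actual API) of what a demonstration-derived adapter might reduce to: a row selector plus per-column selectors generalized from the elements the user clicked, evaluated against the live page to fill the data table.

```typescript
// Hypothetical adapter shape; names and fields are illustrative only.
interface ColumnSpec {
  name: string;
  selector: string; // CSS selector generalized from clicked example elements
}

interface SiteAdapter {
  rowSelector: string; // one matching element per row of the data table
  columns: ColumnSpec[];
}

// Evaluate the adapter against the live page (in the browser, where
// `document` exists) to produce the rows shown in the table interface.
function scrape(adapter: SiteAdapter): Record<string, string>[] {
  const rows: Record<string, string>[] = [];
  for (const rowEl of Array.from(document.querySelectorAll(adapter.rowSelector))) {
    const row: Record<string, string> = {};
    for (const col of adapter.columns) {
      row[col.name] = rowEl.querySelector(col.selector)?.textContent?.trim() ?? "";
    }
    rows.push(row);
  }
  return rows;
}
```

In this framing, extending or repairing an adapter amounts to re-demonstrating a value so its selector can be re-generalized, and live feedback is simply re-running the scrape after each demonstration.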
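For the HBCU study above, which aggregates historical identifiers (URI-Ms) via Memento, one possible collection step is to request a link-format TimeMap for each department page from the public Time Travel aggregator and parse out the mementos. The endpoint, parsing, and names below are assumptions for illustration, not the authors' exact pipeline.

```typescript
// Hypothetical Memento TimeMap fetch; endpoint and parsing are assumptions.
interface Memento {
  uriM: string;     // URI of an archived capture
  datetime: string; // when the capture was made
}

// Collect URI-Ms for one page across the web archives known to the aggregator.
async function fetchMementos(uriR: string): Promise<Memento[]> {
  const timemapUrl = `http://timetravel.mementoweb.org/timemap/link/${uriR}`;
  const body = await (await fetch(timemapUrl)).text();

  // Link-format entries look roughly like:
  //   <uri-m>; rel="memento"; datetime="Thu, 15 Sep 2016 11:22:33 GMT",
  const mementos: Memento[] = [];
  for (const m of body.matchAll(/<([^>]+)>\s*;([^<]*)/g)) {
    const [, uri, attrs] = m;
    if (!/rel="[^"]*memento[^"]*"/.test(attrs)) continue;
    const datetime = attrs.match(/datetime="([^"]+)"/)?.[1] ?? "";
    mementos.push({ uriM: uri, datetime });
  }
  return mementos;
}
```

The resulting URI-Ms can then be grouped so that captures of a page that later moved to a different URI are treated as one history, which is roughly the canonicalization step the abstract describes.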