Notebook and spreadsheet systems are currently the de facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation, where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps, this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as data caveats. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.
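The core idea of propagating caveats alongside data can be illustrated with a small sketch. The following Python snippet is a hypothetical illustration, not Vizier's actual API: a value carries a set of caveat messages, and any operation on caveated values unions their caveats so that derived results stay flagged.

```python
# Illustrative sketch of caveat propagation; not Vizier's actual API.
# A value carries a set of caveat messages; operations on caveated
# values union the caveats so downstream results remain flagged.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Caveated:
    value: float
    caveats: frozenset = field(default_factory=frozenset)

    def __add__(self, other):
        other = other if isinstance(other, Caveated) else Caveated(other)
        return Caveated(self.value + other.value, self.caveats | other.caveats)

    def __mul__(self, other):
        other = other if isinstance(other, Caveated) else Caveated(other)
        return Caveated(self.value * other.value, self.caveats | other.caveats)


# A cell whose raw input was malformed gets a caveat at ingest time.
price = Caveated(19.99, frozenset({"price parsed from malformed CSV field"}))
quantity = Caveated(3.0)

total = price * quantity
print(total.value)    # 59.97
print(total.caveats)  # the ingest caveat propagates to the derived total
```

A spreadsheet front end could then highlight any cell whose caveat set is non-empty, which is the kind of UI support the abstract describes.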
Data Hunches: Incorporating Personal Knowledge into Visualizations
The trouble with data is that it frequently provides only an imperfect representation of a phenomenon of interest. Experts who are familiar with their datasets will often make implicit, mental corrections when analyzing a dataset, or will be cautious not to be overly confident about their findings if caveats are present. However, personal knowledge about the caveats of a dataset is typically not incorporated in a structured way, which is problematic if others who lack that knowledge interpret the data. In this work, we define such analysts' knowledge about datasets as data hunches. We differentiate data hunches from uncertainty and discuss types of hunches. We then explore ways of recording data hunches, and, based on a prototypical design, develop recommendations for designing visualizations that support data hunches. We conclude by discussing various challenges associated with data hunches, including the potential for harm and challenges for trust and privacy. We envision that data hunches will empower analysts to externalize their knowledge, facilitate collaboration and communication, and support the ability to learn from others' data hunches.
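As a rough illustration of what recording a data hunch in a structured way might look like, the following Python sketch defines a hypothetical annotation record; the paper explores recording mechanisms through a prototypical design but does not prescribe this schema, and all field names here are invented.

```python
# Hypothetical sketch of a structured record for a data hunch; the
# paper does not prescribe this schema. Field names are invented.
from dataclasses import dataclass


@dataclass
class DataHunch:
    target: str       # data item the hunch applies to, e.g. a cell or row id
    kind: str         # e.g. "value adjustment", "exclusion", "annotation"
    rationale: str    # the analyst's reasoning, kept alongside the data
    author: str       # who recorded the hunch, relevant to trust and privacy
    suggested_value: float | None = None  # optional proposed correction


hunch = DataHunch(
    target="clinic_42.patient_count",
    kind="value adjustment",
    rationale="Clinic 42 under-reports in December due to staff holidays.",
    author="analyst_a",
    suggested_value=118.0,
)
```

Keeping the rationale and author with the record is what would let collaborators who lack the original analyst's knowledge see and weigh the hunch.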
- PAR ID: 10498725
- Publisher / Repository: IEEE
- Date Published:
- Journal Name: IEEE Transactions on Visualization and Computer Graphics
- Volume: 29
- Issue: 1
- ISSN: 1077-2626
- Page Range / eLocation ID: 504 to 514
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
In the United States, state learning standards guide curriculum, assessment, teacher certification, and other key drivers of the student learning experience. Investigating standards allows us to answer many big questions about the field of K-12 computer science (CS) education. Our team has created a dataset of state-level K-12 CS standards for all US states that currently have such standards (n = 42). This dataset was created by CS subject matter experts, who manually tagged each of the approximately 10,000 state CS standards with its assigned grade level/band, its category/topic, and, if applicable, the CSTA standard to which it is identical or similar. We also determined each standard's cognitive complexity using Bloom's Revised Taxonomy. Using the dataset, we analyzed each state's CS standards with a variety of metrics and approaches. To our knowledge, this is the first comprehensive, publicly available dataset of state CS standards that includes these factors. We believe that this dataset will be useful to other CS education researchers, including those who want to better understand the state and national landscape of K-12 CS education in the US, the characteristics of CS learning standards, and the coverage of particular CS topics (e.g., cybersecurity, AI), among other subjects. In this lightning talk, we will introduce the dataset's features as well as some tools we have developed (e.g., to determine a standard's Bloom's level) that may be useful to others who use the dataset.
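To illustrate the kinds of analyses such a dataset enables, here is a hypothetical pandas sketch; the file name and column names (state, grade_band, topic, bloom_level, csta_match) are assumptions for illustration, not taken from the actual dataset's documentation.

```python
# Hypothetical sketch of analyzing the standards dataset with pandas;
# file and column names are assumed, not the dataset's real schema.
import pandas as pd

standards = pd.read_csv("state_cs_standards.csv")

# Coverage of one topic (e.g., cybersecurity) by state and grade band.
cyber = standards[standards["topic"] == "cybersecurity"]
print(cyber.groupby(["state", "grade_band"]).size())

# Distribution of cognitive complexity per Bloom's Revised Taxonomy.
print(standards["bloom_level"].value_counts(normalize=True))

# Share of state standards identical or similar to some CSTA standard.
print(standards["csta_match"].notna().mean())
```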
If you want to estimate whether height is related to weight in humans, what would you do? You could measure the height and weight of a large number of people, and then run a statistical test. Such ‘independence tests’ can be thought of as a screening procedure: if the two properties (height and weight) are not related, then there is no point in proceeding with further analyses.

Many different independence tests have been developed over the last 100 years. However, classical approaches often fail to accurately discern relationships in the large, complex datasets typical of modern biomedical research. For example, connectomics datasets include tens or hundreds of thousands of connections between neurons that collectively underlie how the brain performs certain tasks. Discovering and deciphering relationships from these data is currently the largest barrier to progress in these fields. Another drawback of currently used methods of independence testing is that they act as a ‘black box’, giving an answer without making it clear how it was calculated. This can make it difficult for researchers to reproduce their findings, a key part of confirming a scientific discovery. Vogelstein et al. therefore sought to develop a method of performing independence tests on large datasets that can easily be both applied and interpreted by practicing scientists.

The method developed by Vogelstein et al., called Multiscale Graph Correlation (MGC, pronounced ‘magic’), combines recent developments in hypothesis testing, machine learning, and data science. As a result, MGC typically requires between one-third and one-half the sample size of previously proposed methods for analyzing large, complex datasets. Moreover, MGC indicates the nature of the relationship between different properties; for example, whether it is a linear relationship or not.

Testing MGC on real biological data, including a cancer dataset and a human brain imaging dataset, revealed that it is more effective at finding possible relationships than other commonly used independence methods. MGC was also the only method that explained how it found those relationships.

MGC will enable relationships to be found in data across many fields of inquiry, and not only in biology. Scientists, policy analysts, data journalists, and corporate data scientists could all use MGC to learn about the relationships present in their data. To that end, Vogelstein et al. have made the code open source in MATLAB, R, and Python.
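As a usage sketch, the following example runs MGC via SciPy's implementation (scipy.stats.multiscale_graphcorr), one publicly available version of the method; the authors' own MATLAB, R, and Python releases may differ in interface.

```python
# Usage sketch of MGC via SciPy's scipy.stats.multiscale_graphcorr.
import numpy as np
from scipy.stats import multiscale_graphcorr

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = x ** 2 + rng.normal(scale=0.05, size=100)  # nonlinear relationship

# Permutation test: a small p-value is evidence against independence.
stat, pvalue, mgc_dict = multiscale_graphcorr(x, y, reps=1000, random_state=0)
print(f"MGC statistic: {stat:.3f}, p-value: {pvalue:.3f}")

# mgc_dict carries the multiscale correlation map that MGC uses to
# characterize the nature of the relationship (e.g., linear or not),
# which is the "explains how it found it" property described above.
```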
Networks are a natural way of thinking about many datasets. The data on which a network is based, however, is rarely collected in a form that suits the analysis process, making it necessary to create and reshape networks. Data wrangling is widely acknowledged to be a critical part of the data analysis pipeline, yet interactive network wrangling has received little attention in the visualization research community. In this paper, we discuss a set of operations that are important for wrangling network datasets and introduce a visual data wrangling tool, Origraph, that enables analysts to apply these operations to their datasets. Key operations include creating a network from source data such as tables, reshaping a network by introducing new node or edge classes, filtering nodes or edges, and deriving new node or edge attributes. Origraph enables analysts to execute these operations with little to no programming and to immediately visualize the results. It provides views to investigate the network model, a sample of the network, and node and edge attributes. In addition, we introduce interfaces designed to aid analysts in specifying arguments for sensible network wrangling operations. We demonstrate the usefulness of Origraph in two use cases: first, we investigate gender bias in the film industry; second, we examine the influence of money on political support for the war in Yemen.
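Origraph itself is interactive and requires little to no programming, but its core operations have programmatic analogues. The following sketch uses networkx with invented table and attribute names to mimic the operations listed above: creating a network from tables, introducing a new edge class by projection, deriving a node attribute, and filtering on it.

```python
# Programmatic analogue of Origraph-style network wrangling, sketched
# with networkx; table and attribute names are invented for illustration.
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

# Create a network from tabular source data (actors linked to films).
edges = pd.DataFrame({
    "actor": ["A", "A", "B", "C"],
    "film": ["F1", "F2", "F2", "F1"],
})
g = nx.from_pandas_edgelist(edges, "actor", "film")

# Reshape: project the bipartite actor-film network onto a new edge
# class connecting actors who appeared in the same film.
actors = set(edges["actor"])
co_acting = bipartite.projected_graph(g, actors)

# Derive a new node attribute from network structure.
nx.set_node_attributes(co_acting, dict(co_acting.degree()), "num_costars")

# Filter nodes by the derived attribute.
prolific = co_acting.subgraph(
    n for n, d in co_acting.nodes(data=True) if d["num_costars"] >= 2
)
print(prolific.nodes(data=True))
```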
During the COVID-19 pandemic, most, if not all, animal rescues, sanctuaries, zoos, and aquariums experienced financial distress. This stress had an impact on the welfare of animals and their human caretakers, an issue important to ecological social work. We draw on a novel dataset (n = 2,060) to assess support for policies to extend emergency funding to animal support and conservation organizations in extreme events. We find that, on average, women and nonbinary individuals, those with more education, pet owners, people who are concerned about other humans (humanistic altruism), and those who have greater concern for animals report greater support.