Title: Lux: always-on visualization recommendations for exploratory dataframe workflows
Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next. We propose Lux, an always-on framework for accelerating visual insight discovery in dataframe workflows. When users print a dataframe in their notebooks, Lux recommends visualizations to provide a quick overview of the patterns and trends and suggests promising analysis directions. Lux features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data. We demonstrate that through the use of a careful design and three system optimizations, Lux adds no more than two seconds of overhead on top of pandas for over 98% of datasets in the UCI repository. We evaluate Lux in terms of usability via interviews with early adopters, finding that Lux helps fulfill the needs of data scientists for visualization support within their dataframe workflows. Lux has already been embraced by data science practitioners, with over 3.1k stars on GitHub.
Award ID(s):
1940757
NSF-PAR ID:
10324482
Journal Name:
Proceedings of the VLDB Endowment
Volume:
15
Issue:
3
ISSN:
2150-8097
Page Range / eLocation ID:
727 to 738
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data visualization provides a powerful way for analysts to explore and make data-driven discoveries. However, current visual analytic tools provide only limited support for hypothesis-driven inquiry, as their built-in interactions and workflows are primarily intended for exploratory analysis. Visualization tools notably lack capabilities that would allow users to visually and incrementally test the fit of their conceptual models and provisional hypotheses against the data. This imbalance could bias users to overly rely on exploratory analysis as the principal mode of inquiry, which can be detrimental to discovery. In this paper, we introduce Visual (dis) Confirmation, a tool for conducting confirmatory, hypothesis-driven analyses with visualizations. Users interact by framing hypotheses and data expectations in natural language. The system then selects conceptually relevant data features and automatically generates visualizations to validate the underlying expectations. Distinctively, the resulting visualizations also highlight places where one's mental model disagrees with the data, so as to stimulate reflection. The proposed tool represents a new class of interactive data systems capable of supporting confirmatory visual analysis, and responding more intelligently by spotlighting gaps between one's knowledge and the data. We describe the algorithmic techniques behind this workflow. We also demonstrate the utility of the tool through a case study. 
  2. Dataframes have become universally popular as a means to represent data in various stages of structure, and manipulate it using a rich set of operators---thereby becoming an essential tool in the data scientists' toolbox. However, dataframe systems, such as pandas, scale poorly---and are non-interactive on moderate to large datasets. We discuss our experiences developing Modin, our first cut at a parallel dataframe system, which already has users across several industries and over 1M downloads. Modin translates pandas functions into a core set of operators that are individually parallelized via columnar, row-wise, or cell-wise decomposition rules that we formalize in this paper. We also introduce metadata independence to allow metadata---such as order and type---to be decoupled from the physical representation and maintained lazily. Using rule-based decomposition and metadata independence, along with careful engineering, Modin is able to support pandas operations across both rows and columns on very large dataframes---unlike Koalas and Dask DataFrames that either break down or are unable to support such operations, while also being much faster than pandas. 
  3. Comparative visualizations and the comparison tasks they support constitute a crucial part of visual data analysis on complex data sets. Existing approaches are ad hoc and often require significant effort to produce comparative visualizations, which is impractical especially in cases where visualizations have to be amended in response to changes in the underlying data. We show that the combination of parameterized visualizations and variations yields an effective model for comparative visualizations. Our approach supports data exploration and automatic visualization updates when the underlying data changes. We provide a prototype implementation and demonstrate that our approach covers most of existing comparative visualizations. 
  4. Abstract Summary

    Gos is a declarative Python library designed to create interactive multiscale visualizations of genomics and epigenomics data. It provides a consistent and simple interface to the flexible Gosling visualization grammar. Gos hides technical complexities involved with configuring web-based genome browsers and integrates seamlessly within computational notebook environments to enable new interactive analysis workflows.

    Availability and implementation

    Gos is released under the MIT License and available on the Python Package Index (PyPI). The source code is publicly available on GitHub (https://github.com/gosling-lang/gos), and documentation with examples can be found at https://gosling-lang.github.io/gos.

     
  5. We introduce Artifact-Based Rendering (ABR), a framework of tools, algorithms, and processes that makes it possible to produce real, data-driven 3D scientific visualizations with a visual language derived entirely from colors, lines, textures, and forms created using traditional physical media or found in nature. A theory and process for ABR is presented to address three current needs: (i) designing better visualizations by making it possible for non-programmers to rapidly design and critique many alternative data-to-visual mappings; (ii) expanding the visual vocabulary used in scientific visualizations to depict increasingly complex multivariate data; (iii) bringing a more engaging, natural, and human-relatable handcrafted aesthetic to data visualization. New tools and algorithms to support ABR include front-end applets for constructing artifact-based colormaps, optimizing 3D scanned meshes for use in data visualization, and synthesizing textures from artifacts. These are complemented by an interactive rendering engine with custom algorithms and interfaces that demonstrate multiple new visual styles for depicting point, line, surface, and volume data. A within-the-research-team design study provides early evidence of the shift in visualization design processes that ABR is believed to enable when compared to traditional scientific visualization systems. Qualitative user feedback on applications to climate science and brain imaging supports the utility of ABR for scientific discovery and public communication.