Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next. We propose Lux, an always-on framework for accelerating visual insight discovery in dataframe workflows. When users print a dataframe in their notebooks, Lux recommends visualizations to provide a quick overview of the patterns and trends and suggests promising analysis directions. Lux features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data. We demonstrate that through the use of a careful design and three system optimizations, Lux adds no more than two seconds of overhead on top of pandas for over 98% of datasets in the UCI repository. We evaluate Lux in terms of usability via interviews with early adopters, finding that Lux helps fulfill the needs of data scientists for visualization support within their dataframe workflows. Lux has already been embraced by data science practitioners, with over 3.1k stars on Github.
more »
« less
Flexible rule-based decomposition and metadata independence in modin: a parallel dataframe system
Dataframes have become universally popular as a means to represent data in various stages of structure, and manipulate it using a rich set of operators---thereby becoming an essential tool in the data scientists' toolbox. However, dataframe systems, such as pandas, scale poorly---and are non-interactive on moderate to large datasets. We discuss our experiences developing Modin, our first cut at a parallel dataframe system, which already has users across several industries and over 1M downloads. Modin translates pandas functions into a core set of operators that are individually parallelized via columnar, row-wise, or cell-wise decomposition rules that we formalize in this paper. We also introduce metadata independence to allow metadata---such as order and type---to be decoupled from the physical representation and maintained lazily. Using rule-based decomposition and metadata independence, along with careful engineering, Modin is able to support pandas operations across both rows and columns on very large dataframes---unlike Koalas and Dask DataFrames that either break down or are unable to support such operations, while also being much faster than pandas.
more »
« less
- Award ID(s):
- 1940757
- PAR ID:
- 10324483
- Date Published:
- Journal Name:
- Proceedings of the VLDB Endowment
- Volume:
- 15
- Issue:
- 3
- ISSN:
- 2150-8097
- Page Range / eLocation ID:
- 739 to 751
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The data engineering and data science community has embraced the idea of using Python and R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these frameworks are now ever more important in order to process terabytes of data. They can easily exceed the capabilities of a single machine but also demand significant developer time and effort due to their convenience and ability to manipulate data with high-level abstractions that can be optimized. Therefore it is essential to design scalable dataframe solutions. There have been multiple efforts to be integrated into the most efficient fashion to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask and Ray's distributed computing features look very promising, we perceive that the Dask Dataframes and Ray Datasets still have room for optimization In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (superchargingthem!). To achieve this, we integrate ahigh-performance dataframesystem Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30 × more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate high-performance computing and distributed computing ecosystems.more » « less
-
Analyzing the increasingly large volumes of data that are available today, possibly including the application of custom machine learning models, requires the utilization of distributed frameworks. This can result in serious productivity issues for “normal” data scientists. This paper introduces AFrame, a new scalable data analysis package powered by a Big Data management system that extends the data scientists' familiar DataFrame operations to efficiently operate on managed data at scale. AFrame is implemented as a layer on top of Apache AsterixDB, transparently scaling out the execution of DataFrame operations and machine learning model invocation through a parallel, shared-nothing big data management system. AFrame incrementally constructs SQL++ queries and leverages AsterixDB's semistructured data management facilities, user-defined function support, and live data ingestion support. In order to evaluate the proposed approach, this paper also introduces an extensible micro-benchmark for use in evaluating DataFrame performance in both single-node and distributed settings via a collection of representative analytic operations. This paper presents the architecture of AFrame, describes the underlying capabilities of AsterixDB that efficiently support modern data analytic operations, and utilizes the proposed benchmark to evaluate and compare the performance and support for largescale data analyses provided by alternative DataFrame libraries.more » « less
-
Computational notebooks, such as Jupyter, support rich data visualization. However, even when visualizations in notebooks are interactive, they are a dead end: Interactive data manipulations, such as selections, applying labels, filters, categorizations, or fixes to column or cell values, could be efficiently applied in interactive visual components, but interactive components typically cannot manipulate Python data structures. Furthermore, actions performed in interactive plots are lost as soon as the cell is re‐run, prohibiting reusability and reproducibility. To remedy this problem, we introduce Persist, a family of techniques to (a) capture interaction provenance, enabling the persistence of interactions, and (b) map interactions to data manipulations that can be applied to dataframes. We implement our approach as a JupyterLab extension that supports tracking interactions in Vega‐Altair plots and in a data table view. Persist can re‐execute interaction provenance when a notebook or a cell is re‐executed, enabling reproducibility and re‐use. We evaluate Persist in a user study targeting data manipulations with 11 participants skilled in Python and Pandas, comparing it to traditional code‐based approaches. Participants were consistently faster and were able to correctly complete more tasks with Persist.more » « less
-
Analyzing large-scale graphs poses challenges due to their increasing size and the demand for interactive and user-friendly analytics tools. These graphs arise from various domains, including cybersecurity, social sciences, health sciences, and network sciences, where networks can represent interactions between humans, neurons in the brain, or malicious flows in a network. Exploring these large graphs is crucial for revealing hidden structures and metrics that are not easily computable without parallel computing. Currently, Python users can leverage the open-source Arkouda framework to efficiently execute Pandas and NumPy-related tasks on thousands of cores. To address large-scale graph analysis, Arachne, an extension to Arkouda, enables easy transformation of Arkouda dataframes into graphs. This paper proposes and evaluates three distributable data structures for property graphs, implemented in Chapel, that are integrated into Arachne. Enriching Arachne with support for property graphs will empower data scientists to extend their analysis to new problem domains. Property graphs present additional complexities, requiring efficient storage for extra information on vertices and edges, such as labels, relationships, and properties.more » « less
An official website of the United States government

