Flexible rule-based decomposition and metadata independence in modin: a parallel dataframe system
Dataframes have become universally popular as a means to represent data in various stages of structure, and manipulate it using a rich set of operators---thereby becoming an essential tool in the data scientists' toolbox. However, dataframe systems, such as pandas, scale poorly---and are non-interactive on moderate to large datasets. We discuss our experiences developing Modin, our first cut at a parallel dataframe system, which already has users across several industries and over 1M downloads. Modin translates pandas functions into a core set of operators that are individually parallelized via columnar, row-wise, or cell-wise decomposition rules that we formalize in this paper. We also introduce metadata independence to allow metadata---such as order and type---to be decoupled from the physical representation and maintained lazily. Using rule-based decomposition and metadata independence, along with careful engineering, Modin is able to support pandas operations across both rows and columns on very large dataframes---unlike Koalas and Dask DataFrames that either break down or are unable to support such operations, while also being much faster than pandas.
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- Proceedings of the VLDB Endowment
- Page Range or eLocation-ID:
- 739 to 751
- Sponsoring Org:
- National Science Foundation
More Like this
Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) and etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformations frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance that maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate with examplesmore »
Obeid, Iyad ; Picone, Joseph ; Selesnick, Ivan (Ed.)The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) . Our long-term goal is to release one million images. We expect to release the first 100,000 image corpus by December 2020. The data is being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner  and consists entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. . We currently have a National Science Foundation (NSF) planning grant  to explore how best the community can leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public. The computing infrastructure required to support this database is extensive  and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio’s eSlide Manager (eSM) software . We currently have digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per casemore »
In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision-making and applications. Scaling data analyses to large volumes of data requires the utilization of distributed frameworks. This can lead to serious technical challenges for data analysts and reduce their productivity. AFrame, a data analytics library, is implemented as a layer on top of Apache AsterixDB, addressing these issues by providing the data scientists' familiar interface, Pandas Dataframe, and transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame is able to leverage data management facilities (e.g., indexes and query optimization) and allows users to interact with a large volume of data, the initial version only generated SQL++ queries and only operated against AsterixDB. In this work, we describe a new design that retargets AFrame's incremental query formation to other query-based database systems, making it more flexible for deployment against other data management systems with composable query languages.
Exploratory data science largely happens in computational notebooks with dataframe APIs, such as pandas, that support flexible means to transform, clean, and analyze data. Yet, visually exploring data in dataframes remains tedious, requiring substantial programming effort for visualization and mental effort to determine what analysis to perform next. We propose Lux, an always-on framework for accelerating visual insight discovery in dataframe workflows. When users print a dataframe in their notebooks, Lux recommends visualizations to provide a quick overview of the patterns and trends and suggests promising analysis directions. Lux features a high-level language for generating visualizations on demand to encourage rapid visual experimentation with data. We demonstrate that through the use of a careful design and three system optimizations, Lux adds no more than two seconds of overhead on top of pandas for over 98% of datasets in the UCI repository. We evaluate Lux in terms of usability via interviews with early adopters, finding that Lux helps fulfill the needs of data scientists for visualization support within their dataframe workflows. Lux has already been embraced by data science practitioners, with over 3.1k stars on Github.