- PAR ID:
- 10317779
- Date Published:
- Journal Name:
- The VLDB Journal
- ISSN:
- 1066-8888
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. In this paper we discuss such hard-to-identify data issues and describe mlinspect, a library that enables lightweight lineage-based inspection of ML preprocessing pipelines. The key idea is to extract a directed acyclic graph representation of the data flow from ML preprocessing pipelines in Python, and to use this representation to automatically instrument the code with predefined inspections based on a lightweight annotation propagation approach. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect prototype, and give a complex end-to-end example that illustrates its functionality.
-
Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. While bias detection cannot be fully automated, computational tools can help pinpoint particular types of data issues. We recently proposed mlinspect, a library that enables lightweight lineage-based inspection of ML preprocessing pipelines. In this demonstration, we show how mlinspect can be used to detect data distribution bugs in a representative pipeline. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines, can handle both relational and matrix data, and does not require manual code instrumentation. The library is publicly available at https://github.com/stefan-grafberger/mlinspect.
-
The performance of inference with machine learning (ML) models and its integration with analytical query processing have become critical bottlenecks for data analysis in many organizations. An ML inference pipeline typically consists of a preprocessing workflow followed by prediction with an ML model. Current approaches for in-database inference implement preprocessing operators and ML algorithms in the database either natively, by transpiling code to SQL, or by executing user-defined functions in guest languages such as Python. In this work, we present a radically different approach that approximates an end-to-end inference pipeline (preprocessing plus prediction) using a lightweight embedding that discretizes a carefully selected subset of the input features and an index that maps data points in the embedding space to aggregated predictions of an ML model. We replace a complex preprocessing workflow and model-based inference with a simple feature transformation and an index lookup. Our framework improves inference latency by several orders of magnitude while maintaining similar prediction accuracy compared to the pipeline it approximates.
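The idea of replacing a pipeline with a discretizing embedding plus an index lookup can be illustrated with a minimal sketch. This is not the authors' framework: the function names (`discretize`, `build_index`, `approx_predict`), the fixed bin edges, the use of the mean as the aggregate, and the toy `slow_pipeline` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def discretize(row, selected, edges):
    """Embed a row as a tuple of bin ids over the selected features only."""
    # Bin id = number of bin edges the feature value is greater than or equal to.
    return tuple(sum(row[f] >= e for e in edges[f]) for f in selected)


def build_index(rows, predict, selected, edges):
    """Run the full pipeline once per row and aggregate predictions per bin tuple."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[discretize(row, selected, edges)].append(predict(row))
    # Aggregate with the mean (an assumption; other aggregates are possible).
    return {key: mean(preds) for key, preds in buckets.items()}


def approx_predict(row, index, selected, edges, default):
    """Approximate inference: one feature transformation and one lookup."""
    return index.get(discretize(row, selected, edges), default)


# Hypothetical stand-in for an expensive preprocessing + model pipeline.
def slow_pipeline(row):
    return 1.0 if row["age"] > 40 and row["income"] > 50 else 0.0


selected = ["age", "income"]            # hand-picked subset of input features
edges = {"age": [40], "income": [50]}   # hypothetical bin boundaries
rows = [
    {"age": 25, "income": 30, "noise": 1},
    {"age": 45, "income": 80, "noise": 2},
    {"age": 50, "income": 90, "noise": 3},
    {"age": 30, "income": 70, "noise": 4},
]
index = build_index(rows, slow_pipeline, selected, edges)
query = {"age": 60, "income": 55, "noise": 9}
approx = approx_predict(query, index, selected, edges, default=0.0)
```

At query time the original pipeline is never invoked. Accuracy in this sketch depends entirely on how well the chosen features and bin edges separate the pipeline's outputs, and a query that falls into a bin unseen at build time returns the supplied default.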
-
Surfacing and mitigating bias in ML pipelines is a complex topic, and data scientists urgently need system-level support for it. Humans should be empowered to debug these pipelines, in order to control for bias and to improve data quality and representativeness. We propose fairDAGs, an open-source library that extracts directed acyclic graph (DAG) representations of the data flow in preprocessing pipelines for ML. The library subsequently instruments the pipelines with tracing and visualization code to capture changes in data distributions and identify distortions with respect to protected group membership as the data travels through the pipeline. We illustrate the utility of fairDAGs with experiments on publicly available ML pipelines.
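Tracing how a protected group's share of the data changes from step to step can be sketched in a few lines. The code below is not the fairDAGs API: it assumes a pipeline given as an ordered list of named functions over lists of dicts, and the names `group_shares`, `trace_pipeline`, and `drop_missing_income` are hypothetical.

```python
from collections import Counter


def group_shares(rows, protected):
    """Return the fraction of rows belonging to each protected-group value."""
    counts = Counter(r[protected] for r in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


def trace_pipeline(rows, steps, protected):
    """Apply each step in order and record group shares after every step."""
    report = [("input", group_shares(rows, protected))]
    for name, step in steps:
        rows = step(rows)
        report.append((name, group_shares(rows, protected)))
    return report


# Hypothetical preprocessing step: drop rows with a missing income value.
def drop_missing_income(rows):
    return [r for r in rows if r["income"] is not None]


rows = [
    {"group": "a", "income": 10},
    {"group": "a", "income": 20},
    {"group": "b", "income": None},
    {"group": "b", "income": 30},
]
report = trace_pipeline(rows, [("drop_missing_income", drop_missing_income)], "group")
for name, shares in report:
    print(name, shares)
```

In this toy input both groups start at an equal share, and the filter removes one row from group "b" only, so the per-step report makes the resulting shift in representation visible.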
-
In addition to the standard observational assessment for autism spectrum disorder (ASD), recent advancements in neuroimaging and machine learning (ML) suggest a rapid and objective alternative using brain imaging. This work presents a pipelined framework using functional magnetic resonance imaging (fMRI) that allows not only an accurate ASD diagnosis but also the identification of the brain regions contributing to the diagnosis decision. The proposed framework includes several processing stages: preprocessing, brain parcellation, feature representation, feature selection, and ML classification. For feature representation, the proposed framework uses both a conventional feature representation and a novel dynamic connectivity representation to assist in the accurate classification of an autistic individual. This extensive research highlights different decisions along the proposed pipeline and their impact on diagnostic accuracy. A large publicly available dataset of 884 subjects from the Autism Brain Imaging Data Exchange I (ABIDE-I) initiative is used to validate our proposed framework, achieving a global balanced accuracy of 98.8% with five-fold cross-validation and proving the potential of the proposed feature representation. As a result of this comprehensive study, we achieve state-of-the-art accuracy, confirming the benefits of the proposed feature representation and feature engineering in extracting useful information as well as the potential benefits of utilizing ML and neuroimaging in the diagnosis and understanding of autism.