Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines

Grafberger, Stefan; Stoyanovich, Julia; Schelter, Sebastian


Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this widespread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. In this paper we discuss such hard-to-identify data issues and describe mlinspect, a library that enables lightweight lineage-based inspection of ML preprocessing pipelines. The key idea is to extract a directed acyclic graph representation of the data flow from ML preprocessing pipelines in Python, and to use this representation to automatically instrument the code with predefined inspections based on a lightweight annotation propagation approach. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines and does not require manual code instrumentation. We discuss the design and implementation of the mlinspect prototype, and give a complex end-to-end example that illustrates its functionality.
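
The annotation propagation idea in the abstract can be illustrated with a small, self-contained Python sketch. This is not mlinspect's API or implementation: the `Node` class, the `run_with_inspection` function, the operator kinds, and the sample rows are all hypothetical, and a linear chain of operators stands in for a general data-flow DAG. The sketch attaches the value of a sensitive column to each row as an annotation at the source, carries it through a filter and a projection, and records a per-operator histogram of that annotation, so a shift in group proportions stays visible even after the column itself has been projected away.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
Annotated = Tuple[Row, object]  # (row data, propagated annotation)


@dataclass
class Node:
    """One operator in a toy data-flow graph (hypothetical, not mlinspect's classes)."""
    name: str
    kind: str                                           # "source", "filter", or "project"
    parent: Optional["Node"] = None
    rows: Optional[List[Row]] = None                    # used by "source"
    predicate: Optional[Callable[[Row], bool]] = None   # used by "filter"
    columns: Optional[List[str]] = None                 # used by "project"


def run_with_inspection(sink: Node, sensitive: str) -> Dict[str, Counter]:
    """Evaluate the chain ending at `sink` and return per-operator group counts."""
    histograms: Dict[str, Counter] = {}

    def evaluate(node: Node) -> List[Annotated]:
        if node.kind == "source":
            # Attach the annotation once, while the sensitive column still exists.
            out = [(row, row[sensitive]) for row in node.rows]
        elif node.kind == "filter":
            # Dropping a row drops its annotation with it.
            out = [(row, ann) for row, ann in evaluate(node.parent)
                   if node.predicate(row)]
        elif node.kind == "project":
            # The column may disappear here, but the annotation travels on.
            out = [({c: row[c] for c in node.columns}, ann)
                   for row, ann in evaluate(node.parent)]
        else:
            raise ValueError(f"unknown operator kind: {node.kind}")
        # The "inspection": a histogram of the annotation after this operator.
        histograms[node.name] = Counter(ann for _, ann in out)
        return out

    evaluate(sink)
    return histograms


# Made-up sample data for illustration only.
source = Node("load", "source", rows=[
    {"age": 25, "group": "a"}, {"age": 61, "group": "b"},
    {"age": 34, "group": "a"}, {"age": 70, "group": "b"},
])
filtered = Node("age_filter", "filter", parent=source,
                predicate=lambda r: r["age"] < 65)
projected = Node("select_age", "project", parent=filtered, columns=["age"])

histograms = run_with_inspection(projected, sensitive="group")
for name, counts in histograms.items():
    print(name, dict(counts))
```

Comparing the histogram for `load` with the one for `age_filter` shows how the filter changes the balance between the two groups, and the histogram for `select_age` remains available although the `group` column is no longer part of the data at that point.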

Award ID(s): 1926250; 1934464

PAR ID: 10287319

Author(s) / Creator(s): Grafberger, Stefan; Stoyanovich, Julia; Schelter, Sebastian

Date Published: 2021-01-01

Journal Name: Conference on Innovative Data Systems Research (CIDR)

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.
