Learning from Irreproducibility: Introducing Data Leakage Case Studies for Machine Learning Education

Fund, Fraida; Saeed, Mohamed; Malik, Shaivi; Ishak, Kyrillos

This content will become publicly available on July 29, 2026

Data leakage remains a pervasive issue in machine learning (ML), especially when applied to science, leading to overly optimistic performance estimates and irreproducible findings. Despite its prevalence, data leakage receives limited attention in ML education, in part due to the lack of accessible, hands-on teaching resources. To address this gap, we developed interactive learning modules in which students reproduce examples from academic publications that are affected by data leakage, then repeat the evaluation without the data leakage error to see how the finding is affected. These modules were deployed by the authors in two introductory machine learning courses, enabling students to explore common forms of leakage and their impact on model reliability. Following their engagement with these materials, student feedback highlighted increased awareness of subtle pitfalls that can compromise machine learning workflows.

Award ID(s): 2226408

PAR ID: 10636853

Author(s) / Creator(s): Fund, Fraida; Saeed, Mohamed; Malik, Shaivi; Ishak, Kyrillos

Publisher / Repository: ACM

Date Published: 2025-07-29

Format(s): Medium: X

Location: Vancouver, BC, Canada

Sponsoring Org: National Science Foundation

Conference Paper: The DOI is not currently available.
