Recent advances in deep learning have demonstrated the ability of learning-based methods to tackle very hard downstream tasks. Historically, these advances have come in predictive tasks, while tasks closer to the traditional KDD (Knowledge Discovery in Databases) pipeline have enjoyed proportionally fewer of them. Can learning-based approaches help with inherently hard problems in the KDD pipeline, such as determining how many patterns are in the data, what the different structures in the data are, and how those structures can be robustly extracted? In this vision paper, we argue for the need for synthetic data generators to empower cheaply-supervised, learning-based solutions for knowledge discovery. We describe the general idea, present early proof-of-concept results that speak to the viability of the paradigm, and outline a number of exciting challenges that await, along with a set of milestones for measuring success.
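To illustrate the paradigm, here is a minimal sketch under entirely illustrative assumptions (Gaussian-blob data, a hypothetical dataset_features summary, scikit-learn models; not the authors' system): a synthetic generator yields datasets whose ground truth, here the number of clusters, is known for free, and a supervised model is trained on that free supervision to estimate the pattern count of an unseen dataset.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

def dataset_features(X):
    """Cheap dataset-level summary features; a purely illustrative choice."""
    d = np.linalg.norm(X - X.mean(axis=0), axis=1)
    return np.concatenate([X.std(axis=0), np.percentile(d, [10, 50, 90])])

rng = np.random.default_rng(0)
feats, labels = [], []
for _ in range(500):                     # supervision comes for free
    k = int(rng.integers(2, 8))          # ground-truth number of patterns
    X, _ = make_blobs(n_samples=200, centers=k, n_features=2,
                      cluster_std=rng.uniform(0.5, 2.0),
                      random_state=int(rng.integers(10**6)))
    feats.append(dataset_features(X))
    labels.append(k)

model = RandomForestClassifier(random_state=0).fit(np.array(feats), labels)
X_new, _ = make_blobs(n_samples=200, centers=4, n_features=2, random_state=7)
print("estimated number of patterns:", model.predict([dataset_features(X_new)])[0])
```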
Statistically-sound Knowledge Discovery from Data
Knowledge Discovery from Data (KDD) has mostly focused on understanding the available data. Statistically-sound KDD shifts the goal to understanding the partially unknown, random Data Generating Process (DGP) that generates the data. This shift is necessary to ensure that the results of data analysis constitute new knowledge about the DGP, as required by the practice of scientific research and by many industrial applications, and to avoid costly false discoveries. In statistically-sound KDD, results obtained from the data are considered hypotheses, and they must undergo statistical testing before being deemed significant, i.e., informative about the DGP. The challenges include (1) subjecting the hypotheses to severe testing so that it is hard for them to be deemed significant; (2) treating the simultaneous testing of multiple hypotheses as the default setting, not an afterthought; (3) offering flexible statistical guarantees at different stages of the discovery process; and (4) achieving scalability along multiple axes, from the size of the data to the number and complexity of the hypotheses to be tested. Success for statistically-sound KDD as a field will be achieved with (1) the introduction of a rich collection of null models that are representative of the KDD tasks and of field experts' existing knowledge of the DGP; (2) the development of scalable algorithms for testing results of many KDD tasks on different data types; and (3) the availability of benchmark dataset generators that allow these algorithms to be thoroughly evaluated.
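To make the testing pipeline concrete, below is a minimal sketch, assuming a binary transaction dataset and a handful of candidate itemsets, of one classic instance of this paradigm: itemset supports are assessed against a swap-randomization null model that preserves the row and column margins of the data, and the resulting empirical p-values are corrected for multiple testing with Benjamini-Hochberg. It is a generic textbook-style illustration, not this paper's algorithms; all helpers and parameters are illustrative.

```python
import numpy as np

def swap_randomize(D, n_swaps, rng):
    """Sample a binary matrix with the same row/column margins as D."""
    D = D.copy()
    rows, cols = np.nonzero(D)
    for _ in range(n_swaps):
        i, j = rng.integers(len(rows), size=2)
        r1, c1, r2, c2 = rows[i], cols[i], rows[j], cols[j]
        # A valid swap moves the 1s at (r1,c1),(r2,c2) to (r1,c2),(r2,c1).
        if r1 != r2 and c1 != c2 and D[r1, c2] == 0 and D[r2, c1] == 0:
            D[r1, c1] = D[r2, c2] = 0
            D[r1, c2] = D[r2, c1] = 1
            cols[i], cols[j] = c2, c1
    return D

def support(D, itemset):
    """Number of rows containing all items of the itemset."""
    return int(D[:, itemset].all(axis=1).sum())

def empirical_pvalue(D, itemset, n_null=200, rng=None):
    """P(support under the null >= observed support), add-one smoothed."""
    rng = rng or np.random.default_rng(0)
    n_swaps = 5 * int(D.sum())           # common rule of thumb
    obs = support(D, itemset)
    hits = sum(support(swap_randomize(D, n_swaps, rng), itemset) >= obs
               for _ in range(n_null))
    return (hits + 1) / (n_null + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, len(p) + 1) / len(p)
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    return sorted(order[:k])

rng = np.random.default_rng(0)
D = (rng.random((100, 12)) < 0.3).astype(int)
D[:40, [2, 5]] = 1                       # plant a genuinely frequent pattern
itemsets = [[2, 5], [0, 1], [3, 7, 9]]
pvals = [empirical_pvalue(D, s, rng=rng) for s in itemsets]
print(pvals, "significant:", benjamini_hochberg(pvals))
```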
- Award ID(s):
- 2006765
- PAR ID:
- 10464550
- Date Published:
- Journal Name:
- Proceedings of the 2023 SIAM International Conference on Data Mining (SDM)
- Page Range / eLocation ID:
- 949 - 952
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Statistical hypotheses are translations of scientific hypotheses into statements about one or more distributions, often concerning their centre. Tests that assess statistical hypotheses of centre implicitly assume a specific centre, e.g., the mean or median. Yet, scientific hypotheses do not always specify a particular centre. This ambiguity leaves the possibility for a gap between scientific theory and statistical practice that can lead to rejection of a true null. In the face of replicability crises in many scientific disciplines, significant results of this kind are concerning. Rather than testing a single centre, this paper proposes testing a family of plausible centres, such as that induced by the Huber loss function. Each centre in the family generates a testing problem, and the resulting family of hypotheses constitutes a familial hypothesis. A Bayesian nonparametric procedure is devised to test familial hypotheses, enabled by a novel pathwise optimization routine to fit the Huber family. The favourable properties of the new test are demonstrated theoretically and experimentally. Two examples from psychology serve as real-world case studies.
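As a rough illustration of the family of centres this abstract describes, the sketch below (synthetic one-dimensional data; it implements neither the paper's Bayesian nonparametric test nor its pathwise solver) computes the Huber M-estimate for several robustification parameters delta, sweeping from a median-like centre (small delta) toward the mean (large delta). A null value lying outside the whole range would be implausible for every centre in the family.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def huber_loss(r, delta):
    """Huber loss: quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

def huber_centre(x, delta):
    """M-estimate of centre under the Huber loss with parameter delta."""
    res = minimize_scalar(lambda mu: huber_loss(x - mu, delta).mean(),
                          bounds=(x.min(), x.max()), method="bounded")
    return res.x

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 95), rng.normal(8.0, 1.0, 5)])
for delta in [0.1, 0.5, 1.0, 2.0, 10.0]:
    print(f"delta={delta:5.1f}  centre={huber_centre(x, delta):.3f}")
print(f"median={np.median(x):.3f}  mean={x.mean():.3f}")
```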
-
We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks: uniformity, identity, and closeness testing, whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors). Keywords: distribution testing, identity testing, closeness testing, differential privacy, learning-augmented algorithms
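A heavily simplified sketch of the private-testing setting follows; it is not the paper's learning-augmented algorithm (it ignores the auxiliary information entirely), and the rejection threshold is a hypothetical, uncalibrated parameter. It only shows the basic mechanics: privatize the sample histogram with the Laplace mechanism, then compare the noisy empirical distribution to the claimed one in total variation distance.

```python
import numpy as np

def private_identity_test(samples, q, eps, threshold, rng):
    """eps-DP test of H0: samples ~ q, over a domain of size len(q)."""
    k = len(q)
    counts = np.bincount(samples, minlength=k).astype(float)
    # Laplace mechanism: replacing one sample changes two counts by 1 each,
    # so the L1 sensitivity of the histogram is 2.
    counts += rng.laplace(scale=2.0 / eps, size=k)
    p_hat = np.clip(counts, 0.0, None)
    p_hat /= max(p_hat.sum(), 1e-12)
    stat = 0.5 * np.abs(p_hat - np.asarray(q)).sum()   # total variation
    return stat > threshold                            # True => reject H0

rng = np.random.default_rng(2)
q = np.full(10, 0.1)                                   # claimed distribution
print(private_identity_test(rng.integers(0, 10, 5000), q, 1.0, 0.1, rng))
```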
-
Students often find it challenging to learn about complex and abstract biological processes. Using the engineering design process, which involves designing, building, and testing prototypes, can help students visualize these processes and anchor ideas from lab activities. We describe an engineering-design-integrated biology unit for high school students in which they learn about the properties of slime molds, the difference between eukaryotes and prokaryotes, and the iterative nature of the engineering design process. Using the engineering design process, students succeeded in quarantining the slime mold from the non-inoculated oats. A t-test revealed statistically significant differences in students' understanding of slime mold characteristics, the difference between eukaryotes and prokaryotes, and the engineering design process before and after the unit. Overall, students demonstrated a sound understanding of the biology core ideas and engineering design skills inherent in this unit.
-
Meila, Marina, et al. (Eds.), Proceedings of Machine Learning Research vol. 139 (pmlr-v139-si21a), pages 9649-9659. We have developed a statistical testing framework to detect if a given machine learning classifier fails to satisfy a wide range of group fairness notions. Our test is a flexible, interpretable, and statistically rigorous tool for auditing whether exhibited biases are intrinsic to the algorithm or simply due to the randomness in the data. The statistical challenges, which may arise from multiple impact criteria that define group fairness and which are discontinuous on model parameters, are conveniently tackled by projecting the empirical measure onto the set of group-fair probability models using optimal transport. The resulting test statistic is efficiently computed using linear programming, and its asymptotic distribution is explicitly obtained. The proposed framework can also be used to test for composite fairness hypotheses and fairness with multiple sensitive attributes. The optimal transport testing formulation improves interpretability by characterizing the minimal covariate perturbations that eliminate the bias observed in the audit.
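For contrast with the optimal-transport audit above, here is a much simpler classical baseline (not the paper's method; synthetic data, demographic parity only): a permutation test that compares the gap in positive decision rates between two groups against its distribution under random relabelling of the sensitive attribute.

```python
import numpy as np

def parity_gap(decisions, group):
    """Absolute gap in positive decision rates between the two groups."""
    return abs(decisions[group == 0].mean() - decisions[group == 1].mean())

def permutation_test(decisions, group, n_perm=2000, seed=0):
    """Add-one smoothed p-value for H0: decisions independent of group."""
    rng = np.random.default_rng(seed)
    obs = parity_gap(decisions, group)
    hits = sum(parity_gap(decisions, rng.permutation(group)) >= obs
               for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
group = rng.integers(0, 2, 1000)
decisions = rng.binomial(1, np.where(group == 1, 0.55, 0.45))  # biased model
print("parity-gap p-value:", permutation_test(decisions, group))
```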