Title: Benchmark data repositories for better benchmarking
In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for—and levies criticisms at—data and benchmarking practices in machine learning, comparatively less attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking. This role includes addressing issues with both datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning.
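One concrete way a benchmark data repository supports reproducible evaluation, as discussed in the abstract above, is by serving versioned datasets with stable identifiers so that different papers can evaluate against exactly the same data under a fixed protocol. The sketch below illustrates this idea with scikit-learn's OpenML fetcher and a fixed cross-validation setup; the dataset name, version, and model are illustrative assumptions, not taken from the paper.

# Minimal sketch: reproducible benchmarking against a versioned repository dataset.
# The dataset name/version and the model choice are illustrative placeholders.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pinning a specific dataset version guards against silent changes to the data.
data = fetch_openml(name="phoneme", version=1, as_frame=True)
X, y = data.data, data.target

# A fixed CV protocol (fold count + seed) makes the reported number comparable across papers.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")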
Koch, Bernard; Denton, Emily; Hanna, Alex; Foster, Jacob G.
(Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks)
Vanschoren, Joaquin; Yeung, Serena
(Ed.)
Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015-2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.
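The concentration described in this abstract (usage piling up on fewer and fewer datasets within a task community) can be quantified with a standard inequality measure over per-dataset usage counts, such as the Gini coefficient. The sketch below is a generic illustration of that idea with made-up counts; it is not necessarily the exact methodology used in the paper.

# Minimal sketch: measuring how concentrated benchmark usage is across datasets.
# Usage counts below are hypothetical; the Gini coefficient is one standard concentration measure.
import numpy as np

def gini(counts):
    """Gini coefficient of non-negative counts (0 = usage spread evenly, ~1 = all usage on one dataset)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard formula based on the ranked cumulative totals.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

# Hypothetical usage counts for datasets within one task community, in two different years.
usage_2015 = [40, 35, 30, 25, 20, 15, 10]
usage_2020 = [180, 40, 15, 10, 5, 3, 2]
print(f"Gini 2015: {gini(usage_2015):.2f}, Gini 2020: {gini(usage_2020):.2f}")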
Dueben, Peter D.; Schultz, Martin G.; Chantry, Matthew; Gagne, David John; Hall, David Matthew; McGovern, Amy
(Artificial Intelligence for the Earth Systems)
Abstract: Benchmark datasets and benchmark problems have been a key aspect of the success of modern machine learning applications in many scientific domains. Consequently, an active discussion about benchmarks for machine learning applications has also started in the atmospheric sciences. Such benchmarks allow machine learning tools and approaches to be compared quantitatively and enable a separation of concerns between domain scientists and machine learning scientists. However, a clear definition of benchmark datasets for weather and climate applications is missing, leaving many domain scientists confused. In this paper, we equip the domain of atmospheric sciences with a recipe for building proper benchmark datasets, present a (nonexclusive) list of domain-specific challenges for machine learning, and elaborate where and what benchmark datasets will be needed to tackle these challenges. We hope that the creation of benchmark datasets will help machine learning efforts in the atmospheric sciences become more coherent and, at the same time, target the efforts of machine learning scientists and high-performance-computing experts toward the most imminent challenges in the atmospheric sciences. We focus on benchmarks for the atmospheric sciences (weather, climate, and air-quality applications), although many aspects of this paper also hold for, or are at least transferable to, other Earth system sciences.
Significance Statement: Machine learning is the study of computer algorithms that learn automatically from data. The atmospheric sciences have started to explore sophisticated machine learning techniques, and the community is making rapid progress on the uptake of new methods for a large number of application areas. This paper provides a clear definition of so-called benchmark datasets for weather and climate applications, which help research groups share data and machine learning solutions, reduce time spent on data processing, generate synergies between groups, and make tool development more targeted and comparable. Furthermore, a list of benchmark datasets that will be needed to tackle important challenges for the use of machine learning in the atmospheric sciences is provided.
Abdullah Algahtani, Hoda El-Sayed
(International Journal of Intelligent Systems and Applications in Engineering)
Abstract: Newer technologies such as data mining, machine learning, artificial intelligence, and data analytics have revolutionized the medical sector by using existing big data to predict patterns emerging from the datasets available in healthcare repositories. Predictions based on these datasets provide several benefits, such as helping clinicians make accurate and informed decisions when managing patients' health, leading to better management of patient wellbeing and care coordination. Millions of people are affected by coronary artery disease (CAD). Several machine learning approaches, including ensemble learning and deep neural network-based algorithms, have shown promising results in improving prediction accuracy for early diagnosis of CAD. This paper analyzes the deep neural network variant DRN, the Rider Optimization Algorithm-Neural Network (RideNN), and the Deep Neural Network-Fuzzy Neural Network (DNFN), combined via an ensemble learning method, to improve the prediction accuracy of CAD. Experimental results showed that the proposed ensemble classifier achieved the highest accuracy compared to the other machine learning models. Keywords: heart disease prediction, Deep Residual Network (DRN), ensemble classifiers, coronary artery disease.
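The abstract above reports that combining several neural models in an ensemble outperformed the individual classifiers. The sketch below shows the general pattern of a soft-voting ensemble in scikit-learn; the base estimators and data are placeholders (generic MLPs on synthetic data), since the paper's specific DRN, RideNN, and DNFN models are not reproduced here.

# Minimal sketch of a soft-voting ensemble classifier with placeholder base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Three differently configured networks stand in for the paper's heterogeneous base models.
ensemble = VotingClassifier(
    estimators=[
        ("mlp_small", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
        ("mlp_deep", MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1)),
        ("mlp_wide", MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=2)),
    ],
    voting="soft",  # average predicted class probabilities rather than hard labels
)
ensemble.fit(X_tr, y_tr)
print(f"ensemble test accuracy: {ensemble.score(X_te, y_te):.3f}")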
Abstract: Many have argued that datasets resulting from scientific research should be part of the scholarly record as first-class research products. Data sharing mandates from funding agencies and scientific journal publishers, along with calls from the scientific community to better support transparency and reproducibility of scientific research, have increased demand for tools and support for publishing datasets. Hydrology domain-specific data publication services have been developed alongside more general-purpose and even commercial data repositories. Prominent among these are the Hydrologic Information System (HIS) and HydroShare repositories developed by the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI). More broadly, however, multiple organizations have been involved in the practice of data publication in the hydrology domain, each having different roles that have shaped data publication and reuse. Bibliographic and archival approaches to data publication have been advanced, but both have limitations with respect to hydrologic data. Specific recommendations for improving data publication infrastructure, support, and practices to move beyond existing limitations and enable more effective data publication in support of scientific research in the hydrology domain include: improving support for journal article-based data access and data citation, considering the workflow for data publication, enhancing support for reproducible science, encouraging publication of curated reference data collections, advancing interoperability standards for sharing data and metadata among repositories, developing partnerships with university libraries offering data services, and developing more specific data management plans. While presented in the context of CUAHSI's data repositories and experience, these recommendations are broadly applicable to other domains. This article is categorized under: Science of Water > Methods
Rinberg, Roy; Puigdemont, Pol; Pawelczyk, Martin; Cevher, Volkan
(ICML 2025 Workshop on Machine Unlearning for Generative AI, https://openreview.net/group?id=ICML.cc/2025/Workshop/MUGen)
Evaluating machine unlearning methods remains technically challenging, with recent benchmarks requiring complex setups and significant engineering overhead. We introduce a unified and extensible benchmarking suite that simplifies the evaluation of unlearning algorithms using the KLoM (KL divergence of Margins) metric. Our framework provides precomputed model ensembles, oracle outputs, and streamlined infrastructure for running evaluations out of the box. By standardizing setup and metrics, it enables reproducible, scalable, and fair comparison across unlearning methods. We aim for this benchmark to serve as a practical foundation for accelerating research and promoting best practices in machine unlearning. Our code and data are publicly available.
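As a rough illustration of the kind of comparison a margin-based KL metric implies, the sketch below estimates, per example, the KL divergence between Gaussian fits to the margin distributions of an "unlearned" model ensemble and an oracle (retrained-from-scratch) ensemble, then averages over examples. This is only an interpretation of the metric's name; the precise KLoM definition, the margin formula, and any distributional assumptions should be taken from the authors' released code rather than from this sketch.

# Rough sketch of a margin-based KL comparison between two model ensembles.
# Interpretation of "KL divergence of Margins" only; not the benchmark's actual implementation.
import numpy as np

def margins(logits, labels):
    """Correct-class logit minus best competing logit, per example.
    logits: (n_models, n_examples, n_classes); labels: (n_examples,)."""
    n_models, n_examples, _ = logits.shape
    correct = logits[:, np.arange(n_examples), labels]
    masked = logits.copy()
    masked[:, np.arange(n_examples), labels] = -np.inf
    return correct - masked.max(axis=2)  # shape (n_models, n_examples)

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(N(mu_p, var_p) || N(mu_q, var_q)), elementwise."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def kl_of_margins(unlearned_logits, oracle_logits, labels, eps=1e-6):
    """Average per-example KL between Gaussian fits to the two ensembles' margin distributions."""
    m_u = margins(unlearned_logits, labels)
    m_o = margins(oracle_logits, labels)
    kl = gaussian_kl(m_u.mean(axis=0), m_u.var(axis=0) + eps,
                     m_o.mean(axis=0), m_o.var(axis=0) + eps)
    return kl.mean()

# Toy usage with random logits from hypothetical ensembles of 8 models, 100 examples, 10 classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=100)
print(kl_of_margins(rng.normal(size=(8, 100, 10)), rng.normal(size=(8, 100, 10)), labels))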
Longjohn, Rachel, Kelly, Markelle, Singh, Sameer, and Smyth, Padhraic. Benchmark data repositories for better benchmarking. Retrieved from https://par.nsf.gov/biblio/10635518.
Longjohn, Rachel, Kelly, Markelle, Singh, Sameer, and Smyth, Padhraic.
"Benchmark data repositories for better benchmarking". Country unknown/Code not available: Neural Information Processing Systems (NeurIPS). https://par.nsf.gov/biblio/10635518.
@article{osti_10635518,
title = {Benchmark data repositories for better benchmarking},
url = {https://par.nsf.gov/biblio/10635518},
abstractNote = {In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for—and levies criticisms at—data and benchmarking practices in machine learning, comparatively less attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking. This role includes addressing issues with both datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning.},
publisher = {Neural Information Processing Systems (NeurIPS)},
author = {Longjohn, Rachel and Kelly, Markelle and Singh, Sameer and Smyth, Padhraic},
}