Title: Mitigating dataset harms requires stewardship: Lessons from 1000 papers
Machine learning datasets have elicited concerns about privacy, bias, and unethical applications, leading to the retraction of prominent datasets such as DukeMTMC, MS-Celeb-1M, and Tiny Images. In response, the machine learning community has called for higher ethical standards in dataset creation. To help inform these efforts, we studied three influential but ethically problematic face and person recognition datasets—Labeled Faces in the Wild (LFW), MS-Celeb-1M, and DukeMTMC—by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach to harm mitigation that considers the entire life cycle of a dataset.
Award ID(s):
1763642
NSF-PAR ID:
10312099
Author(s) / Creator(s):
Date Published:
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Machine learning datasets have elicited concerns about privacy, bias, and unethical applications, leading to the retraction of prominent datasets such as DukeMTMC, MS-Celeb-1M, and Tiny Images. In response, the machine learning community has called for higher ethical standards in dataset creation. To help inform these efforts, we studied three influential but ethically problematic face and person recognition datasets—Labeled Faces in the Wild (LFW), MS-Celeb-1M, and DukeMTMC—by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach to harm mitigation that considers the entire life cycle of a dataset.
  2. Abstract Benchmark datasets and benchmark problems have been a key aspect of the success of modern machine learning applications in many scientific domains. Consequently, an active discussion about benchmarks for applications of machine learning has also started in the atmospheric sciences. Such benchmarks allow for the comparison of machine learning tools and approaches in a quantitative way and enable a separation of concerns between domain scientists and machine learning scientists. However, a clear definition of benchmark datasets for weather and climate applications is missing, leaving many domain scientists confused. In this paper, we equip the domain of atmospheric sciences with a recipe for building proper benchmark datasets, present a (nonexclusive) list of domain-specific challenges for machine learning, and elaborate on where and what benchmark datasets will be needed to tackle these challenges. We hope that the creation of benchmark datasets will help the machine learning efforts in atmospheric sciences become more coherent and, at the same time, direct the efforts of machine learning scientists and high-performance computing experts toward the most imminent challenges in atmospheric sciences. We focus on benchmarks for atmospheric sciences (weather, climate, and air-quality applications), though many aspects of this paper also hold for, or are at least transferable to, other areas of the Earth system sciences. Significance Statement Machine learning is the study of computer algorithms that learn automatically from data. The atmospheric sciences have started to explore sophisticated machine learning techniques, and the community is making rapid progress on the uptake of new methods for a large number of application areas.
This paper provides a clear definition of so-called benchmark datasets for weather and climate applications that help to share data and machine learning solutions between research groups to reduce time spent in data processing, to generate synergies between groups, and to make tool developments more targeted and comparable. Furthermore, a list of benchmark datasets that will be needed to tackle important challenges for the use of machine learning in atmospheric sciences is provided. 
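As a toy illustration of the "recipe" idea in the benchmark abstract above: at minimum, a benchmark fixes a frozen train/test split, an agreed evaluation metric, and baseline scores to compare against. Everything below (the class name, sample identifiers, numbers, and the choice of RMSE) is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class BenchmarkSpec:
    """Hypothetical sketch of the ingredients a benchmark dataset might fix."""
    name: str
    train_ids: List[str]                    # frozen identifiers for training samples
    test_ids: List[str]                     # frozen, held-out identifiers for testing
    metric: Callable[[list, list], float]   # agreed-upon scoring function
    baselines: Dict[str, float] = field(default_factory=dict)

    def evaluate(self, predictions: list, targets: list) -> float:
        """Score test-set predictions with the benchmark's official metric."""
        return self.metric(predictions, targets)

def rmse(pred, true):
    # Root-mean-square error, a common metric for continuous forecasts.
    return (sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)) ** 0.5

spec = BenchmarkSpec(
    name="toy-temperature-forecast",
    train_ids=["2000", "2001", "2002"],
    test_ids=["2003"],
    metric=rmse,
    baselines={"climatology": 2.5},
)

score = spec.evaluate([14.0, 15.0], [13.0, 16.0])
print(round(score, 3))  # → 1.0
```

Freezing the split and metric in one shared object is what lets different research groups report directly comparable numbers.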
  3. Co-creation in higher education is the process where students collaborate with instructors in designing the curriculum and associated educational material. This can take place in different scenarios, such as integrating co-creation into an ongoing course, modifying a previously taken course, or creating a new course. In this Work-In-Progress, we investigate training and formative assessment models for preparing graduate students in engineering to participate as co-creators of educational material on an interdisciplinary topic. The topic of cyber-physical systems engineering and product lifecycle management with application to structural health monitoring is considered in this co-creation project. This entails not only topics from the disciplines of civil, computer, electrical, and environmental engineering, business, and information sciences, but also humanistic issues of sustainability, environment, and ethical and legal concerns in data-driven decision-making that support the control of cyber-physical systems. Aside from the objective of creating modules accessible to students with different levels of disciplinary knowledge, the goal of this research is to investigate whether the co-creation process and the resulting modules also promote interest and engagement in interdisciplinary research. A literature survey of effective training approaches for co-creation and associated educational theories is summarized. For students, essential training components include providing (i) opportunities to align their interests, knowledge, skills, and values with the topic presented; (ii) experiential learning on the topic to help develop and enhance critical thinking and question-posing skills; and (iii) safe spaces to reflect and voice their opinions, concerns, and suggestions.
In this research we investigate the adaptation of project-based learning (PjBL) strategies and practices to support items (i) and (ii), and focus groups for participatory action research (PAR) as safe spaces for reflection, feedback, and action in item (iii). The co-creation process is assessed through qualitative analysis of data collected through the PjBL activities and PAR focus groups, along with other qualitative data (focus group transcripts, interview transcripts, project materials, fieldnotes, etc.). The eventual outcome of the co-creation process will be an online course module designed to be integrated into existing engineering graduate and undergraduate courses at four institutions: two state universities and two historically Black colleges and universities.
  4. Phishing websites remain a persistent security threat. Thus far, machine learning approaches appear to have the best potential as defenses. But there are two main concerns with existing machine learning approaches for phishing detection. The first is the large number of training features used and the lack of validating arguments for these feature choices. The second is the type of datasets used in the literature, which are inadvertently biased with respect to features based on the website URL or content. To address these concerns, we put forward the intuition that the domain name of phishing websites is the tell-tale sign of phishing and holds the key to successful phishing detection. Accordingly, we design features that model the relationships, visual as well as statistical, of the domain name to the key elements of a phishing website, which are used to snare end-users. The main value of our feature design is that, to bypass detection, an attacker would find it very difficult to tamper with the visual content of the phishing website without arousing the suspicion of the end user. Our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to the state-of-the-art work, our per-instance classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards specific datasets. We show the robustness of our learning algorithm by testing on unknown live phishing URLs and achieve a high detection accuracy of 99.7%.
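The phishing abstract above does not list its seven features, so the sketch below uses stand-in lexical features of the domain name (length, digit count, hyphen count, subdomain depth, IP-address host) purely to illustrate the general idea of scoring the domain rather than the page content. The feature names and hand-picked weights are hypothetical, not trained and not from the paper.

```python
import re
from urllib.parse import urlparse

def domain_features(url: str) -> dict:
    """Extract simple lexical features from the host part of a URL."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return {
        "host_length": len(host),
        "num_digits": sum(ch.isdigit() for ch in host),
        "num_hyphens": host.count("-"),
        "subdomain_depth": max(len(labels) - 2, 0),  # labels beyond example.com
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
    }

def suspicion_score(feats: dict) -> float:
    # Toy hand-weighted score standing in for a trained classifier.
    return (0.02 * feats["host_length"]
            + 0.1 * feats["num_digits"]
            + 0.2 * feats["num_hyphens"]
            + 0.3 * feats["subdomain_depth"]
            + 1.0 * feats["has_ip_host"])

f = domain_features("http://secure-login.paypa1.example.com/verify")
print(f["subdomain_depth"], f["num_hyphens"], f["num_digits"])  # → 2 1 1
```

The intuition mirrors the abstract: lexical properties of the domain are hard for an attacker to change without also changing what the victim sees in the address bar.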
  5. This paper is an initial report on our fair AI design project by a small research team made up of anthropologists and computer scientists. Our collaborative project was developed in response to the recent debates on AI's ethical and social issues (Elish and boyd 2018). We share the understanding that "numbers don't speak for themselves," but that data enters into research projects already "fully cooked" (D'Ignazio and Klein 2020). Therefore, we take an anthropological approach to observing, recording, understanding, and reflecting upon the process of machine learning algorithm design, from the first steps of choosing and coding datasets for training and building algorithms. We tease apart the encoding of social-cultural paradigms in the generation and use of datasets in algorithm design and testing. By doing so, we rediscover the human in data, challenge the methodological and social assumptions in data use, and then adjust the model and parameters of our algorithms. This paper centers on tracing the social trajectory of the Correctional Offender Management Profiling for Alternative Sanctions dataset, known as the COMPAS dataset. This dataset contains data on over 10,000 criminal defendants in Broward County, Florida, in the U.S. Since its publication, it has become a benchmark dataset in the study of algorithmic fairness and was also used to design and train our algorithm for recidivism prediction. This paper presents our observation that data results from a complex set of social, political, and historical assumptions and circumstances, and demonstrates how the social trajectory of data can be taken into account in the design of AI as automated systems become more intricately woven into our daily lives.
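The COMPAS abstract above concerns algorithmic fairness in recidivism prediction. As a minimal sketch (not the authors' method), one check commonly run on COMPAS-style data is the gap in false positive rates between demographic groups, i.e., how often people who did not reoffend were nonetheless flagged as high risk. The records below are toy examples, not real COMPAS data.

```python
records = [
    # (group, predicted_high_risk, actually_reoffended)
    ("A", True,  False), ("A", False, False), ("A", True,  True), ("A", False, True),
    ("B", True,  False), ("B", True,  False), ("B", False, False), ("B", True,  True),
]

def false_positive_rate(rows):
    """FPR = fraction flagged high risk among those who did not reoffend."""
    negatives = [r for r in rows if not r[2]]
    flagged = [r for r in negatives if r[1]]
    return len(flagged) / len(negatives)

by_group = {g: false_positive_rate([r for r in records if r[0] == g])
            for g in ("A", "B")}
gap = abs(by_group["A"] - by_group["B"])

print(round(by_group["A"], 2), round(by_group["B"], 2), round(gap, 2))  # → 0.5 0.67 0.17
```

A nonzero gap alone does not settle whether a predictor is unfair; as the abstract argues, interpreting such numbers requires understanding the social circumstances under which the underlying data were produced.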