Title: Automating data science
Given the complexity of data science projects and related demand for human expertise, automation has the potential to transform the data science process.
Award ID(s):
1900644
NSF-PAR ID:
10355513
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Communications of the ACM
Volume:
65
Issue:
3
ISSN:
0001-0782
Page Range / eLocation ID:
76 to 87
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Web-browsing histories, online newspapers, streaming music, and stock prices all show that we live in an age of data. Extracting meaning from data is necessary in many fields to comprehend the information flow. This need has fueled rapid growth in data science education aiming to serve the next generation of policy makers, data science researchers, and global citizens. Initially, teaching practices have been drawn from data science's parent disciplines (e.g., computer science and mathematics). This project addresses the early stages of developing a concept inventory of student difficulty within the newly emerging field of data science. In particular, this project will address three primary research objectives: (1) identify student misconceptions in data science courses; (2) document students’ prior knowledge and identify courses that teach early data science concepts; and (3) confirm expert identification of data science concepts, and their importance for introductory-level data science curricula. During the first year of this grant, we have collected approximately 200 responses for a survey to confirm concepts from an existing body of knowledge presented by the Edison Project. Survey respondents consist of faculty and industry practitioners within data science and closely related fields. Preliminary analysis of these results will be presented with respect to our third research objective. In addition, we developed and launched a pilot assessment for identifying student difficulties within data science courses. The protocol includes regular responses to reflective questions by faculty, teaching assistants, and students from selected data science courses offered at the three participating institutions. Preliminary analyses will be presented along with implications for future data collection in year two of the project. In addition to the anticipated results, we expect that the data collection and analysis methodologies will be of interest to many scholars who have or will engage in discipline-based educational research.
  2. Science and engineering applications are now generating data at an unprecedented rate. From large facilities such as the Large Hadron Collider to portable DNA sequencing devices, these instruments can produce hundreds of terabytes in short periods of time. Researchers and other professionals rely on networks to transfer data between sensing locations, instruments, data storage devices, and computing systems. While general-purpose networks, also referred to as enterprise networks, are capable of transporting basic data, such as e-mails and Web content, they face numerous challenges when transferring terabyte- and petabyte-scale data. At best, transfers of science data on these networks may last days or even weeks. In response to this challenge, the Science Demilitarized Zone (Science DMZ) has been proposed. The Science DMZ is a network or a portion of a network designed to facilitate the transfer of big science data. The main elements of the Science DMZ include: 1) specialized end devices, referred to as data transfer nodes (DTNs), built for sending/receiving data at a high speed over wide area networks; 2) high-throughput, friction-free paths connecting DTNs, instruments, storage devices, and computing systems; 3) performance measurement devices to monitor end-to-end paths over multiple domains; and 4) security policies and enforcement mechanisms tailored for high-performance environments. Despite the increasingly important role of Science DMZs, the literature is still missing a guideline to provide researchers and other professionals with the knowledge to broaden the understanding and development of Science DMZs. This paper addresses this gap by presenting a comprehensive tutorial on Science DMZs. The tutorial reviews fundamental network concepts that have a large impact on Science DMZs, such as router architecture, TCP attributes, and operational security. Then, the tutorial delves into protocols and devices at different layers, from the physical cyberinfrastructure to application-layer tools and security appliances, that must be carefully considered for the optimal operation of Science DMZs. This paper also contrasts Science DMZs with general-purpose networks, and presents empirical results and use cases applicable to current and future Science DMZs.
  3. Prompted by the skyrocketing demand for data scientists, progress made by the ACM Data Science Task Force on defining data science competencies, and inquiries about data science accreditation, ABET is in the process of developing accreditation criteria for undergraduate data science programs. The effort is led by members of a joint data science criteria subcommittee appointed by ABET’s Computing Accreditation Commission (CAC) and CSAB (the lead society for computing accreditation). Establishing data science accreditation criteria is a notable milestone in the maturing data science discipline, indicating the presence of an accepted body of knowledge, standards of practice, and ethical codes for practitioners. This position paper motivates the effort and discusses prior work towards defining data science education requirements. It describes the ongoing process for creating and obtaining approval of the accreditation criteria, and how feedback was and will be solicited from the computing and statistical communities. The current draft data science criteria, which were approved in July 2020 by the relevant ABET bodies for a year of public review and comment, are presented. These criteria emphasize the three pillars of data science: computing foundations, mathematical/statistical foundations, and experience in at least one data application domain. This report thus serves both to inform and to stimulate the academic discussion needed to finalize appropriate data science accreditation by ABET.
  4. As technology advances, data-driven work is becoming increasingly important across all disciplines. Data science is an emerging field that encompasses a large array of topics including data collection, data preprocessing, data visualization, and data analysis using statistical and machine learning methods. As undergraduates enter the workforce in the future, they will need to “benefit from a fundamental awareness of and competence in data science”[9]. This project has formed a research-practice partnership that brings together STEM+C instructors and researchers from three universities and an education research and consulting group. We aim to use high-frequency monitoring data collected from real-world systems to develop and implement an interdisciplinary approach to enable undergraduate students to develop an understanding of data science concepts through individual STEM disciplines that include engineering, computer science, environmental science, and biology. In this paper, we perform an initial exploratory analysis on how data science topics are introduced into the different courses, with the ultimate goal of understanding how instructional modules and accompanying assessments can be developed for multidisciplinary use. We analyze information collected from instructor interviews and surveys, student surveys, and assessments from five undergraduate courses (243 students) at the three universities to understand aspects of data science curricula that are common across disciplines. Using a qualitative approach, we find commonalities in data science instruction and assessment components across the disciplines. These include topical content, data sources, pedagogical approaches, and assessment design. Preliminary analyses of instructor interviews also suggest factors that affect the content taught and the assessment material across the five courses. These factors include class size, students’ year of study, students’ reasons for taking the class, and students’ background expertise and knowledge. These findings indicate the challenges in developing data modules for multidisciplinary use. We hope that the analysis and reflections on our initial offerings have improved our understanding of these challenges, and how we may address them when designing future data science teaching modules. These are the first steps in a design-based approach to developing data science modules that may be offered across multiple courses.