


Title: Toward Scientific Evidence Standards in Empirical Computer Science
Many scientific fields use formally established evidence standards during the peer-review and evaluation process, such as the Consolidated Standards of Reporting Trials (CONSORT) in medical research, the What Works Clearinghouse (WWC) in education in the United States, and the APA Journal Article Reporting Standards (JARS) in psychology. The basis for these standards is community agreement on what to report in empirical studies. Such standards achieve two key goals. First, through transparent reporting and data sharing, they make it easier to compare studies and facilitate replications, which can provide confidence that multiple research teams can obtain the same results. Second, they establish community agreement on how to report on and evaluate studies that use different methodologies. The discipline of computer science has no formalized evidence standards, even for its major conferences and journals.

This Dagstuhl Seminar had three primary objectives:

1. To establish a process for creating a new evidence standard, or adopting an existing one, for empirical research in computer science.
2. To build a community of scholars that can discuss what a general standard should include.
3. To kickstart the discussion with scholars from software engineering, human-computer interaction, and computer science education.

To better discuss and understand the implications of such standards across several empirical subfields of computer science, and to facilitate adoption, we brought together participants from a range of backgrounds, including academia and industry; software engineering, human-computer interaction, and computer science education; and representatives from several prominent journals.
Award ID(s): 2121993
NSF-PAR ID: 10446715
Author(s) / Creator(s):
Date Published:
Journal Name: Dagstuhl Reports
Volume: 12
Issue: 10
ISSN: 2192-5283
Page Range / eLocation ID: 225-240
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like This
  1. The report documents the program and outcomes of Dagstuhl Seminar 18061 "Evidence About Programmers for Programming Language Design". The seminar brought together a diverse group of researchers from the fields of computer science education, programming languages, software engineering, human-computer interaction, and data science. At the seminar, participants discussed methods for designing and evaluating programming languages that take the needs of programmers directly into account. The seminar included foundational talks to introduce the breadth of perspectives that were represented among the participants; then, groups formed to develop research agendas for several subtopics, including novice programmers, cognitive load, language features, and love of programming languages. The seminar concluded with a discussion of the current SIGPLAN artifact evaluation mechanism and the need for evidence standards in empirical studies of programming languages. 
  2. Who ensures, and by what means, that engineering education evolves to meet the ever-changing needs of our society? This and other papers presented by our research team at this conference offer our initial set of findings from an NSF-sponsored collaborative study on engineering education reform. Organized around the notion of higher education governance and the practice of educational reform, our open-ended study is based on semi-structured interviews at over three dozen universities and engineering professional societies and organizations, along with a handful of scholars engaged in engineering education research. Organized as a multi-site, multi-scale study, our goal is to document the differences in perspective and interest that exist across organizational levels and institutions, and to describe the coordination that occurs (or fails to occur) in engineering education given the distributed structure of the engineering profession.

This paper offers engineering educators and administrators a qualitative and retrospective analysis of ABET EC 2000 and its implementation. The paper opens with historical background on the Engineers' Council for Professional Development (ECPD) and engineering accreditation; the rise of quantitative standards during the 1950s as a result of the push to implement an engineering science curriculum appropriate to the Cold War era; EC 2000 and its call for greater emphasis on professional skill sets amid concerns about US manufacturing productivity and national competitiveness; the development of outcomes assessment and its implementation; and the successive negotiations about assessment practice and the training of both program evaluators and assessment coordinators for the degree programs undergoing evaluation. It was these negotiations and the evolving practice of assessment that resulted in the latest set of changes in ABET engineering accreditation criteria ("1-7" versus "a-k").

To provide insight into the origins of EC 2000, the paper describes the "Gang of Six," a group of individuals loyal to ABET who used the pressure exerted by external organizations, along with a shared rhetoric of national competitiveness, to forge a common vision organized around an expanded emphasis on professional skill sets. It was also significant that the Gang of Six was aware that the regional accreditation agencies were already contemplating a shift toward outcomes assessment; several of its members also had backgrounds in industrial engineering. However, this resulted in an assessment protocol for EC 2000 that remained ambiguous about whether the stated learning outcomes (Criterion 3) were something faculty had to demonstrate for all of their students, or whether EC 2000's main emphasis was continuous improvement. When it proved difficult to demonstrate learning outcomes on the part of all students, ABET itself began to place greater emphasis on total quality management and continuous process improvement (TQM/CPI). This gave institutions an opening to begin using increasingly limited and proximate measures for the "a-k" student outcomes as evidence of effort and improvement. In what social scientists would describe as "tactical" resistance to perceived oppressive structures, this enabled ABET coordinators and the faculty in charge of degree programs, many of whom had their own internal improvement processes, to begin referring to the a-k criteria as "difficult to achieve" and "ambiguous," which they sometimes were.

Inconsistencies in evaluation outcomes enabled those most discontented with the a-k student outcomes to use ABET's own organizational processes to drive the latest revisions to the EAC accreditation criteria, although the organization's own process for member and stakeholder input ultimately restored much of the professional skill sets found in the original EC 2000 criteria. Other refinements were also made to the standard, including a new emphasis on diversity. That said, many within our interview population believe that EC 2000 had already achieved many of the changes it set out to achieve, especially with regard to broader professional skills such as communication, teamwork, and design. Regular faculty review of curricula is now also a more routine part of the engineering education landscape. While programs vary in their engagement with ABET, many are skeptical about whether the new criteria will produce further improvements to their programs, with many arguing that their own internal processes are now the primary drivers of change.
  3. The Standards for Educational and Psychological Testing were developed by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (AERA et al., 2014). The Standards specify that assessment developers establish five types of validity evidence: test content, response processes, internal structure, relationships to other variables, and consequential/bias. Relevant to this proposal is consequential validity evidence, which identifies the potential negative impact of testing, or bias. Standard 3.1 of The Standards (2014) on fairness in testing states that "those responsible for test development, revision, and administration should design all steps of the testing process to promote valid score interpretations for intended score uses for the widest possible range of individuals and relevant sub-groups in the intended populations" (p. 63). Three types of bias are construct, method, and item bias (Boer et al., 2018). Testing for differential item functioning (DIF) is a standard analysis adopted to detect item bias against a subgroup (Boer et al., 2018). Example subgroups include gender, race/ethnic group, socioeconomic status, native language, and disability. DIF occurs when "equally able test takers differ in their probabilities answering a test item correctly as a function of group membership" (AERA et al., 2005, p. 51). DIF indicates systematic error, as opposed to real mean group differences (Camilli & Shepard, 1994). Items exhibiting significant DIF are removed, or are reviewed for the sources of bias in order to determine modifications that allow the item to be retained and tested further. The Delphi technique is an emergent systematic research method whereby expert panel members review item content through an iterative process (Yildirim & Büyüköztürk, 2018). Experts independently evaluate each item for potential sources of DIF, researchers group their responses, and experts then independently complete a survey to rate their level of agreement with the anonymously grouped responses. This process continues until saturation and consensus are reached among experts, as established through some criterion (e.g., median agreement rating, item interquartile range, and percent agreement). The technique allows researchers to "identify, learn, and share the ideas of experts by searching for agreement among experts" (Yildirim & Büyüköztürk, 2018, p. 451). Prior research has illustrated this technique applied after DIF is detected, but not before administering items in the field. The current research is a methodological illustration of the Delphi technique applied in the item-construction phase of assessment development, as part of a five-year study to develop and test new problem-solving measures (PSM; Bostic et al., 2015, 2017) for U.S. grades 6-8 in a computer-adaptive testing environment. As part of an iterative design-science-based methodology (Middleton et al., 2008), we illustrate the integration of the Delphi technique into the item-writing process. Results from two three-person panels, each reviewing a set of 45 PSM items, are used to illustrate the technique. Advantages and limitations identified through a survey of participating experts and researchers are outlined to advance the method.
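The DIF screening this abstract refers to is commonly operationalized with the Mantel-Haenszel procedure. The abstract does not say which DIF method the study used, so the following is a minimal sketch only, assuming the standard Mantel-Haenszel formulation; all names below are illustrative:

```python
import numpy as np

def mantel_haenszel_dif(responses, group, matching_score):
    """Mantel-Haenszel DIF check for one dichotomous item.

    responses      : 0/1 item scores, one per examinee
    group          : 0 = reference group, 1 = focal group (e.g., subgroups
                     defined by gender, race/ethnicity, or native language)
    matching_score : ability proxy (typically total test score) used to
                     stratify examinees of comparable ability
    """
    responses = np.asarray(responses)
    group = np.asarray(group)
    matching_score = np.asarray(matching_score)

    num, den = 0.0, 0.0
    for s in np.unique(matching_score):
        in_stratum = matching_score == s
        r, g = responses[in_stratum], group[in_stratum]
        a = np.sum((g == 0) & (r == 1))  # reference group, correct
        b = np.sum((g == 0) & (r == 0))  # reference group, incorrect
        c = np.sum((g == 1) & (r == 1))  # focal group, correct
        d = np.sum((g == 1) & (r == 0))  # focal group, incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t

    alpha = num / den              # pooled (common) odds ratio across strata
    delta = -2.35 * np.log(alpha)  # ETS delta scale: |delta| < 1.0 is usually
                                   # treated as negligible DIF, >= 1.5 as large
    return alpha, delta
```

An item flagged by such a screen would then go to expert review, such as the Delphi panels described above, to locate the source of the bias before the item is revised or dropped.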
  4. For several years, the software engineering research community has used eye trackers to study program comprehension, bug localization, pair programming, and other software engineering tasks. Eye trackers provide researchers with insights into software engineers' cognitive processes, data that can augment those acquired through other means, such as online surveys and questionnaires. While there are many ways to take advantage of eye trackers, advancing their use requires defining standards for experimental design, execution, and reporting. We begin by presenting the foundations of eye tracking to provide context and perspective. Based on previous surveys of eye tracking for programming and software engineering tasks, and on our collective, extensive experience with eye trackers, we discuss when, why, and how researchers should use eye trackers. We compile a list of typical use cases of eye trackers, both real and anticipated, as well as metrics, visualizations, and statistical analyses for analyzing and reporting eye-tracking data. We also discuss the pragmatics of eye-tracking studies. Finally, we offer lessons learned about using eye trackers to study software engineering tasks. This paper is intended to be a one-stop resource for researchers interested in designing, executing, and reporting eye-tracking studies of software engineering tasks.
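To make the reporting concern concrete, here is a minimal sketch, not taken from the paper, of how two of the most commonly reported eye-tracking metrics, fixation count and total dwell time per area of interest (AOI), can be computed. The Fixation record and the AOI bounding boxes are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    x: float            # horizontal screen position, in pixels
    y: float            # vertical screen position, in pixels
    duration_ms: float  # how long the gaze rested at this point

def aoi_metrics(fixations, aois):
    """Aggregate fixation count and dwell time per area of interest.

    aois maps an AOI name (e.g., a region of a code listing) to its
    bounding box (x_min, y_min, x_max, y_max) in screen pixels.
    """
    metrics = {name: {"fixation_count": 0, "dwell_time_ms": 0.0}
               for name in aois}
    for f in fixations:
        for name, (x0, y0, x1, y1) in aois.items():
            if x0 <= f.x <= x1 and y0 <= f.y <= y1:
                metrics[name]["fixation_count"] += 1
                metrics[name]["dwell_time_ms"] += f.duration_ms
    return metrics

# Hypothetical study: compare attention to a method's signature vs. its body.
aois = {"signature": (0, 0, 800, 100), "body": (0, 100, 800, 600)}
fixations = [Fixation(120, 80, 250.0),
             Fixation(130, 85, 180.0),
             Fixation(140, 300, 410.0)]
print(aoi_metrics(fixations, aois))
# {'signature': {'fixation_count': 2, 'dwell_time_ms': 430.0},
#  'body': {'fixation_count': 1, 'dwell_time_ms': 410.0}}
```

Reporting which AOIs were defined, and how fixations were assigned to them, is exactly the kind of detail that the standards the paper calls for would make comparable across studies.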
  5. With support from NSF Scholarships in Science, Technology, Engineering, and Mathematics (S-STEM), the Culturally Adaptive Pathway to Success (CAPS) program aims to build an inclusive pathway that accelerates graduation for academically talented, low-income students in Engineering and Computer Science majors at [University Name], which traditionally serves underrepresented and educationally disadvantaged minority students in the [City Name area]. CAPS focuses on progressively developing social and career competence in our students via three integrated interventions: (1) Mentor+, a relationally informed advising strategy that encourages students to see their academic work in relation to their families and communities; (2) peer cohorts, which provide a social support structure for students and enhance their sense of belonging in engineering and computer science classrooms and beyond; and (3) professional development from faculty who have been trained in difference-education theory, so that they can support students with varying levels of understanding of the antecedents of college success. To ensure the success of these interventions, the CAPS program places great emphasis on developing culturally responsive advisement methods and on training faculty mentors to help create a culture of culturally adaptive advising. This paper presents the progress of CAPS over the past two project years. In particular, we share several changes made after the first project year to improve key components of the program: recruitment, cohort building, and mentor training. The program strengthened recruitment by actively involving scholars and faculty in reaching out to students, and it successfully recruited more scholars for the second cohort (16 scholars) than for the first (12 scholars). The program has also initiated new peer-mentoring and cohort-gathering activities within each major. As part of the continuing development of mentor training, the program has added a training session on aspects of intersectionality as it relates to individuals' social identities, and on how mentors can use this knowledge to interact better with mentees. In addition to these changes, we report findings on how the program has affected scholars' academic growth and mentors' understanding of culturally adaptive advisement, addressing the CAPS research questions: (a) how these interventions affect the development of social belonging and engineering identity among CAPS scholars, and (b) the impact of Mentor+ on academic resilience and progress to degree. The program conducted qualitative data collection and analysis via focus-group meetings and interviews, as well as quantitative data collection and analysis using academic records and surveys. Our findings will help enhance the CAPS program and establish a sustainable Scholars Support Program at the university, one that can be implemented with scholarships funded by other sources and transferred to similarly culturally diverse institutions to increase success for students facing socioeconomic challenges.