
Title: Pathways and Structures: Evaluating Systems Changes in an NSF INCLUDES Alliance

In this article, we reflect on our experience applying a framework for evaluating systems change to an evaluation of a statewide West Virginia alliance funded by the National Science Foundation (NSF) to improve the early persistence of rural, first-generation, and other underrepresented minority science, technology, engineering, and mathematics (STEM) students in their programs of study. We begin with a description of the project and then discuss the two pillars around which we have built our evaluation of this project. Next, we present the challenge we confronted (despite the utility of our two pillars) in identifying and analyzing systems change, as well as the literature we consulted as we considered how to address this difficulty. Finally, we describe the framework we applied and examine how it helped us and where we still faced quandaries. Ultimately, this reflection serves two key purposes: (1) to consider a few of the challenges of measuring changes in systems, and (2) to discuss our experience applying one framework to address these issues.

Award ID(s):
1834569 1834586 1834575 1834595 1834601
NSF-PAR ID:
10361234
Journal Name:
American Journal of Evaluation
Volume:
43
Issue:
4
Page Range or eLocation-ID:
p. 632-646
ISSN:
1098-2140
Publisher:
SAGE Publications
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open-source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high-quality annotations of breast tissue. It is well known that state-of-the-art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. 
Not every pathological feature is annotated, meaning excluded areas can include foci particular to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest performing labels were background, 97% correct identification, and artifact, 76% correct identification. A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”) and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose from how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch, resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels. Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. 
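The label-balancing step described above (1,000 patches per label) can be sketched in a few lines of Python. This is a minimal illustration, assuming patches have already been extracted into an in-memory mapping from label to patch IDs; the label names and counts below are invented for the example and are not the corpus's actual storage format:

```python
import random

def balanced_training_set(patches_by_label, per_label=1000, seed=0):
    """Draw an equal quota of patches per label; a label with fewer
    patches than the quota contributes everything it has."""
    rng = random.Random(seed)
    balanced = []
    for label, patches in sorted(patches_by_label.items()):
        k = min(per_label, len(patches))
        balanced.extend((label, p) for p in rng.sample(patches, k))
    rng.shuffle(balanced)  # avoid label-ordered batches during training
    return balanced

# Toy inventory: plentiful background, scarcer artifact and normal patches
inventory = {"bckg": list(range(5000)), "art": list(range(1500)), "norm": list(range(800))}
train = balanced_training_set(inventory, per_label=1000)
```

The quota guards against the heavy class imbalance typical of whole-slide pathology images, where background patches dominate every other label.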
The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only in situations where a micropathologist diagnosed it as such. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature versus a cancerous growth, pathologists employ antigen targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The informative fields IHC slides provide could play an integral role in machine model pathology diagnostics. Following the revisions made on all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase of model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to the decrease of 57% in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. 
A CSV version of the annotation file is also available, providing a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are a total of 16,971 annotated regions, with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact), 6,222 are carcinogenic signs (inflammation, nonneoplastic, and suspicious), and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma). The individual patients are split into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 each were allotted to the development and evaluation sets, while the remaining 34 were allotted to train. The remaining 222 patients were split to preserve the overall distribution of labels within the corpus. This was done in the hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository (https://www.foxchase.org/research/facilities/genetic-research-facilities/biosample-repository-facility) are being digitized in addition to slides provided by Temple University Hospital. This data includes 18 different types of tissue, including approximately 38.5% urinary tissue and 16.5% gynecological tissue. 
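The patient-level partition described above (34/20/20 cancerous patients and final set sizes of 136/80/80) can be sketched as follows. This is a simplified sketch: it assigns the non-cancerous patients at random, whereas the actual corpus split also preserved the overall label distribution, and the patient identifiers here are invented:

```python
import random

def split_patients(cancerous, others, seed=0):
    """Patient-level split so no patient's slides leak across sets:
    34/20/20 cancerous patients to train/dev/eval, with the remaining
    patients filling the sets out to 136/80/80."""
    rng = random.Random(seed)
    c, o = cancerous[:], others[:]
    rng.shuffle(c)
    rng.shuffle(o)
    train = c[:34] + o[:102]            # 34 + 102 = 136 patients
    dev = c[34:54] + o[102:162]         # 20 + 60  = 80 patients
    evaluation = c[54:74] + o[162:222]  # 20 + 60  = 80 patients
    return train, dev, evaluation

cancerous_ids = [f"c{i:03d}" for i in range(74)]
other_ids = [f"o{i:03d}" for i in range(222)]
train, dev, evaluation = split_patients(cancerous_ids, other_ids)
```

Splitting by patient rather than by slide matters here: it prevents a model from being evaluated on slides from patients whose other slides it saw during training.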
These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnosis model could support pathologists’ workload and even help prioritize suspected cancerous cases. ACKNOWLEDGMENTS This material is supported by the National Science Foundation under grant nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. REFERENCES [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67–104. https://www.springer.com/gp/book/9783030368432. [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www.isip.piconepress.com/projects/nsf_dpath/. 
[3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036–5040. https://doi.org/10.21437/interspeech.2020-3015. [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344. https://ieeexplore.ieee.org/document/8675201. [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. [Accessed: 01-Aug-2021]. [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J. Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1–7. https://www.isip.piconepress.com/publications/conference_proceedings/2021/ieee_spmb/eeg_transfer_learning/. [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/nsf/mri_dpath/. [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4. https://ieeexplore.ieee.org/document/9037859. [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016. 
https://doi.org/10.5858/arpa.2015-0238-OA.
  2. ABSTRACT This is a cross-institution project involving four Institutes of Technology in Ireland. The objective of this project is to assess the use of technology to enhance the assessment of laboratory sessions in Science and Health. In science, health, and engineering, laboratory sessions are at the core of the learning process for skill development. These laboratory sessions focus on skills acquisition. The Irish Institute of Technology sector, in particular, develops these skills and considers them essential for ‘professionally ready’ graduates. In terms of student progression and retention, the assessment structure has been identified as having a significant impact on student engagement. The Technology Enhanced Assessment Methods (TEAM) project, led by Dundalk Institute of Technology and partnering with Institute of Technology Sligo, Athlone Institute of Technology, and Institute of Technology Carlow, is exploring the potential offered by digital technologies to address these concerns. It aims to develop a framework for applying the principles of effective assessment and feedback to practical assessment. The TEAM project also aims to facilitate dialogue among stakeholders about what it is we want students to learn in laboratory sessions and how our assessment can facilitate this. A peer network of discipline-specific academics and students in the Science and Health field has been established across all four Institutes. As the network focuses on authentic skills assessment in all core modules, including physics and chemistry, the best practice from this project will inform future assessment procedures across laboratory sessions and may be considered for application within a Science and Materials Engineering context. Assessing the skills acquired in this environment takes many forms. 
Using student and stakeholder feedback along with an extensive literature review of the area, the team identified key technologies that cut across science and health disciplines, with the potential to influence and enable the learning process. The emphasis was on developing a powerful learning environment approach to enable students to deepen their learning through engagement with the process. The areas identified are: (i) Pre-practical preparation (videos and quizzes), (ii) Electronic laboratory notebooks and ePortfolios, (iii) Digital feedback technologies, and (iv) Rubrics. This paper describes the student experience and perceptions of the adoption of digital technology in science practical assessments. It also describes the process involved in setting up the pilot structure and presents the initial results from the student survey.
  3. Our research, Landscapes of Deep Time in the Red Earth of France (NSF International Research Experience for Students project), aims to mentor U.S. undergraduate science students from underserved populations (e.g., students of Native American heritage and/or first-generation college students) in geological research. During the first field season (June 2018), formative and summative assessments (outlined below) will be issued to assist in our evaluation of student learning. The material advancement of a student's sedimentological skillsets and self-efficacy development in research applications are a direct measure of our program's success. (1) Immediately before and after the program, students will self-rank their competency in specific skillsets (e.g., data collection, lithologic description, use of field equipment) in an anonymous summative assessment. (2) Formative assessments throughout the field season (e.g., describing stratigraphic section independently, oral and written communication of results) will assess improved comprehension of the scientific process. (3) An anonymous attitudinal survey will be issued at the conclusion of the field season to shed light on the program's quality as a whole, its influence on student desire to pursue a higher-level degree/career in STEM, and the effectiveness of the program in aiding the development of participant confidence and self-efficacy in research design and application. We discuss herein the results of first-year assessments with a focus on strategies for improvement. We expect each individual's outcomes to differ depending on his/her own characteristics and background. Furthermore, some of the most valued intentions of this experience are inherently difficult to measure (e.g., improved understanding of the scientific process, a stimulated passion to pursue a STEM career). We hope to address shortcomings in design, e.g.: Where did we lose visibility on certain aspects of the learning experience? 
How can we revise the format and content of our assessment to better evaluate student participants and improve our program in subsequent years?
  4. Need/Motivation (e.g., goals, gaps in knowledge) The ESTEEM project implemented a STEM capacity-building effort through students’ early access to sustainable and innovative STEM Stepping Stones called Micro-Internships (MIs). The goal is to reap key benefits of a full-length internship and undergraduate research experience in an abbreviated format, including access, success, degree completion, transfer, and recruiting and retaining more Latinx and underrepresented students into the STEM workforce. The MIs are designed with the goal of providing students at a community college and HSI with authentic STEM research and applied learning experiences (ALEs), support for an appropriate STEM pathway/career, and the preparation and confidence to succeed in STEM and engage in summer-long REUs, with improved outcomes. The MI projects are accessible early to more students and build momentum to better overcome critical obstacles to success. The MIs are shorter, flexibly scheduled throughout the year, and easily accessible, and participation in multiple MIs is encouraged. ESTEEM also establishes a sustainable and collaborative model, working with partners from BSCS Science Education, for MI mentor training, compliance, and capacity building, with shared values and practices to maximize the improvement of student outcomes. New Knowledge (e.g., hypothesis, research questions) Research indicates that REU/internship experiences can be particularly powerful for students from Latinx and underrepresented groups in STEM. However, those experiences are difficult to access for many HSI community college students (85% of our students hold off-campus jobs), and lack of confidence is a barrier for a majority of our students. 
The gap between those who can and those who cannot is the “internship access gap.” This project is at a central California Community College (CCC) and HSI, the only affordable post-secondary option in a region serving a historically underrepresented population in STEM (75% Hispanic, 87% of whom have not completed college). The MI is designed to reduce inequalities inherent in the internship paradigm by providing access to professional and research skills for these underserved students. The MI has been designed to reduce barriers by offering: shorter duration (25 contact hours); flexible timing (one week to once a week over many weeks); open access/large group; and proximal location (on-campus). MI mentors participate in week-long summer workshops and an ongoing monthly community of practice with the goal of co-constructing a shared vision, engaging in conversations about pedagogy and learning, and sustaining the MI program going forward. Approach (e.g., objectives/specific aims, research methodologies, and analysis) Research Question and Methodology: We want to know: How does participation in a micro-internship affect students’ interest and confidence to pursue STEM? We used a mixed-methods design triangulating quantitative Likert-style survey data with interpretive coding of open responses to reveal themes in students’ motivations, attitudes toward STEM, and confidence. Participants: The study sampled students enrolled either part-time or full-time at the community college. Although each MI was classified within STEM, they were open to any interested student in any major. Demographically, participants self-identified as 70% Hispanic/Latinx, 13% Mixed-Race, and 42% female. Instrument: Student surveys were developed from two previously validated instruments that examine the impact of the MI intervention on student interest in STEM careers and pursuing internships/REUs. 
Also, the pre- and post-surveys (every e months to assess longitudinal outcomes) included relevant open-response prompts. The surveys collected students’ demographics; interest, confidence, and motivation in pursuing a career in STEM; perceived obstacles; and past experiences with internships and MIs. 171 students responded to the pre-survey at the time of submission. Outcomes (e.g., preliminary findings, accomplishments to date) Because we just finished year 1, we lack longitudinal data at this time to reveal whether student confidence is maintained over time and whether or not students are more likely to (i) enroll in more internships, (ii) transfer to a four-year university, or (iii) shorten the time it takes for degree attainment. For short-term outcomes, students significantly increased their confidence to continue pursuing opportunities to develop within the STEM pipeline, including full-length internships, completing STEM degrees, and applying for jobs in STEM. For example, using a two-tailed t-test we compared means before and after the MI experience. 15 out of 16 questions that showed improvement in scores were related to student confidence to pursue STEM or perceived enjoyment of a STEM career. Findings from the free-response questions showed that the majority of students reported enrolling in the MI to gain knowledge and experience. After the MI, 66% of students reported having gained valuable knowledge and experience, and 35% of students spoke about gaining confidence and/or momentum to pursue STEM as a career. 
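The pre/post comparison of means described above can be computed as a paired two-tailed t-test. Below is a standard-library sketch with hypothetical 5-point Likert responses; the study's actual instrument and data differ:

```python
import math
from statistics import mean, stdev

def paired_t(pre, post):
    """Paired t-statistic and degrees of freedom for the two-tailed
    test that the mean pre-to-post change differs from zero."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical confidence ratings (1-5) before and after a micro-internship
pre = [3, 2, 4, 3, 3, 2, 4, 3]
post = [4, 3, 4, 4, 4, 3, 5, 4]
t_stat, df = paired_t(pre, post)  # positive t indicates a gain
```

The resulting statistic would then be compared against the t distribution with df degrees of freedom (or handed to a library routine such as scipy.stats.ttest_rel) to obtain the two-tailed p-value.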
Broader Impacts (e.g., the participation of underrepresented minorities in STEM; development of a diverse STEM workforce; enhanced infrastructure for research and education) The ESTEEM project has the potential for a transformational impact on access and success in STEM undergraduate education for underrepresented and Latinx community college students, as well as for STEM capacity building at Hartnell College, a CCC and HSI, for students, faculty, professionals, and processes that foster research in STEM and education. Through sharing and transfer of the ESTEEM model to similar institutions, the project has the potential to change the way students are served at an early and critical stage of their higher education experience at CCCs: one in every five community college students in the nation attends a CCC, over 67% of CCC students identify with ethnic backgrounds that are not White, and 40 to 50% of University of California and California State University graduates in STEM started at a CCC, making CCCs a key leverage point for recruiting and retaining a more diverse STEM workforce.
  5. Security is a critical aspect in the design, development, and testing of software systems. Due to the increasing need for security-related skills within software systems and engineering, there is a growing demand for these skills to be taught at the university level. A series of 41 security modules was developed to assess the impact of these modules on teaching critical cyber security topics to students. This paper presents the implementation and outcomes of the first set of six security modules in a freshman-level course. This set consists of five modules presented in lectures as well as a sixth module emphasizing encryption and decryption, used as the semester project for the course. Each module is a collection of concepts related to cyber security. The individual cyber security concepts are presented with a general description of a security issue to avoid, sample code with the security issue written in the Java programming language, and a second version of the code with an effective solution. The set of these modules was implemented in Computer Science I during the Fall 2019 semester. Incorporating each of the concepts in these modules into lectures depends on both the topic covered and the approach to resolving the related security issue. Students were introduced to computing concepts related to both the security issue and the appropriate solution to fully grasp the overall concept. After presenting the materials to students, continual review with students is also essential. This review process requires exploring use-cases for the programming mechanisms presented as solutions to the security issues discussed. In addition to the security modules presented in lectures, students were given a hands-on approach to understanding the concepts through Model-Eliciting Activities (MEAs). MEAs are open-ended, problem-solving activities in which groups of three to four students work to solve realistic complex problems in a classroom setting. 
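The modules pair vulnerable sample code with a corrected version, as described above. The originals are written in Java; the sketch below shows the same issue-and-fix pattern in Python for one common class of security issue, SQL injection (the table, data, and function names are invented for the example and are not from the actual modules):

```python
import sqlite3

def find_user_unsafe(conn, username):
    # SECURITY ISSUE: interpolating input into SQL lets a crafted
    # string change the query's meaning (SQL injection)
    query = f"SELECT id FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_safe(conn, username):
    # SOLUTION: a parameterized query treats the input strictly as data
    return conn.execute("SELECT id FROM users WHERE name = ?", (username,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
crafted = "nobody' OR '1'='1"
# With the crafted input, the unsafe version matches every row,
# while the safe version matches none.
```

Presenting both versions side by side mirrors the modules' structure: students first see why the issue arises, then the language mechanism that resolves it.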
The semester project related to encryption and decryption was implemented in the course as an MEA. To assess the effectiveness of incorporating security modules with the MEA project into the curriculum of Computer Science I, two sections of the course were used as a control group and a treatment group. The treatment group included the security modules in lectures and the MEA project, while the control group did not. To measure the overall effectiveness of incorporating security modules with the MEA project, both the instructor’s effectiveness and the students’ attitudes and interest were measured. For instructors, the primary question was to what extent instructors change their attitudes toward student learning and their teaching practices because of the implementation of cyber security modules through MEAs. For students, the primary question was whether the inclusion of security modules with the MEA project improved their understanding of the course materials and their interest in computer science. After implementing security modules with the MEA project, students showed a better understanding of cyber security concepts and a greater interest in broader computer science concepts. The instructor’s beliefs about teaching, learning, and assessment shifted from teacher-centered to student-centered during his experience with the security modules and MEAs.