skip to main content


Title: Towards computational reproducibility: researcher perspectives on the use and sharing of software

Research software, which includes both source code and executables used as part of the research process, presents a significant challenge for efforts aimed at ensuring reproducibility. In order to inform such efforts, we conducted a survey to better understand the characteristics of research software as well as how it is created, used, and shared by researchers. Based on the responses of 215 participants, representing a range of research disciplines, we found that researchers create, use, and share software in a wide variety of forms for a wide variety of purposes, including data collection, data analysis, data visualization, data cleaning and organization, and automation. More participants indicated that they use open source software than commercial software. While a relatively small number of programming languages (e.g., Python, R, JavaScript, C++, MATLAB) are used by a large number, there is a long tail of languages used by relatively few. Between-group comparisons revealed that significantly more participants from computer science write source code and create executables than participants from other disciplines. Differences between researchers from computer science and other disciplines related to the knowledge of best practices of software creation and sharing were not statistically significant. While many participants indicated that they draw a distinction between the sharing and preservation of software, related practices and perceptions were often not aligned with those of the broader scholarly communications community.

 
more » « less
NSF-PAR ID:
10075375
Author(s) / Creator(s):
 ;  
Publisher / Repository:
PeerJ
Date Published:
Journal Name:
PeerJ Computer Science
Volume:
4
ISSN:
2376-5992
Page Range / eLocation ID:
e163
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, Iyad ; Picone, Joseph ; Selesnick, Ivan (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing a large open source database of high-resolution digital pathology images known as the Temple University Digital Pathology Corpus (TUDP) [1]. Our long-term goal is to release one million images. We expect to release the first 100,000 image corpus by December 2020. The data is being acquired at the Department of Pathology at Temple University Hospital (TUH) using a Leica Biosystems Aperio AT2 scanner [2] and consists entirely of clinical pathology images. More information about the data and the project can be found in Shawki et al. [3]. We currently have a National Science Foundation (NSF) planning grant [4] to explore how best the community can leverage this resource. One goal of this poster presentation is to stimulate community-wide discussions about this project and determine how this valuable resource can best meet the needs of the public. The computing infrastructure required to support this database is extensive [5] and includes two HIPAA-secure computer networks, dual petabyte file servers, and Aperio’s eSlide Manager (eSM) software [6]. We currently have digitized over 50,000 slides from 2,846 patients and 2,942 clinical cases. There is an average of 12.4 slides per patient and 10.5 slides per case with one report per case. The data is organized by tissue type as shown below: Filenames: tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_0a001_00123456_lvl0001_s000.svs tudp/v1.0.0/svs/gastro/000001/00123456/2015_03_05/0s15_12345/0s15_12345_00123456.docx Explanation: tudp: root directory of the corpus v1.0.0: version number of the release svs: the image data type gastro: the type of tissue 000001: six-digit sequence number used to control directory complexity 00123456: 8-digit patient MRN 2015_03_05: the date the specimen was captured 0s15_12345: the clinical case name 0s15_12345_0a001_00123456_lvl0001_s000.svs: the actual image filename consisting of a repeat of the case name, a site code (e.g., 0a001), the type and depth of the cut (e.g., lvl0001) and a token number (e.g., s000) 0s15_12345_00123456.docx: the filename for the corresponding case report We currently recognize fifteen tissue types in the first installment of the corpus. The raw image data is stored in Aperio’s “.svs” format, which is a multi-layered compressed JPEG format [3,7]. Pathology reports containing a summary of how a pathologist interpreted the slide are also provided in a flat text file format. A more complete summary of the demographics of this pilot corpus will be presented at the conference. Another goal of this poster presentation is to share our experiences with the larger community since many of these details have not been adequately documented in scientific publications. There are quite a few obstacles in collecting this data that have slowed down the process and need to be discussed publicly. Our backlog of slides dates back to 1997, meaning there are a lot that need to be sifted through and discarded for peeling or cracking. Additionally, during scanning a slide can get stuck, stalling a scan session for hours, resulting in a significant loss of productivity. Over the past two years, we have accumulated significant experience with how to scan a diverse inventory of slides using the Aperio AT2 high-volume scanner. We have been working closely with the vendor to resolve many problems associated with the use of this scanner for research purposes. This scanning project began in January of 2018 when the scanner was first installed. The scanning process was slow at first since there was a learning curve with how the scanner worked and how to obtain samples from the hospital. From its start date until May of 2019 ~20,000 slides we scanned. In the past 6 months from May to November we have tripled that number and how hold ~60,000 slides in our database. This dramatic increase in productivity was due to additional undergraduate staff members and an emphasis on efficient workflow. The Aperio AT2 scans 400 slides a day, requiring at least eight hours of scan time. The efficiency of these scans can vary greatly. When our team first started, approximately 5% of slides failed the scanning process due to focal point errors. We have been able to reduce that to 1% through a variety of means: (1) best practices regarding daily and monthly recalibrations, (2) tweaking the software such as the tissue finder parameter settings, and (3) experience with how to clean and prep slides so they scan properly. Nevertheless, this is not a completely automated process, making it very difficult to reach our production targets. With a staff of three undergraduate workers spending a total of 30 hours per week, we find it difficult to scan more than 2,000 slides per week using a single scanner (400 slides per night x 5 nights per week). The main limitation in achieving this level of production is the lack of a completely automated scanning process, it takes a couple of hours to sort, clean and load slides. We have streamlined all other aspects of the workflow required to database the scanned slides so that there are no additional bottlenecks. To bridge the gap between hospital operations and research, we are using Aperio’s eSM software. Our goal is to provide pathologists access to high quality digital images of their patients’ slides. eSM is a secure website that holds the images with their metadata labels, patient report, and path to where the image is located on our file server. Although eSM includes significant infrastructure to import slides into the database using barcodes, TUH does not currently support barcode use. Therefore, we manage the data using a mixture of Python scripts and manual import functions available in eSM. The database and associated tools are based on proprietary formats developed by Aperio, making this another important point of community-wide discussion on how best to disseminate such information. Our near-term goal for the TUDP Corpus is to release 100,000 slides by December 2020. We hope to continue data collection over the next decade until we reach one million slides. We are creating two pilot corpora using the first 50,000 slides we have collected. The first corpus consists of 500 slides with a marker stain and another 500 without it. This set was designed to let people debug their basic deep learning processing flow on these high-resolution images. We discuss our preliminary experiments on this corpus and the challenges in processing these high-resolution images using deep learning in [3]. We are able to achieve a mean sensitivity of 99.0% for slides with pen marks, and 98.9% for slides without marks, using a multistage deep learning algorithm. While this dataset was very useful in initial debugging, we are in the midst of creating a new, more challenging pilot corpus using actual tissue samples annotated by experts. The task will be to detect ductal carcinoma (DCIS) or invasive breast cancer tissue. There will be approximately 1,000 images per class in this corpus. Based on the number of features annotated, we can train on a two class problem of DCIS or benign, or increase the difficulty by increasing the classes to include DCIS, benign, stroma, pink tissue, non-neoplastic etc. Those interested in the corpus or in participating in community-wide discussions should join our listserv, nedc_tuh_dpath@googlegroups.com, to be kept informed of the latest developments in this project. You can learn more from our project website: https://www.isip.piconepress.com/projects/nsf_dpath. 
    more » « less
  2. null (Ed.)
    The DeepLearningEpilepsyDetectionChallenge: design, implementation, andtestofanewcrowd-sourced AIchallengeecosystem Isabell Kiral*, Subhrajit Roy*, Todd Mummert*, Alan Braz*, Jason Tsay, Jianbin Tang, Umar Asif, Thomas Schaffter, Eren Mehmet, The IBM Epilepsy Consortium◊ , Joseph Picone, Iyad Obeid, Bruno De Assis Marques, Stefan Maetschke, Rania Khalaf†, Michal Rosen-Zvi† , Gustavo Stolovitzky† , Mahtab Mirmomeni† , Stefan Harrer† * These authors contributed equally to this work † Corresponding authors: rkhalaf@us.ibm.com, rosen@il.ibm.com, gustavo@us.ibm.com, mahtabm@au1.ibm.com, sharrer@au.ibm.com ◊ Members of the IBM Epilepsy Consortium are listed in the Acknowledgements section J. Picone and I. Obeid are with Temple University, USA. T. Schaffter is with Sage Bionetworks, USA. E. Mehmet is with the University of Illinois at Urbana-Champaign, USA. All other authors are with IBM Research in USA, Israel and Australia. Introduction This decade has seen an ever-growing number of scientific fields benefitting from the advances in machine learning technology and tooling. More recently, this trend reached the medical domain, with applications reaching from cancer diagnosis [1] to the development of brain-machine-interfaces [2]. While Kaggle has pioneered the crowd-sourcing of machine learning challenges to incentivise data scientists from around the world to advance algorithm and model design, the increasing complexity of problem statements demands of participants to be expert data scientists, deeply knowledgeable in at least one other scientific domain, and competent software engineers with access to large compute resources. People who match this description are few and far between, unfortunately leading to a shrinking pool of possible participants and a loss of experts dedicating their time to solving important problems. Participation is even further restricted in the context of any challenge run on confidential use cases or with sensitive data. Recently, we designed and ran a deep learning challenge to crowd-source the development of an automated labelling system for brain recordings, aiming to advance epilepsy research. A focus of this challenge, run internally in IBM, was the development of a platform that lowers the barrier of entry and therefore mitigates the risk of excluding interested parties from participating. The challenge: enabling wide participation With the goal to run a challenge that mobilises the largest possible pool of participants from IBM (global), we designed a use case around previous work in epileptic seizure prediction [3]. In this “Deep Learning Epilepsy Detection Challenge”, participants were asked to develop an automatic labelling system to reduce the time a clinician would need to diagnose patients with epilepsy. Labelled training and blind validation data for the challenge were generously provided by Temple University Hospital (TUH) [4]. TUH also devised a novel scoring metric for the detection of seizures that was used as basis for algorithm evaluation [5]. In order to provide an experience with a low barrier of entry, we designed a generalisable challenge platform under the following principles: 1. No participant should need to have in-depth knowledge of the specific domain. (i.e. no participant should need to be a neuroscientist or epileptologist.) 2. No participant should need to be an expert data scientist. 3. No participant should need more than basic programming knowledge. (i.e. no participant should need to learn how to process fringe data formats and stream data efficiently.) 4. No participant should need to provide their own computing resources. In addition to the above, our platform should further • guide participants through the entire process from sign-up to model submission, • facilitate collaboration, and • provide instant feedback to the participants through data visualisation and intermediate online leaderboards. The platform The architecture of the platform that was designed and developed is shown in Figure 1. The entire system consists of a number of interacting components. (1) A web portal serves as the entry point to challenge participation, providing challenge information, such as timelines and challenge rules, and scientific background. The portal also facilitated the formation of teams and provided participants with an intermediate leaderboard of submitted results and a final leaderboard at the end of the challenge. (2) IBM Watson Studio [6] is the umbrella term for a number of services offered by IBM. Upon creation of a user account through the web portal, an IBM Watson Studio account was automatically created for each participant that allowed users access to IBM's Data Science Experience (DSX), the analytics engine Watson Machine Learning (WML), and IBM's Cloud Object Storage (COS) [7], all of which will be described in more detail in further sections. (3) The user interface and starter kit were hosted on IBM's Data Science Experience platform (DSX) and formed the main component for designing and testing models during the challenge. DSX allows for real-time collaboration on shared notebooks between team members. A starter kit in the form of a Python notebook, supporting the popular deep learning libraries TensorFLow [8] and PyTorch [9], was provided to all teams to guide them through the challenge process. Upon instantiation, the starter kit loaded necessary python libraries and custom functions for the invisible integration with COS and WML. In dedicated spots in the notebook, participants could write custom pre-processing code, machine learning models, and post-processing algorithms. The starter kit provided instant feedback about participants' custom routines through data visualisations. Using the notebook only, teams were able to run the code on WML, making use of a compute cluster of IBM's resources. The starter kit also enabled submission of the final code to a data storage to which only the challenge team had access. (4) Watson Machine Learning provided access to shared compute resources (GPUs). Code was bundled up automatically in the starter kit and deployed to and run on WML. WML in turn had access to shared storage from which it requested recorded data and to which it stored the participant's code and trained models. (5) IBM's Cloud Object Storage held the data for this challenge. Using the starter kit, participants could investigate their results as well as data samples in order to better design custom algorithms. (6) Utility Functions were loaded into the starter kit at instantiation. This set of functions included code to pre-process data into a more common format, to optimise streaming through the use of the NutsFlow and NutsML libraries [10], and to provide seamless access to the all IBM services used. Not captured in the diagram is the final code evaluation, which was conducted in an automated way as soon as code was submitted though the starter kit, minimising the burden on the challenge organising team. Figure 1: High-level architecture of the challenge platform Measuring success The competitive phase of the "Deep Learning Epilepsy Detection Challenge" ran for 6 months. Twenty-five teams, with a total number of 87 scientists and software engineers from 14 global locations participated. All participants made use of the starter kit we provided and ran algorithms on IBM's infrastructure WML. Seven teams persisted until the end of the challenge and submitted final solutions. The best performing solutions reached seizure detection performances which allow to reduce hundred-fold the time eliptologists need to annotate continuous EEG recordings. Thus, we expect the developed algorithms to aid in the diagnosis of epilepsy by significantly shortening manual labelling time. Detailed results are currently in preparation for publication. Equally important to solving the scientific challenge, however, was to understand whether we managed to encourage participation from non-expert data scientists. Figure 2: Primary occupation as reported by challenge participants Out of the 40 participants for whom we have occupational information, 23 reported Data Science or AI as their main job description, 11 reported being a Software Engineer, and 2 people had expertise in Neuroscience. Figure 2 shows that participants had a variety of specialisations, including some that are in no way related to data science, software engineering, or neuroscience. No participant had deep knowledge and experience in data science, software engineering and neuroscience. Conclusion Given the growing complexity of data science problems and increasing dataset sizes, in order to solve these problems, it is imperative to enable collaboration between people with differences in expertise with a focus on inclusiveness and having a low barrier of entry. We designed, implemented, and tested a challenge platform to address exactly this. Using our platform, we ran a deep-learning challenge for epileptic seizure detection. 87 IBM employees from several business units including but not limited to IBM Research with a variety of skills, including sales and design, participated in this highly technical challenge. 
    more » « less
  3. The landscape of research in science and engineering is heavily reliant on computation and data processing. There is continued and expanded usage by disciplines that have historically used advanced computing resources, new usage by disciplines that have not traditionally used HPC, and new modalities of the usage in Data Science, Machine Learning, and other areas of AI. Along with these new patterns have come new advanced computing resource methods and approaches, including the availability of commercial cloud resources. The Coalition for Academic Scientific Computation (CASC) has long been an advocate representing the needs of academic researchers using computational resources, sharing best practices and offering advice to create a national cyberinfrastructure to meet US science, engineering, and other academic computing needs. CASC has completed the first of what we intend to be an annual survey of academic cloud and data center usage and practices in analyzing return on investment in cyberinfrastructure. Critically important findings from this first survey include the following: many of the respondents are engaged in some form of analysis of return in research computing investments, but only a minority currently report the results of such analyses to their upper-level administration. Most respondents are experimenting with use of commercial cloud resources but no respondent indicated that they have found use of commercial cloud services to create financial benefits compared to their current methods. There is clear correlation between levels of investment in research cyberinfrastructure and the scale of both cpu core-hours delivered and the financial level of supported research grants. Also interesting is that almost every respondent indicated that they participate in some sort of national cooperative or nationally provided research computing infrastructure project and most were involved in academic computing-related organizations, indicating a high degree of engagement by institutions of higher education in building and maintaining national research computing ecosystems. Institutions continue to evaluate cloud-based HPC service models, despite having generally concluded that so far cloud HPC is too expensive to use compared to their current methods. 
    more » « less
  4. In September 2019, the fourth and final workshop on the Future of Mechatronics and Robotics Education (FoMRE) was held at a Lawrence Technological University in Southfield, MI. This workshop was organized by faculty at several universities with financial support from industry partners and the National Science Foundation. The purpose of the workshops was to create a cohesive effort among mechatronics and robotics courses, minors and degree programs. Mechatronics and Robotics Engineering (MRE) is an integration of mechanics, controls, electronics, and software, which provides a unique opportunity for engineering students to function on multidisciplinary teams. Due to its multidisciplinary nature, it attracts diverse and innovative students, and graduates better-prepared professional engineers. In this fast growing field, there is a great need to standardize educational material and make MRE education more widely available and easier to adopt. This can only be accomplished if the community comes together to speak with one clear voice about not only the benefits, but also the best ways to teach it. These efforts would also aid in establishing more of these degree programs and integrating minors or majors into existing computer science, mechanical engineering, or electrical engineering departments. The final workshop was attended by approximately 50 practitioners from industry and academia. Participants identified many practical skills required for students to succeed in an MRE curriculum and as practicing engineers after graduation. These skills were then organized into the following categories: professional, independent learning, controller design, numerical simulation and analysis, electronics, software development, and system design. For example, professional skills include technical reports, presentations, and documentation. Independent learning includes reading data sheets, performing internet searches, doing a literature review, and having a maker mindset. Numerical simulation skills include understanding data, presenting data graphically, solving and simulating in software such as MATLAB, Simulink and Excel. Controller design involves selecting a controller, tuning a controller, designing to meet specifications, and understanding when the results are good enough. Electronics skills include selecting sensors, interfacing sensors, interfacing actuators, creating printed circuit boards, wiring on a breadboard, soldering, installing drivers, using integrated circuits, and using microcontrollers. Software development of embedded systems includes agile program design, state machines, analyzing and evaluating code results, commenting code, troubleshooting, debugging, AI and machine learning. Finally, system design includes prototyping, creating CAD models, design for manufacturing, breaking a system down into subsystems, integrating and interfacing subcomponents, having a multidisciplinary perspective, robustness, evaluating tradeoffs, testing, validation, and verification, failure, effect, and mode analysis. A survey was prepared and sent out to the participants from all four workshops as well as other robotics faculty, researchers and industry personnel in order to elicit a broader community response. Because one of the biggest challenges in mechatronics and robotics education is the absence of standardized curricula, textbooks, platforms, syllabi, assignments, and learning outcomes, this was a vital part of the process to achieve some level of consensus. This paper presents an introduction to MRE education, related work on existing programs, methods, results of the practical skills survey, and then draws conclusions based upon these results. It aims to create the foundation for standardizing the development of student skills in mechatronics and robotics curricula across institutions, disciplines, majors and minors. The survey was completed by 94 participants and it was clear that there is a consensus that the primary skills students should have upon completion of MRE courses or a program is a broader multidisciplinary systems-level perspective, an ability to problem solve, and an ability to design a system to meet specifications. 
    more » « less
  5. Nowadays, cyberattack incidents are happening on a daily basis. As a result, the demand for a larger and more challenging workforce is increasing. To handle this demand, academic institutions offer cybersecurity courses and degree programs into their curricula; however, more efforts are needed to address the high demand of the cybersecurity workforce. This work aims to bridge the gap between workforce shortage and the number of qualified graduates to fill the positions. We approach this by introducing cybersecurity concepts at the early stage of undergraduate curricula of computer science and engineering programs. Secure programming is critical as many cybersecurity incidents happen due to software vulnerabilities. However, most UG-level programming courses pay little attention to secure programming practices. As a result, many students graduate with limited knowledge of security vulnerabilities that might plague the developed software. Our goal in this work is to introduce secure programming at introductory level programming courses so that students should be aware of cybersecurity issues and use this security mindset in advanced level courses and projects in their degree programs. To accomplish this goal, we developed intuitive and interactive modules emphasizing secure programming in C++ and Java courses to help students become secure software developers. These modules will be used alongside the coursework to emphasize certain vulnerabilities within the programming environment of a specific language and allow students to learn cybersecurity topics, enforcing a solid foundation and understanding. We developed cybersecurity educational modules for C++ and Java as they are amongst the popular languages and used in introductory programming courses. While designing these modules, we kept in mind that the topics must be relevant to real-world issues in the software industry. We used a variety of resources and benchmarks to ensure the authenticity of our chosen topics, including Common Weakness Enumeration (CWE) and Common Vulnerability and Exposures (CVE). While choosing module topics to develop, we had some restrictions. For example, the topics must be introductory and easy to understand. These modules are geared towards freshman or sophomore-level UG students who have just started programming. The developed security modules have four components: power-point slides, lab description, code template for the lab, and complete solution. The complete solution for each module will be provided to the instructors to check students’ work if they adopt the modules in their courses. The modules developed for a C++ programming course include labs on input validation, integer overflow, random number generation, function call with incorrect argument type, and dangling pointers. In Java, we developed lab modules for input validation, integer overflow, null object reference, random number generator, and data encapsulation. 
    more » « less