Title: Using ChatGPT as a tool for training nonprogrammers to generate genomic sequence analysis code
Abstract: Today, due to the size of many genomes and the increasingly large sizes of sequencing files, independently analyzing sequencing data is largely impossible for a biologist with little to no programming expertise. As such, biologists typically face the dilemma of either spending a significant amount of time and effort learning to program themselves or identifying (and relying on) an available computer scientist to analyze large sequence data sets. That said, the advent of AI‐powered programs like ChatGPT may offer a means of circumventing the disconnect between biologists and the analysis of genomic data critically important to their field. The work detailed herein demonstrates how integrating ChatGPT into an existing Course‐based Undergraduate Research Experience curriculum can equip biology students with no programming expertise to generate their own programs and carry out a publishable, comprehensive analysis of real‐world Next Generation Sequencing (NGS) datasets. Relying solely on the students' biology background as a prompt for directing ChatGPT to generate Python code, we found students could readily generate programs able to handle and analyze NGS datasets greater than 10 gigabytes. In summary, we believe that integrating ChatGPT into education can help bridge a critical gap between biology and computer science and may prove similarly beneficial in other disciplines. Additionally, ChatGPT can provide biological researchers with powerful new tools capable of mediating NGS dataset analysis to help accelerate major new advances in the field.
Award ID(s):
2219900 2243532
PAR ID:
10613933
Publisher / Repository:
Wiley Periodicals LLC
Date Published:
Journal Name:
Biochemistry and Molecular Biology Education
ISSN:
1470-8175
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The publication of the first human genome in 2001 transformed biomedical research. Since then, an explosion of new sequencing technologies has required engineers and computer scientists to invent computational methods to analyze and interpret the ever-growing data. Now, large-scale biological data encompasses many types of ‘omics’ datasets, including genomes, transcriptomes, proteomes and metabolomes, and each of these new datasets has created a new set of analytical challenges. To meet this need, the field of bioinformatics has expanded significantly, but there is still a large need for engineers and scientists to work in this inherently interdisciplinary field. Properly trained bioinformaticians have expertise in computer science/engineering and understand the biological and medical context underlying their work. Therefore, the development of robust bioinformatics training programs is critical to educate the next generation of bioinformaticians. Although undergraduate degree programs in bioinformatics exist, providing students with hands-on bioinformatics skills through immersive research experiences is necessary to prepare students for graduate work. Thus, this work describes a recently funded NSF – International Research Experience for Students (IRES) site: US-Sweden Clinical Bioinformatics Research Training Program targeted at training students from diverse educational backgrounds to prepare them for authentic bioinformatics research experiences. Given the inherent interdisciplinary nature of bioinformatics, it is extremely difficult to design a training program that prepares students from different backgrounds (computer science, bioengineering, computational biology, biology) to be successful in a bioinformatics research group. Therefore, this ‘Work-in-Progress’ describes the pre-departure training program developed for this IRES site and the initial lessons learned.
  2. This work-in-progress paper presents a study that sheds light on the concern that, with the use of ChatGPT, students may not develop sufficient programming skills and, as a result, may be less competent. The potential benefits for students are significant: access to ChatGPT increases the ability of students to work constructively on their own schedule, and its ease of use may engage students who might otherwise hesitate to ask for support. Before these tools can be meaningfully introduced into a course, work must be done to study the impact of these AI tools on a student's ability to learn. In this study, participants are recruited from introductory Java programming courses at a large public university in the United States. This paper presents preliminary findings from a mixed-methods study design that consists of a pre-task assessment quiz; a programming task in one of three conditions: (1) with no external help, (2) with the help of an AI chatbot, or (3) with the help of a generative AI tool like GitHub Copilot; and a post-task assessment and an interview on participants' experience and perceptions of the tools. Our preliminary findings describe our data collection, a thematic analysis of the students' prompts and ChatGPT responses, and a summary of the experience for 3 students. Our findings demonstrate a range of student attitudes and behaviors towards ChatGPT that provides insight for future research and plans for incorporating such AI tools in a course.
  3. Software product line engineering is a best practice for managing reuse in families of software systems. In this work, we explore the use of product line engineering in the emerging programming domain of synthetic biology. In synthetic biology, living organisms are programmed to perform new functions or improve existing functions. These programs are designed and constructed using small building blocks made out of DNA. We conjecture that there are families of products that consist of common and variable DNA parts, and we can leverage product line engineering to help synthetic biologists build, evolve, and reuse these programs. As a first step towards this goal, we perform a domain engineering case study that leverages an open-source repository of more than 45,000 reusable DNA parts. We are able to identify features and their related artifacts, all of which can be composed to make different programs. We demonstrate that we can successfully build feature models representing families for two commonly engineered functions. We then analyze an existing synthetic biology case study and demonstrate how product line engineering can be beneficial in this domain. 
  4. This Work-in-Progress paper in the Research Category uses a retrospective mixed-methods study to better understand the factors that mediate learning of computational modeling by life scientists. Key stakeholders, including leading scientists, universities and funding agencies, have promoted computational modeling to enable life sciences research and improve the translation of genetic and molecular biology high-throughput data into clinical results. Software platforms to facilitate computational modeling by biologists who lack advanced mathematical or programming skills have had some success, but none has achieved widespread use among life scientists. Because computational modeling is a core engineering skill of value to other STEM fields, it is critical for engineering and computer science educators to consider how we help students from across STEM disciplines learn computational modeling. Currently we lack sufficient research on how best to help life scientists learn computational modeling. To address this gap, in 2017, we observed a short-format summer course designed for life scientists to learn computational modeling. The course used a simulation environment designed to lower programming barriers. We used semi-structured interviews to understand students' experiences while taking the course and in applying computational modeling after the course. We conducted interviews with graduate students and postdoctoral researchers who had completed the course. We also interviewed students who took the course between 2010 and 2013. Among these past attendees, we selected equal numbers of interview subjects who had and had not successfully published journal articles that incorporated computational modeling. This Work-in-Progress paper applies social cognitive theory to analyze the motivations of life scientists who seek training in computational modeling and their attitudes towards computational modeling. Additionally, we identify important social and environmental variables that influence successful application of computational modeling after course completion. The findings from this study may therefore help us educate biomedical and biological engineering students more effectively. Although this study focuses on life scientists, its findings can inform engineering and computer science education more broadly. Insights from this study may be especially useful in aiding incoming engineering and computer science students who do not have advanced mathematical or programming skills and in preparing undergraduate engineering students for collaborative work with life scientists.
  5. The ability to predict student performance in introductory programming courses is important to help struggling students and enhance their persistence. However, for this prediction to be impactful, it is crucial that it remains transparent and accessible for both instructors and students, ensuring effective utilization of the predicted results. Machine learning models with explainable features provide an effective means for students and instructors to comprehend students' diverse programming behaviors and problem-solving strategies, elucidating the factors contributing to both successful and suboptimal performance. This study develops an explainable model that predicts student performance based on programming assignment submission information in different stages of the course to enable early explainable predictions. We extract data-driven features from student programming submissions and utilize a stacked ensemble model for predicting final exam grades. The experimental results suggest that our model successfully predicts student performance based on their programming submissions earlier in the semester. Employing SHAP, a game-theory-based framework, we explain the model's predictions, aiding stakeholders in understanding the influence of diverse programming behaviors on students' success. Additionally, we analyze crucial features, employing a mix of descriptive statistics and mixture models to identify distinct student profiles based on their problem-solving patterns, enhancing overall explainability. Furthermore, we dive deeper and analyze the profiles using different programming patterns of the students to elucidate the characteristics of different students where SHAP explanations are not comprehensible. Our explainable early prediction model elucidates common problem-solving patterns in students relative to their expertise, facilitating effective intervention and adaptive support. 