This content will become publicly available on May 5, 2026

Title: Using ChatGPT as a tool for training nonprogrammers to generate genomic sequence analysis code
Abstract: Today, due to the size of many genomes and the increasingly large sizes of sequencing files, independently analyzing sequencing data is largely impossible for a biologist with little to no programming expertise. As such, biologists typically face a dilemma: either spend a significant amount of time and effort learning to program themselves, or identify (and rely on) an available computer scientist to analyze large sequence data sets. That said, the advent of AI‐powered programs like ChatGPT may offer a means of circumventing the disconnect between biologists and the analysis of genomic data that is critically important to their field. The work detailed herein demonstrates how integrating ChatGPT into an existing Course‐based Undergraduate Research Experience curriculum can equip biology students with no programming expertise with the power to generate their own programs and allow those students to carry out a publishable, comprehensive analysis of real‐world Next Generation Sequencing (NGS) datasets. Relying solely on the students' biology background as a prompt for directing ChatGPT to generate Python code, we found students could readily generate programs able to handle and analyze NGS datasets greater than 10 gigabytes. In summary, we believe that integrating ChatGPT into education can help bridge a critical gap between biology and computer science and may prove similarly beneficial in other disciplines. Additionally, ChatGPT can provide biological researchers with powerful new tools capable of mediating NGS dataset analysis to help accelerate major new advances in the field.
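As an illustration of the kind of Python program the abstract refers to, the following is a minimal sketch and not code from the study. It assumes the input is a FASTQ file with the common four-line record layout (optionally gzip-compressed), and the function names `open_maybe_gzip` and `summarize_fastq` are hypothetical. It reads the file one line at a time, so memory use stays small even for files larger than 10 gigabytes.

```python
import gzip
from collections import Counter


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def summarize_fastq(path):
    """Stream a FASTQ file and return (read_count, total_bases, gc_fraction).

    Assumption: every record spans exactly four lines
    (header, sequence, '+', quality string).
    """
    reads = 0
    bases = Counter()
    with open_maybe_gzip(path) as handle:
        # Iterating over the handle yields one line at a time,
        # so the whole file is never held in memory.
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # second line of each record is the sequence
                reads += 1
                bases.update(line.strip().upper())
    total = sum(bases.values())
    gc_fraction = (bases["G"] + bases["C"]) / total if total else 0.0
    return reads, total, gc_fraction
```

Calling `summarize_fastq("reads.fastq.gz")` on a file of your own (the file name here is a placeholder) returns the number of reads, the number of bases, and the GC fraction as a tuple.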
Award ID(s):
2219900 2243532
PAR ID:
10613933
Author(s) / Creator(s):
Publisher / Repository:
Wiley Periodicals LLC
Date Published:
Journal Name:
Biochemistry and Molecular Biology Education
ISSN:
1470-8175
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract: New computational methods and next‐generation sequencing (NGS) approaches have enabled the use of thousands or hundreds of thousands of genetic markers to address previously intractable questions. The methods and massive marker sets present both new data analysis challenges and opportunities to visualize, understand, and apply population and conservation genomic data in novel ways. The large scale and complexity of NGS data also increase the expertise and effort required to thoroughly and thoughtfully analyze and interpret data. To aid in this endeavor, a recent workshop entitled “Population Genomic Data Analysis,” also known as “ConGen 2017,” was held at the University of Montana. The ConGen workshop brought together 15 instructors with knowledge in a wide range of topics, including NGS data filtering, genome assembly, genomic monitoring of effective population size, migration modeling, detecting adaptive genomic variation, genomewide association analysis, inbreeding depression, and landscape genomics. Here, we summarize the major themes of the workshop and the important take‐home points that were offered to students throughout. We emphasize increasing participation by women in population and conservation genomics as a vital step for the advancement of science. Some important themes that emerged during the workshop included the need for data visualization and its importance in finding problematic data, the effects of data filtering choices on downstream population genomic analyses, and the increasing availability of whole‐genome sequencing and the new challenges it presents. Our goal here is to help motivate and educate a worldwide audience to improve population genomic data analysis and interpretation, and thereby advance the contribution of genomics to molecular ecology, evolutionary biology, and especially to the conservation of biodiversity.
  2. The publication of the first human genome in 2001 transformed biomedical research. Since then, an explosion of new sequencing technologies has required engineers and computer scientists to invent computational methods to analyze and interpret the ever-growing data. Now, large-scale biological data encompasses many types of ‘omics’ datasets, including genomes, transcriptomes, proteomes and metabolomes, and each of these new datasets has created a new set of analytical challenges. To meet this need, the field of bioinformatics has expanded significantly, but there is still a large need for engineers and scientists to work in this inherently interdisciplinary field. Properly trained bioinformaticians have expertise in computer science/engineering and understand the biological and medical context underlying their work. Therefore, the development of robust bioinformatics training programs is critical to educate the next generation of bioinformaticians. Although undergraduate degree programs in bioinformatics exist, providing students with hands-on bioinformatics skills through immersive research experiences is necessary to prepare students for graduate work. Thus, this work describes a recently funded NSF – International Research Experience for Students (IRES) site: US-Sweden Clinical Bioinformatics Research Training Program targeted at training students from diverse educational backgrounds to prepare them for authentic bioinformatics research experiences. Given the inherent interdisciplinary nature of bioinformatics, it is extremely difficult to design a training program that prepares students from different backgrounds (computer science, bioengineering, computational biology, biology) to be successful in a bioinformatics research group. Therefore, this ‘Work-in-Progress’ describes the pre-departure training program developed for this IRES site and the initial lessons learned.
  3. This Work-in-Progress paper in the Research Category uses a retrospective mixed-methods study to better understand the factors that mediate learning of computational modeling by life scientists. Key stakeholders, including leading scientists, universities and funding agencies, have promoted computational modeling to enable life sciences research and improve the translation of genetic and molecular biology high-throughput data into clinical results. Software platforms to facilitate computational modeling by biologists who lack advanced mathematical or programming skills have had some success, but none has achieved widespread use among life scientists. Because computational modeling is a core engineering skill of value to other STEM fields, it is critical for engineering and computer science educators to consider how we help students from across STEM disciplines learn computational modeling. Currently we lack sufficient research on how best to help life scientists learn computational modeling. To address this gap, in 2017, we observed a short-format summer course designed for life scientists to learn computational modeling. The course used a simulation environment designed to lower programming barriers. We used semi-structured interviews to understand students' experiences while taking the course and in applying computational modeling after the course. We conducted interviews with graduate students and post-doctoral researchers who had completed the course. We also interviewed students who took the course between 2010 and 2013. Among these past attendees, we selected equal numbers of interview subjects who had and had not successfully published journal articles that incorporated computational modeling. This Work-in-Progress paper applies social cognitive theory to analyze the motivations of life scientists who seek training in computational modeling and their attitudes towards computational modeling. Additionally, we identify important social and environmental variables that influence successful application of computational modeling after course completion. The findings from this study may therefore help us educate biomedical and biological engineering students more effectively. Although this study focuses on life scientists, its findings can inform engineering and computer science education more broadly. Insights from this study may be especially useful in aiding incoming engineering and computer science students who do not have advanced mathematical or programming skills and in preparing undergraduate engineering students for collaborative work with life scientists.
  4. Software product line engineering is a best practice for managing reuse in families of software systems. In this work, we explore the use of product line engineering in the emerging programming domain of synthetic biology. In synthetic biology, living organisms are programmed to perform new functions or improve existing functions. These programs are designed and constructed using small building blocks made out of DNA. We conjecture that there are families of products that consist of common and variable DNA parts, and we can leverage product line engineering to help synthetic biologists build, evolve, and reuse these programs. As a first step towards this goal, we perform a domain engineering case study that leverages an open-source repository of more than 45,000 reusable DNA parts. We are able to identify features and their related artifacts, all of which can be composed to make different programs. We demonstrate that we can successfully build feature models representing families for two commonly engineered functions. We then analyze an existing synthetic biology case study and demonstrate how product line engineering can be beneficial in this domain. 
  5. Abstract: Background: Next-generation sequencing (NGS) is widely used for genome-wide identification and quantification of DNA elements involved in the regulation of gene transcription. Studies that generate multiple high-throughput NGS datasets require data integration methods for two general tasks: 1) generation of genome-wide data tracks representing an aggregate of multiple replicates of the same experiment; and 2) combination of tracks from different experimental types that provide complementary information regarding the location of genomic features such as enhancers. Results: NGS-Integrator is a Java-based command line application, facilitating efficient integration of multiple genome-wide NGS datasets. NGS-Integrator first transforms all input data tracks using the complement of the minimum Bayes’ factor so that all values are expressed in the range [0,1] representing the probability of a true signal given the background noise. Then, NGS-Integrator calculates the joint probability for every genomic position to create an integrated track. We provide examples using real NGS data generated in our laboratory and from the mouse ENCODE database. Conclusions: Our results show that NGS-Integrator is both time- and memory-efficient. Our examples show that NGS-Integrator can integrate information to facilitate downstream analyses that identify functional regulatory domains along the genome.