NSF PAR Search | NSF Public Access Repository

Rapid GPU-Based Pangenome Graph Layout

https://doi.org/10.1109/SC41406.2024.00035

Li, Jiajie; Schmelzle, Jan-Niklas; Du, Yixiao; Heumos, Simon; Guarracino, Andrea; Guidi, Giulia; Prins, Pjotr; Garrison, Erik; Zhang, Zhiru (November 2024, IEEE)

Free, publicly-accessible full text available November 17, 2025
Cluster-efficient pangenome graph construction with nf-core/pangenome

https://doi.org/10.1093/bioinformatics/btae609

Heumos, Simon; Heuer, Michael L; Hanssen, Friederike; Heumos, Lukas; Guarracino, Andrea; Heringer, Peter; Ehmele, Philipp; Prins, Pjotr; Garrison, Erik; Nahnsen, Sven (November 2024, Bioinformatics)
Alkan, Can (Ed.)
Abstract Motivation: Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time. Results: To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core’s best practices. Leveraging biocontainers ensures portability and seamless deployment in High-Performance Computing (HPC) environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 Escherichia coli sequences, achieving a two- to threefold speedup compared to PGGB without increasing greenhouse gas emissions. Availability and implementation: nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/docs/usage.
Free, publicly-accessible full text available November 1, 2025
Pangenome-Informed Language Models for Synthetic Genome Sequence Generation

https://doi.org/10.1101/2024.09.18.612131

Huang, Pengzhi; Charton, François; Schmelzle, Jan-Niklas M; Darnell, Shelby S; Prins, Pjotr; Garrison, Erik; Suh, G Edward (September 2024, bioRxiv)

Abstract Language Models (LM) have been extensively utilized for learning DNA sequence patterns and generating synthetic sequences. In this paper, we present a novel approach for the generation of synthetic DNA data using pangenomes in combination with LM. We introduce three innovative pangenome-based tokenization schemes, including two that can decouple from private data, while enhancing long DNA sequence generation. Our experimental results demonstrate the superiority of pangenome-based tokenization over classical methods in generating high-utility synthetic DNA sequences, highlighting a promising direction for the public sharing of genomic datasets.
Full Text Available
Pangenome graph layout by Path-Guided Stochastic Gradient Descent

https://doi.org/10.1093/bioinformatics/btae363

Heumos, Simon; Guarracino, Andrea; Schmelzle, Jan-Niklas M; Li, Jiajie; Zhang, Zhiru; Hagmann, Jörg; Nahnsen, Sven; Prins, Pjotr; Garrison, Erik (July 2024, Bioinformatics)
Robinson, Peter (Ed.)
Abstract Motivation: The increasing availability of complete genomes demands models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in a low-dimensional (e.g. two-dimensional) depiction. Due to a pangenome graph’s potentially excessive size, this is a significant challenge. Results: In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by SGD. We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. Availability and implementation: We integrated PG-SGD in ODGI, which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.
Full Text Available
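
The sampling idea in the PG-SGD abstract above can be illustrated with a small sketch. This is not ODGI's implementation: the function name `path_guided_sgd`, its parameters, the uniform sampling of step pairs, the single point per node, and the exponential learning-rate schedule are all assumptions chosen for illustration. It only shows how a target distance between two nodes can be read off path offsets instead of being computed for every pair of nodes in advance.

```python
import math
import random


def path_guided_sgd(paths, node_len, iters=20000, eps=0.01, seed=0):
    """Toy 2D layout (hypothetical helper, not ODGI's API).

    paths: list of paths, each a list of node ids.
    node_len: dict mapping node id -> sequence length in bp (> 0).
    Returns a dict mapping node id -> [x, y].
    """
    rng = random.Random(seed)
    nodes = sorted({n for p in paths for n in p})
    pos = {n: [rng.random(), rng.random()] for n in nodes}

    # Prefix sums: offsets[k][i] is the bp offset where step i of path k starts.
    offsets = []
    for p in paths:
        acc, off = 0, []
        for n in p:
            off.append(acc)
            acc += node_len[n]
        offsets.append(off)

    usable = [k for k, p in enumerate(paths) if len(p) >= 2]
    if not usable:
        return pos

    # Learning-rate bounds derived from the largest and smallest possible
    # target distances (an assumed schedule for this sketch).
    eta_max = max(offsets[k][-1] for k in usable) ** 2
    eta_min = eps * min(node_len.values()) ** 2

    for t in range(iters):
        # Exponential decay from eta_max down to eta_min.
        eta = eta_max * (eta_min / eta_max) ** (t / max(iters - 1, 1))
        k = rng.choice(usable)
        i, j = rng.sample(range(len(paths[k])), 2)
        a, b = paths[k][i], paths[k][j]
        if a == b:
            continue  # the path visits the same node twice
        # Target distance: genomic distance between the two steps on the path.
        d = abs(offsets[k][i] - offsets[k][j])
        if d == 0:
            continue
        dx = pos[a][0] - pos[b][0]
        dy = pos[a][1] - pos[b][1]
        mag = math.hypot(dx, dy) or 1e-9
        mu = min(eta / (d * d), 1.0)  # step size, capped at 1
        r = mu * (mag - d) / 2.0 / mag
        # Move both nodes toward (or away from) each other by half the error.
        pos[a][0] -= r * dx
        pos[a][1] -= r * dy
        pos[b][0] += r * dx
        pos[b][1] += r * dy
    return pos
```

For example, `path_guided_sgd([[1, 2, 3, 4, 5]], {n: 10 for n in range(1, 6)})` lays out a single linear path of five 10 bp nodes; each update touches only one sampled pair, so no all-pairs distance matrix is ever built.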
Creating a biomedical knowledge base by addressing GPT inaccurate responses and benchmarking context

https://doi.org/10.1101/2024.10.16.618663

Darnell, S Solomon; Overall, Rupert W; Guarracino, Andrea; Colonna, Vicenza; Villani, Flavia; Garrison, Erik; Isaac, Arun; Muli, Priscilla; Muriithi, Frederick Muriuki; Kabui, Alexander; et al (October 2024, bioRxiv)

We created GNQA, a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval augmented generation (RAG) with a focus on aging, dementia, Alzheimer’s and diabetes. We uploaded a corpus of three thousand peer reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT ‘hallucinations’, we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers. To assess the effectiveness of contextual information we collected evaluations and feedback from both domain expert users and ‘citizen scientists’ on the relevance of GPT responses. A key innovation of our study is automated evaluation by way of a RAG assessment system (RAGAS). RAGAS combines human expert assessment with AI-driven evaluation to measure the effectiveness of RAG systems. When evaluating the responses to their questions, human respondents give a “thumbs-up” 76% of the time. Meanwhile, RAGAS scores 90% on answer relevance on questions posed by experts. And when GPT generates questions, RAGAS scores 74% on answer relevance. With RAGAS we created a benchmark that can be used to continuously assess the performance of our knowledge base. Full GNQA functionality is embedded in the free GeneNetwork.org web service, an open-source system containing over 25 years of experimental data on model organisms and humans. The code developed for this study is published under a free and open-source software license at https://git.genenetwork.org/gn-ai/tree/README.md.
Full Text Available
Building pangenome graphs

https://doi.org/10.1038/s41592-024-02430-3

Garrison, Erik; Guarracino, Andrea; Heumos, Simon; Villani, Flavia; Bao, Zhigui; Tattini, Lorenzo; Hagmann, Jörg; Vorbrugg, Sebastian; Marco-Sola, Santiago; Kubica, Christian; et al (November 2024, Nature Methods)

Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder, a pipeline for constructing pangenome graphs without bias or exclusion. The PanGenome Graph Builder uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events and infer phylogenetic relationships.
Free, publicly-accessible full text available November 1, 2025
ODGI: understanding pangenome graphs

https://doi.org/10.1093/bioinformatics/btac308

Guarracino, Andrea; Heumos, Simon; Nahnsen, Sven; Prins, Pjotr; Garrison, Erik (May 2022, Bioinformatics)
Robinson, Peter (Ed.)
Abstract Motivation: Pangenome graphs provide a complete representation of the mutual alignment of collections of genomes. These models offer the opportunity to study the entire genomic diversity of a population, including structurally complex regions. Nevertheless, analyzing hundreds of gigabase-scale genomes using pangenome graphs is difficult as it is not well-supported by existing tools. Hence, fast and versatile software is required to ask advanced questions to such data in an efficient way. Results: We wrote Optimized Dynamic Genome/Graph Implementation (ODGI), a novel suite of tools that implements scalable algorithms and has an efficient in-memory representation of DNA pangenome graphs in the form of variation graphs. ODGI supports pre-built graphs in the Graphical Fragment Assembly format. ODGI includes tools for detecting complex regions, extracting pangenomic loci, removing artifacts, exploratory analysis, manipulation, validation and visualization. Its fast parallel execution facilitates routine pangenomic tasks, as well as pipelines that can quickly answer complex biological questions of gigabase-scale pangenome graphs. Availability and implementation: ODGI is published as free software under the MIT open source license. Source code can be downloaded from https://github.com/pangenome/odgi and documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda https://bioconda.github.io/recipes/odgi/README.html or GNU Guix https://github.com/pangenome/odgi/blob/master/guix.scm. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
Insights from the Biorepository and Integrative Genomics pediatric resource

https://doi.org/10.1038/s41467-025-59375-0

Buonaiuto, Silvia; Marsico, Franco; Mohammed, Akram; Chinthala, Lokesh K; Amos-Abanyie, Ernestine K; Baras, Aris; Abecasis, Goncalo; Ferrando, Adolfo; Coppola, Giovanni; Deubler, Andrew; et al (December 2025, Nature Communications)

Abstract The Biorepository and Integrative Genomics (BIG) Initiative in Tennessee has developed a pioneering resource to address gaps in genomic research by linking genomic, phenotypic, and environmental data from a diverse Mid-South population, including underrepresented groups. We analyzed 13,152 exomes from BIG and found significant genetic diversity, with 50% of participants inferred to have non-European or several types of admixed ancestry. Ancestry within the BIG cohort is stratified, with distinct geographic and demographic patterns: African ancestry is more common in urban areas, while European ancestry is more common in suburban regions. We observe ancestry-specific rates of novel genetic variants, which are enriched for functional or clinical relevance. Disease prevalence analysis linked ancestry and environmental factors, showing higher odds ratios for asthma and obesity in minority groups, particularly in the urban area. Finally, we observe discrepancies between self-reported race and genetic ancestry, with related individuals self-identifying in differing racial categories. These findings underscore the limitations of race as a biomedical variable. BIG has proven to be an effective model for community-centered precision medicine. We integrated genomics education and fostered great trust among the contributing communities. Future goals include cohort expansion and enhanced genomic analysis to ensure equitable healthcare outcomes.
Free, publicly-accessible full text available December 1, 2026
Building pangenome graphs

https://doi.org/10.1101/2023.04.05.535718

Garrison, Erik; Guarracino, Andrea; Heumos, Simon; Villani, Flavia; Bao, Zhigui; Tattini, Lorenzo; Hagmann, Jörg; Vorbrugg, Sebastian; Marco-Sola, Santiago; Kubica, Christian; et al (April 2023, bioRxiv)

Abstract Pangenome graphs can represent all variation between multiple genomes, but existing methods for constructing them are biased due to reference-guided approaches. In response, we have developed PanGenome Graph Builder (PGGB), a reference-free pipeline for constructing unbiased pangenome graphs. PGGB uses all-to-all whole-genome alignments and learned graph embeddings to build and iteratively refine a model in which we can identify variation, measure conservation, detect recombination events, and infer phylogenetic relationships.
Full Text Available
Journal of Open Source Software (JOSS): design and first-year review

https://doi.org/10.7717/peerj-cs.147

Smith, Arfon M.; Niemeyer, Kyle E.; Katz, Daniel S.; Barba, Lorena A.; Githinji, George; Gymrek, Melissa; Huff, Kathryn D.; Madan, Christopher R.; Cabunoc Mayes, Abigail; Moerman, Kevin M.; et al (January 2018, PeerJ Computer Science)
This article describes the motivation, design, and progress of the Journal of Open Source Software (JOSS). JOSS is a free and open-access journal that publishes articles describing research software. It has the dual goals of improving the quality of the software submitted and providing a mechanism for research software developers to receive credit. While designed to work within the current merit system of science, JOSS addresses the dearth of rewards for key contributions to science made in the form of software. JOSS publishes articles that encapsulate scholarship contained in the software itself, and its rigorous peer review targets the software components: functionality, documentation, tests, continuous integration, and the license. A JOSS article contains an abstract describing the purpose and functionality of the software, references, and a link to the software archive. The article is the entry point of a JOSS submission, which encompasses the full set of software artifacts. Submission and review proceed in the open, on GitHub. Editors, reviewers, and authors work collaboratively and openly. Unlike other journals, JOSS does not reject articles requiring major revision; while not yet accepted, articles remain visible and under review until the authors make adequate changes (or withdraw, if unable to meet requirements). Once an article is accepted, JOSS gives it a digital object identifier (DOI), deposits its metadata in Crossref, and the article can begin collecting citations on indexers like Google Scholar and other services. Authors retain copyright of their JOSS article, releasing it under a Creative Commons Attribution 4.0 International License. In its first year, starting in May 2016, JOSS published 111 articles, with more than 40 additional articles under review. JOSS is a sponsored project of the nonprofit organization NumFOCUS and is an affiliate of the Open Source Initiative (OSI).
Full Text Available