Search for: All records

Creators/Authors contains: "Guarracino, Andrea"

Rapid GPU-Based Pangenome Graph Layout

https://doi.org/10.1109/SC41406.2024.00035

Li, Jiajie; Schmelzle, Jan-Niklas; Du, Yixiao; Heumos, Simon; Guarracino, Andrea; Guidi, Giulia; Prins, Pjotr; Garrison, Erik; Zhang, Zhiru (November 2024, IEEE)

Full Text Available
Cluster-efficient pangenome graph construction with nf-core/pangenome

https://doi.org/10.1093/bioinformatics/btae609

Heumos, Simon; Heuer, Michael L; Hanssen, Friederike; Heumos, Lukas; Guarracino, Andrea; Heringer, Peter; Ehmele, Philipp; Prins, Pjotr; Garrison, Erik; Nahnsen, Sven (November 2024, Bioinformatics)
Alkan, Can (Ed.)
Abstract Motivation: Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time. Results: To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core’s best practices. Leveraging biocontainers ensures portability and seamless deployment in High-Performance Computing (HPC) environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 Escherichia coli sequences, achieving a two to threefold speedup compared to PGGB without increasing greenhouse gas emissions. Availability and implementation: nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/docs/usage.
Full Text Available
Recurrent evolution and selection shape structural diversity at the amylase locus

https://doi.org/10.1038/s41586-024-07911-1

Bolognini, Davide; Halgren, Alma; Lou, Runyang Nicolas; Raveane, Alessandro; Rocha, Joana L; Guarracino, Andrea; Soranzo, Nicole; Chin, Chen-Shan; Garrison, Erik; Sudmant, Peter H (October 2024, Nature)

Abstract The adoption of agriculture triggered a rapid shift towards starch-rich diets in human populations¹. Amylase genes facilitate starch digestion, and increased amylase copy number has been observed in some modern human populations with high-starch intake², although evidence of recent selection is lacking³,⁴. Here, using 94 long-read haplotype-resolved assemblies and short-read data from approximately 5,600 contemporary and ancient humans, we resolve the diversity and evolutionary history of structural variation at the amylase locus. We find that amylase genes have higher copy numbers in agricultural populations than in fishing, hunting and pastoral populations. We identify 28 distinct amylase structural architectures and demonstrate that nearly identical structures have arisen recurrently on different haplotype backgrounds throughout recent human history. AMY1 and AMY2A genes each underwent multiple duplication/deletion events with mutation rates up to more than 10,000-fold the single-nucleotide polymorphism mutation rate, whereas AMY2B gene duplications share a single origin. Using a pangenome-based approach, we infer structural haplotypes across thousands of humans identifying extensively duplicated haplotypes at higher frequency in modern agricultural populations. Leveraging 533 ancient human genomes, we find that duplication-containing haplotypes (with more gene copies than the ancestral haplotype) have rapidly increased in frequency over the past 12,000 years in West Eurasians, suggestive of positive selection. Together, our study highlights the potential effects of the agricultural revolution on human genomes and the importance of structural variation in human adaptation.
Full Text Available
Pangenome graph layout by Path-Guided Stochastic Gradient Descent

https://doi.org/10.1093/bioinformatics/btae363

Heumos, Simon; Guarracino, Andrea; Schmelzle, Jan-Niklas M; Li, Jiajie; Zhang, Zhiru; Hagmann, Jörg; Nahnsen, Sven; Prins, Pjotr; Garrison, Erik (July 2024, Bioinformatics)
Robinson, Peter (Ed.)
Abstract Motivation: The increasing availability of complete genomes demands for models to study genomic variability within entire populations. Pangenome graphs capture the full genomic similarity and diversity between multiple genomes. In order to understand them, we need to see them. For visualization, we need a human-readable graph layout: a graph embedding in low (e.g. two) dimensional depictions. Due to a pangenome graph’s potential excessive size, this is a significant challenge. Results: In response, we introduce a novel graph layout algorithm: the Path-Guided Stochastic Gradient Descent (PG-SGD). PG-SGD uses the genomes, represented in the pangenome graph as paths, as an embedded positional system to sample genomic distances between pairs of nodes. This avoids the quadratic cost seen in previous versions of graph drawing by SGD. We show that our implementation efficiently computes the low-dimensional layouts of gigabase-scale pangenome graphs, unveiling their biological features. Availability and implementation: We integrated PG-SGD in ODGI which is released as free software under the MIT open source license. Source code is available at https://github.com/pangenome/odgi.
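The path-guided sampling idea in this abstract can be illustrated with a small sketch. This is not the ODGI implementation: the function names (`path_offsets`, `stress`, `random_layout`, `pg_sgd_sketch`), the uniform pair sampling, and the annealing schedule (borrowed from the general graph-drawing-by-SGD recipe) are all assumptions made for illustration. It assumes every node has a positive length and every path has at least two steps. Each update picks two steps on one path, takes their nucleotide distance along that path as the target, and nudges the two nodes toward that distance, so no all-pairs distance matrix is ever built.

```python
import math
import random


def path_offsets(path, node_len):
    """Start offset (in bases) of every step along one path."""
    offsets, pos = [], 0
    for node in path:
        offsets.append(pos)
        pos += node_len[node]
    return offsets


def stress(paths, node_len, xy):
    """Weighted squared error between layout distances and path distances."""
    total = 0.0
    for path in paths:
        off = path_offsets(path, node_len)
        for a in range(len(path)):
            for b in range(a + 1, len(path)):
                d = off[b] - off[a]
                if d == 0 or path[a] == path[b]:
                    continue
                gap = math.dist(xy[path[a]], xy[path[b]]) - d
                total += gap * gap / (d * d)
    return total


def random_layout(node_len, seed=0):
    """Random 2D starting position for every node."""
    rng = random.Random(seed)
    return {n: [rng.random(), rng.random()] for n in node_len}


def pg_sgd_sketch(paths, node_len, xy, iters=30, updates=200, eps=0.01, seed=1):
    """Return a new layout refined by path-guided pairwise SGD updates."""
    rng = random.Random(seed)
    xy = {n: list(p) for n, p in xy.items()}  # work on a copy
    offs = [path_offsets(p, node_len) for p in paths]
    d_max = max(o[-1] for o in offs)
    d_min = min(node_len.values())
    eta_max, eta_min = d_max ** 2, eps * d_min ** 2
    decay = math.log(eta_max / eta_min) / (iters - 1)
    for t in range(iters):
        eta = eta_max * math.exp(-decay * t)  # annealed learning rate
        for _ in range(updates):
            p = rng.randrange(len(paths))
            a, b = rng.sample(range(len(paths[p])), 2)
            i, j = paths[p][a], paths[p][b]
            d = abs(offs[p][a] - offs[p][b])  # target: distance along the path
            if i == j or d == 0:
                continue
            dx, dy = xy[i][0] - xy[j][0], xy[i][1] - xy[j][1]
            mag = math.hypot(dx, dy) or 1e-9
            mu = min(eta / (d * d), 1.0)  # capped step size
            r = mu * (mag - d) / 2.0 / mag
            xy[i][0] -= r * dx
            xy[i][1] -= r * dy
            xy[j][0] += r * dx
            xy[j][1] += r * dy
    return xy
```

For a toy bubble such as `node_len = {1: 10, 2: 1, 3: 1, 4: 10}` with `paths = [[1, 2, 4], [1, 3, 4]]`, comparing `stress(...)` before and after `pg_sgd_sketch(...)` shows how far the layout has moved toward the path distances.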
Full Text Available
The formation and propagation of human Robertsonian chromosomes

https://doi.org/10.1101/2024.09.24.614821

Gomes de Lima, Leonardo; Guarracino, Andrea; Koren, Sergey; Potapova, Tamara; McKinney, Sean; Rhie, Arang; Solar, Steven J; Seidel, Chris; Fagen, Brandon; Walenz, Brian P; et al (September 2024, bioRxiv)

Abstract Robertsonian chromosomes are a type of variant chromosome found commonly in nature. Present in one in 800 humans, these chromosomes can underlie infertility, trisomies, and increased cancer incidence. Recognized cytogenetically for more than a century, their origins have remained mysterious. Recent advances in genomics allowed us to assemble three human Robertsonian chromosomes completely. We identify a common breakpoint and epigenetic changes in centromeres that provide insight into the formation and propagation of common Robertsonian translocations. Further investigation of the assembled genomes of chimpanzee and bonobo highlights the structural features of the human genome that uniquely enable the specific crossover event that creates these chromosomes. Resolving the structure and epigenetic features of human Robertsonian chromosomes at a molecular level paves the way to understanding how chromosomal structural variation occurs more generally, and how chromosomes evolve.
Full Text Available
Creating a biomedical knowledge base by addressing GPT inaccurate responses and benchmarking context

https://doi.org/10.1101/2024.10.16.618663

Darnell, S Solomon; Overall, Rupert W; Guarracino, Andrea; Colonna, Vicenza; Villani, Flavia; Garrison, Erik; Isaac, Arun; Muli, Priscilla; Muriithi, Frederick Muriuki; Kabui, Alexander; et al (October 2024, bioRxiv)

We created GNQA, a generative pre-trained transformer (GPT) knowledge base driven by a performant retrieval augmented generation (RAG) with a focus on aging, dementia, Alzheimer’s and diabetes. We uploaded a corpus of three thousand peer reviewed publications on these topics into the RAG. To address concerns about inaccurate responses and GPT ‘hallucinations’, we implemented a context provenance tracking mechanism that enables researchers to validate responses against the original material and to get references to the original papers. To assess the effectiveness of contextual information we collected evaluations and feedback from both domain expert users and ‘citizen scientists’ on the relevance of GPT responses. A key innovation of our study is automated evaluation by way of a RAG assessment system (RAGAS). RAGAS combines human expert assessment with AI-driven evaluation to measure the effectiveness of RAG systems. When evaluating the responses to their questions, human respondents give a “thumbs-up” 76% of the time. Meanwhile, RAGAS scores 90% on answer relevance on questions posed by experts. And when GPT generates questions, RAGAS scores 74% on answer relevance. With RAGAS we created a benchmark that can be used to continuously assess the performance of our knowledge base. Full GNQA functionality is embedded in the free GeneNetwork.org web service, an open-source system containing over 25 years of experimental data on model organisms and human. The code developed for this study is published under a free and open-source software license at https://git.genenetwork.org/gn-ai/tree/README.md.
Full Text Available
Gapless assembly of complete human and plant chromosomes using only nanopore sequencing

https://doi.org/10.1101/gr.279334.124

Koren, Sergey; Bao, Zhigui; Guarracino, Andrea; Ou, Shujun; Goodwin, Sara; Jenike, Katharine M; Lucas, Julian; McNulty, Brandy; Park, Jimin; Rautiainen, Mikko; et al (November 2024, Genome Research)

The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Bioscience (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, “telomere-to-telomere” genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT “Duplex” sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used “Pore-C” chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
Full Text Available
Unbiased pangenome graphs

https://doi.org/10.1093/bioinformatics/btac743

Garrison, Erik; Guarracino, Andrea (November 2022, Bioinformatics)
Alkan, Can (Ed.)
Abstract Motivation: Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines. Results: We design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species. Availability and implementation: seqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm.
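The notion of reducing transitive matches into single graph positions can be shown on a toy scale. The sketch below is not seqwish's method: it swaps the implicit interval tree and disk-backed structures for a plain in-memory union-find, ignores reverse-complement matches, and uses a hypothetical function name (`induce_classes`) and a hypothetical match tuple format `(seq_a, start_a, seq_b, start_b, length)`. It only illustrates the closure itself: if base x aligns to y and y aligns to z, all three collapse into one class, which would become one base of a graph node.

```python
def induce_classes(seqs, matches):
    """Group (sequence, position) pairs connected by exact base matches."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Every input base starts in its own class.
    for name, seq in seqs.items():
        for i in range(len(seq)):
            find((name, i))

    # Merge the two bases of every aligned column that is an exact match.
    for a, start_a, b, start_b, length in matches:
        for k in range(length):
            if seqs[a][start_a + k] == seqs[b][start_b + k]:
                union((a, start_a + k), (b, start_b + k))

    # Collect the members of each class.
    classes = {}
    for name, seq in seqs.items():
        for i in range(len(seq)):
            classes.setdefault(find((name, i)), []).append((name, i))
    return list(classes.values())
```

With `seqs = {"x": "ACGT", "y": "ACGT", "z": "ACTT"}` and matches `[("x", 0, "y", 0, 4), ("y", 0, "z", 0, 4)]`, position 0 of all three sequences ends up in one class even though x and z were never aligned directly, while the mismatching base of z at position 2 stays in a class of its own.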
Full Text Available
Building pangenome graphs

https://doi.org/10.1038/s41592-024-02430-3

Garrison, Erik; Guarracino, Andrea; Heumos, Simon; Villani, Flavia; Bao, Zhigui; Tattini, Lorenzo; Hagmann, Jörg; Vorbrugg, Sebastian; Marco-Sola, Santiago; Kubica, Christian; et al (November 2024, Nature Methods)

Pangenome graphs can represent all variation between multiple reference genomes, but current approaches to build them exclude complex sequences or are based upon a single reference. In response, we developed the PanGenome Graph Builder, a pipeline for constructing pangenome graphs without bias or exclusion. The PanGenome Graph Builder uses all-to-all alignments to build a variation graph in which we can identify variation, measure conservation, detect recombination events and infer phylogenetic relationships.
Full Text Available
Optimal gap-affine alignment in O(s) space

https://doi.org/10.1093/bioinformatics/btad074

Marco-Sola, Santiago; Eizenga, Jordan M; Guarracino, Andrea; Paten, Benedict; Garrison, Erik; Moreto, Miquel (February 2023, Bioinformatics)
Martelli, Pier Luigi (Ed.)
Abstract Motivation: Pairwise sequence alignment remains a fundamental problem in computational biology and bioinformatics. Recent advances in genomics and sequencing technologies demand faster and scalable algorithms that can cope with the ever-increasing sequence lengths. Classical pairwise alignment algorithms based on dynamic programming are strongly limited by quadratic requirements in time and memory. The recently proposed wavefront alignment algorithm (WFA) introduced an efficient algorithm to perform exact gap-affine alignment in O(ns) time, where s is the optimal score and n is the sequence length. Notwithstanding these bounds, WFA’s O(s^2) memory requirements become computationally impractical for genome-scale alignments, leading to a need for further improvement. Results: In this article, we present the bidirectional WFA algorithm, the first gap-affine algorithm capable of computing optimal alignments in O(s) memory while retaining WFA’s time complexity of O(ns). As a result, this work improves the lowest known memory bound O(n) to compute gap-affine alignments. In practice, our implementation never requires more than a few hundred MBs aligning noisy Oxford Nanopore Technologies reads up to 1 Mbp long while maintaining competitive execution times. Availability and implementation: All code is publicly available at https://github.com/smarco/BiWFA-paper. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
