NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

CoverM: read alignment statistics for metagenomics

https://doi.org/10.1093/bioinformatics/btaf147

Aroney, Samuel_T N; Newell, Rhys_J P; Nissen, Jakob N; Camargo, Antonio Pedro; Tyson, Gene W; Woodcroft, Ben J (March 2025, Bioinformatics)
Alkan, Can (Ed.)
Abstract SummaryGenome-centric analysis of metagenomic samples is a powerful method for understanding the function of microbial communities. Calculating read coverage is a central part of analysis, enabling differential coverage binning for recovery of genomes and estimation of microbial community composition. Coverage is determined by processing read alignments to reference sequences of either contigs or genomes. Per-reference coverage is typically calculated in an ad-hoc manner, with each software package providing its own implementation and specific definition of coverage. Here we present a unified software package CoverM which calculates several coverage statistics for contigs and genomes in an ergonomic and flexible manner. It uses “Mosdepth arrays” for computational efficiency and avoids unnecessary I/O overhead by calculating coverage statistics from streamed read alignment results. Availability and implementationCoverM is free software available at https://github.com/wwood/coverm. CoverM is implemented in Rust, with Python (https://github.com/apcamargo/pycoverm) and Julia (https://github.com/JuliaBinaryWrappers/CoverM_jll.jl) interfaces.
more » « less
Free, publicly-accessible full text available March 29, 2026
wgatools : an ultrafast toolkit for manipulating whole-genome alignments

https://doi.org/10.1093/bioinformatics/btaf132

Wei, Wenjie; Gui, Songtao; Yang, Jian; Garrison, Erik; Yan, Jianbing; Liu, Hai-Jun (March 2025, Bioinformatics)
Alkan, Can (Ed.)
Abstract SummaryWith the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole-genome alignment formats, offering practical tools for conversion, processing, evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics. Availability and implementationwgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide. Built with Rust for efficiency and safe memory usage, it ensures fast performance and can handle large datasets consisting of hundreds of genomes. wgatools is published as free software under the MIT open-source license, and its source code is freely available at https://github.com/wjwei-handsome/wgatools and https://zenodo.org/records/14882797.
more » « less
Free, publicly-accessible full text available March 29, 2026
Micro-DeMix: a mixture beta-multinomial model for investigating the heterogeneity of the stool microbiome compositions

https://doi.org/10.1093/bioinformatics/btae667

Liu, Ruoqian; Wang, Yue; Cheng, Dan (November 2024, Bioinformatics)
Alkan, Can (Ed.)
Free, publicly-accessible full text available November 28, 2025
Cluster-efficient pangenome graph construction with nf-core/pangenome

https://doi.org/10.1093/bioinformatics/btae609

Heumos, Simon; Heuer, Michael L; Hanssen, Friederike; Heumos, Lukas; Guarracino, Andrea; Heringer, Peter; Ehmele, Philipp; Prins, Pjotr; Garrison, Erik; Nahnsen, Sven (November 2024, Bioinformatics)
Alkan, Can (Ed.)
Abstract MotivationPangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time. ResultsTo overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core’s best practices. Leveraging biocontainers ensures portability and seamless deployment in High-Performance Computing (HPC) environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 Escherichia coli sequences, achieving a two to threefold speedup compared to PGGB without increasing greenhouse gas emissions. Availability and implementationnf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/docs/usage.
more » « less
Free, publicly-accessible full text available November 1, 2025
LexicHash: sequence similarity estimation via lexicographic comparison of hashes

https://doi.org/10.1093/bioinformatics/btad652

Greenberg, Grant; Ravi, Aditya Narayan; Shomorony, Ilan (November 2023, Bioinformatics)
Alkan, Can (Ed.)
Abstract MotivationPairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. ResultsIn this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how “lexicographically similar” the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision–recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n2) scaling associated with pairwise similarity search. Availability and implementationLexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
more » « less
Full Text Available
HQAlign: aligning nanopore reads for SV detection using current-level modeling

https://doi.org/10.1093/bioinformatics/btad580

Joshi, Dhaivat; Diggavi, Suhas; Chaisson, Mark J; Kannan, Sreeram (October 2023, Bioinformatics)
Alkan, Can (Ed.)
Abstract MotivationDetection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, along with an accurate alignment of these long reads play an important role in identifying novel SVs. Long-read sequencers, such as nanopore sequencing, can address this problem by providing very long reads but with high error rates, making accurate alignment challenging. Many errors induced by nanopore sequencing have a bias because of the physics of the sequencing process and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection problems. In this article, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using base-called nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments. ResultsWe show that HQAlign captures about 4%–6% complementary SVs across different datasets, which are missed by minimap2 alignments while having a standalone performance at par with minimap2 for real nanopore reads data. For the common SV calls between HQAlign and minimap2, HQAlign improves the start and the end breakpoint accuracy by about 10%–50% for SVs across different datasets. Moreover, HQAlign improves the alignment rate to 89.35% from minimap2 85.64% for nanopore reads alignment to recent telomere-to-telomere CHM13 assembly, and it improves to 86.65% from 83.48% for nanopore reads alignment to GRCh37 human genome. Availability and implementationhttps://github.com/joshidhaivat/HQAlign.git.
more » « less
Full Text Available
Unbiased pangenome graphs

https://doi.org/10.1093/bioinformatics/btac743

Garrison, Erik; Guarracino, Andrea (November 2022, Bioinformatics)
Alkan, Can (Ed.)
Abstract Motivation Pangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines. Results We design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species. Availability and implementation seqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm.
more » « less
Full Text Available
SNIKT: sequence-independent adapter identification and removal in long-read shotgun sequencing data

https://doi.org/10.1093/bioinformatics/btac389

Ranjan, Piyush; Brown, Christopher A; Erb-Downward, John R; Dickson, Robert P (June 2022, Bioinformatics)
Alkan, Can (Ed.)
Abstract Summary Here, we introduce SNIKT, a command-line tool for sequence-independent visual confirmation and input-assisted removal of adapter contamination in whole-genome shotgun or metagenomic shotgun long-read sequencing DNA or RNA data. Availability and Implementation SNIKT is implemented in R and is compatible with Unix-like platforms. The source code, along with documentation, is freely available under an MIT license at https://github.com/piyuranjan/SNIKT. Supplementary information Supplementary data are available at Bioinformatics online.
more » « less
Full Text Available
JBrowseR: an R interface to the JBrowse 2 genome browser

https://doi.org/10.1093/bioinformatics/btab459

Hershberg, Elliot A; Stevens, Garrett; Diesh, Colin; Xie, Peter; De Jesus Martinez, Teresa; Buels, Robert; Stein, Lincoln; Holmes, Ian (November 2021, Bioinformatics)
Alkan, Can (Ed.)
Abstract MotivationGenome browsers are an essential tool in genome analysis. Modern genome browsers enable complex and interactive visualization of a wide variety of genomic data modalities. While such browsers are very powerful, they can be challenging to configure and program for bioinformaticians lacking expertise in web development. ResultsWe have developed an R package that provides an interface to the JBrowse 2 genome browser. The package can be used to configure and customize the browser entirely with R code. The browser can be deployed from the R console, or embedded in Shiny applications or R Markdown documents. Availability and implementationJBrowseR is available for download from CRAN, and the source code is openly available from the Github repository at https://github.com/GMOD/JBrowseR/.
more » « less
Full Text Available

Search for: All records