NSF PAR Search | NSF Public Access Repository


Meta-colored Compacted de Bruijn Graphs

Pibiri, Giulio Ermanno; Fan, Jason; Patro, Rob (May 2024, Springer Nature)
Ma, Jian (Ed.)
Full Text Available
Fulgor: a fast and compact k-mer index for large-scale matching and color queries

https://doi.org/10.1186/s13015-024-00251-9

Fan, Jason; Khan, Jamshed; Singh, Noor Pratap; Pibiri, Giulio Ermanno; Patro, Rob (January 2024, Algorithms for Molecular Biology)

The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto (the strongest competitor in terms of index space vs. query time trade-off), Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2–6× faster to construct.
Fulgor: A Fast and Compact k-mer Index for Large-Scale Matching and Color Queries

https://doi.org/10.4230/LIPIcs.WABI.2023.18

Fan, Jason; Singh, Noor Pratap; Khan, Jamshed; Pibiri, Giulio Ermanno; Patro, Rob (August 2023, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023))
Belazzougui, Djamal; Ouangraoua, Aïda (Ed.)
The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1 + o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space and a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2–6× faster to construct.
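As a rough illustration of the "1 + o(1) bits per unitig" idea in the abstract above, the sketch below shows one way a single bit per unitig can encode the unitig-to-color map when unitigs are stored in color order: a bit marks the last unitig of each color group, and the color id of a unitig is the number of set bits before its position (a rank query). Everything here is an assumption made for illustration, not Fulgor's actual code: the function and class names, the toy data, and the prefix-sum array that stands in for a succinct rank structure (a real index would answer rank in o(n) extra bits rather than with a full integer array).

```python
from itertools import accumulate


def build_boundary_bits(unitig_color_ids):
    """Mark the last unitig of each color group with a 1 bit.

    Assumes (hypothetical input layout) that unitigs are already sorted
    by color id and that color ids are the consecutive integers 0, 1, 2, ...
    """
    n = len(unitig_color_ids)
    return [
        1 if i == n - 1 or unitig_color_ids[i + 1] != unitig_color_ids[i] else 0
        for i in range(n)
    ]


class RankBitVector:
    """Bit vector with rank support.

    The prefix-sum list is a plain stand-in for a succinct rank
    structure; it is NOT space-efficient and is used only for clarity.
    """

    def __init__(self, bits):
        self._prefix = [0] + list(accumulate(bits))

    def rank1(self, i):
        """Return the number of 1 bits in positions [0, i)."""
        return self._prefix[i]


def color_id_of_unitig(boundaries, unitig_id):
    """Color id = number of color groups that end before this unitig."""
    return boundaries.rank1(unitig_id)


# Toy data (made up): six unitigs in color order, three distinct colors.
unitig_color_ids = [0, 0, 0, 1, 2, 2]
# Each color is a sorted list of reference ids (an inverted list).
color_sets = [[0, 3], [1], [0, 1, 2]]

bits = build_boundary_bits(unitig_color_ids)
boundaries = RankBitVector(bits)
```

In this toy setup, a query would first map a k-mer to its unitig id through the dictionary (not shown), then call `color_id_of_unitig` and read the corresponding entry of `color_sets`.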
Perplexity: evaluating transcript abundance estimation in the absence of ground truth

https://doi.org/10.1186/s13015-022-00214-y

Fan, Jason; Chan, Skylar; Patro, Rob (December 2022, Algorithms for Molecular Biology)

Background: There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins gene-expression-based analysis routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. Results: We derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. Furthermore, we demonstrate theoretically and experimentally that perplexity can be computed for arbitrary transcript abundance estimation models. Conclusions: Alongside the derivation and implementation of perplexity for transcript abundance estimation, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
Full Text Available
Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth

https://doi.org/10.4230/LIPIcs.WABI.2021.4

Fan, Jason; Chan, Skylar; Patro, Rob (July 2021, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021))
Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in resulting estimates that can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins gene-expression-based analysis routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g. producing smooth versus sparse estimates), strategies for performing model selection in experimental data have been addressed informally at best. Thus, we derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. To our knowledge, our study is the first to make possible model selection for transcript abundance estimation on experimental data in the absence of ground truth.
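For readers unfamiliar with the metric, the sketch below computes perplexity in its generic language-model form: the exponential of the negative mean log-likelihood over held-out items, here taken to be held-out fragments scored under an abundance estimate. This is only the textbook definition under stated assumptions. The abstract does not specify how per-fragment likelihoods are obtained or how the RNA-seq corner cases are handled, so the input list is hypothetical and the zero-likelihood check is a placeholder, not the authors' treatment.

```python
import math


def perplexity(fragment_likelihoods):
    """Generic perplexity: exp(-(1/N) * sum_i log p_i).

    `fragment_likelihoods` is a hypothetical list of per-fragment
    likelihoods p_i > 0 for held-out fragments under some abundance
    estimate. Lower perplexity means the estimate explains the
    held-out fragments better.
    """
    if not fragment_likelihoods:
        raise ValueError("need at least one held-out fragment")
    if any(p <= 0.0 for p in fragment_likelihoods):
        # Placeholder: zero-likelihood fragments need special handling,
        # which this sketch does not attempt to reproduce.
        raise ValueError("likelihoods must be strictly positive")
    mean_log_likelihood = sum(math.log(p) for p in fragment_likelihoods) / len(
        fragment_likelihoods
    )
    return math.exp(-mean_log_likelihood)
```

Under this definition, model selection would amount to computing the value for each candidate hyperparameter setting on the same held-out fragments and preferring the setting with the lowest perplexity.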
Full Text Available
Matrix (factorization) reloaded: flexible methods for imputing genetic interactions with cross-species and side information

https://doi.org/10.1093/bioinformatics/btaa818

Fan, Jason; Li, Xuan Cindy; Crovella, Mark; Leiserson, Mark D (December 2020, Bioinformatics)
Motivation: Mapping genetic interactions (GIs) can reveal important insights into cellular function and has potential translational applications. There has been great progress in developing high-throughput experimental systems for measuring GIs (e.g. with double knockouts) as well as in defining computational methods for inferring (imputing) unknown interactions. However, existing computational methods for imputation have largely been developed for and applied in baker’s yeast, even as experimental systems have begun to allow measurements in other contexts. Importantly, existing methods face a number of limitations in requiring specific side information and with respect to computational cost. Further, few have addressed how GIs can be imputed when data are scarce. Results: In this article, we address these limitations by presenting a new imputation framework, called Extensible Matrix Factorization (EMF). EMF is a framework of composable models that flexibly exploit cross-species information in the form of GI data across multiple species, and arbitrary side information in the form of kernels (e.g. from protein–protein interaction networks). We perform a rigorous set of experiments on these models in matched GI datasets from baker’s and fission yeast. These include the first such experiments on genome-scale GI datasets in multiple species in the same study. We find that EMF models that exploit side and cross-species information improve imputation, especially in data-scarce settings. Further, we show that EMF outperforms the state-of-the-art deep learning method, even when using strictly less data, and incurs orders of magnitude less computational cost. Availability: Implementations of models and experiments are available at: https://github.com/lrgr/EMF. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
Research in Computational Molecular Biology 27th Annual International Conference, RECOMB 2023, Istanbul, Turkey, April 16–19, 2023, Proceedings

Luo, Runpeng; Lin, Yu; Fan, Jason; Khan, Jamshed; Pibiri, Giulio Ermanno; Patro, Rob; Tabatabaee, Yasamin; Roch, Sébastien; Warnow, Tandy; Chandra, Ghanshyam; et al (April 2023, Springer Cham)
Tang, Haixu (Ed.)
This book constitutes the refereed proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2023, held in Istanbul, Turkey, from April 16–19, 2023. The 11 regular and 33 short papers presented in this book were carefully reviewed and selected from 188 submissions. The papers report on original research in all areas of computational molecular biology and bioinformatics.
Full Text Available
Functional protein representations from biological networks enable diverse cross-species inference

https://doi.org/10.1093/nar/gkz132

Fan, Jason; Cannistra, Anthony; Fried, Inbar; Lim, Tim; Schaffner, Thomas; Crovella, Mark; Hescott, Benjamin; Leiserson, Mark D (March 2019, Nucleic Acids Research)

Full Text Available
Revealing Ultra-High-Energy Gamma-Ray Emission from the eHWC J1825-134 Region with HAWC

https://doi.org/10.22323/1.444.0796

Albert, Andrea; Alfaro, Ruben Jose; Alvarez, César; Andres, Alexis; Arteaga Velazquez, Juan Carlos; Avila Rojas, Daniel Omar; Ayala Solares, Hugo Alberto; Babu, Rishi; Belmont-Moreno, Ernesto; Capistrán Rojas, Tomás; et al (July 2023, ICRC2023)

Full Text Available
The HAWC ultra-high-energy gamma-ray map with more than 5 years of data

https://doi.org/10.22323/1.444.0698

Harding, Pat; Albert, Andrea; Alfaro, Ruben Jose; Alvarez, César; Andres, Alexis; Arteaga Velazquez, Juan Carlos; Avila Rojas, Daniel Omar; Ayala Solares, Hugo Alberto; Babu, Rishi; Belmont-Moreno, Ernesto; et al (July 2023, ICRC2023)

Full Text Available
