This content will become publicly available on April 28, 2026

Title: iSeqSearch: incremental protein search for iBlast/iMMSeqs2/iDiamond
Background: The advancement of sequencing technology has led to a rapid increase in the amount of DNA and protein sequence data; consequently, the size of genomic and proteomic databases is constantly growing. As a result, database searches must be continually updated to account for newly added data, but repeatedly re-searching the entire existing dataset wastes resources. Incremental database search addresses this problem.

Methods: One recently introduced incremental search method is iBlast, which wraps the BLAST sequence search method with an algorithm that reuses previously processed results to increase search efficiency. The iBlast wrapper, however, must be generalized to support the better-performing DNA/protein sequence search methods that have since been developed, namely MMseqs2 and Diamond. To address this need, we propose iSeqSearch, which extends iBlast with support for MMseqs2 (iMMseqs2) and Diamond (iDiamond), providing a more general and broadly applicable incremental search framework. We also revise the previously published iBlast wrapper to make it more robust and usable by the general community.

Results: iMMseqs2 and iDiamond, which apply the incremental approach, perform nearly identically to MMseqs2 and Diamond. Notably, when comparing hit rankings with measures such as the Pearson correlation, we observe a high concordance of over 0.9, indicating closely matching results. Moreover, in some cases our incremental approach, iSeqSearch, which extends the iBlast merge function to iMMseqs2 and iDiamond, returns more hits than the conventional MMseqs2 and Diamond methods.

Conclusion: The incremental approach using iMMseqs2 and iDiamond efficiently reuses previously processed data while maintaining high accuracy and concordance in search results, reducing resource waste when searching continually growing genomic and proteomic databases.
The sample code and data are available on GitHub and Zenodo (https://github.com/EESI/Incremental-Protein-Search; DOI: 10.5281/zenodo.14675319).
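The merge step underlying such incremental wrappers can be sketched as follows. This is a minimal illustration only, assuming BLAST-style e-values that scale linearly with search-space size; the function names are invented here, not taken from the iBlast or iSeqSearch code, and the real wrappers work with the underlying Karlin–Altschul statistics rather than this simplification.

```python
def rescale_evalue(evalue, partition_size, total_size):
    # BLAST-style e-values grow roughly linearly with database size, so a hit
    # found by searching one partition can be rescaled to the full database
    # instead of being recomputed from scratch.
    return evalue * total_size / partition_size

def merge_hits(old_hits, new_hits, old_db_size, delta_db_size):
    """Combine hits from a past search (old database) with hits from a
    search over only the newly added sequences (delta database).

    old_hits / new_hits: lists of (subject_id, evalue) tuples for one query.
    """
    total = old_db_size + delta_db_size
    merged = [(sid, rescale_evalue(e, old_db_size, total)) for sid, e in old_hits]
    merged += [(sid, rescale_evalue(e, delta_db_size, total)) for sid, e in new_hits]
    merged.sort(key=lambda hit: hit[1])  # best (smallest) e-value first
    return merged
```

Because only the delta database is searched, the cost of an update scales with the new data rather than with the full, ever-growing database.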
Award ID(s):
1936791 1936782
PAR ID:
10586592
Publisher / Repository:
PeerJ
Date Published:
Journal Name:
PeerJ
Volume:
13
ISSN:
2167-8359
Page Range / eLocation ID:
e19171
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: Computational cell type deconvolution enables the estimation of cell type abundance from bulk tissues and is important for understanding the tissue microenvironment, especially in tumor tissues. With the rapid development of deconvolution methods, many benchmarking studies have been published aiming at a comprehensive evaluation of these methods. Benchmarking studies rely on cell-type-resolved single-cell RNA-seq data to create simulated pseudobulk datasets by adding individual cell types in controlled proportions. Results: In our work, we show that the standard application of this approach, which uses randomly selected single cells regardless of the intrinsic differences between them, generates synthetic bulk expression values that lack appropriate biological variance. We demonstrate why and how the current bulk simulation pipeline with random cells is unrealistic and propose a heterogeneous simulation strategy as a solution. The heterogeneously simulated bulk samples match the variance observed in real bulk datasets and therefore provide concrete benefits for benchmarking in several ways. We demonstrate that conceptual classes of deconvolution methods differ dramatically in their robustness to heterogeneity, with reference-free methods performing particularly poorly. For regression-based methods, the heterogeneous simulation provides an explicit framework to disentangle the contributions of reference construction and regression method to performance. Finally, we perform an extensive benchmark of diverse methods across eight different datasets and find BayesPrism and a hybrid MuSiC/CIBERSORTx approach to be the top performers. Conclusions: Our heterogeneous bulk simulation method and the entire benchmarking framework are implemented in a user-friendly package (https://github.com/humengying0907/deconvBenchmarking; https://doi.org/10.5281/zenodo.8206516), enabling further developments in deconvolution methods.
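The basic pseudobulk construction that this abstract critiques can be sketched in a few lines; this is a generic illustration with invented names, not the deconvBenchmarking implementation. A heterogeneity-aware variant would draw each type's cells from a single donor or cluster rather than pooling all cells of that type, which is the abstract's central point.

```python
import numpy as np

def simulate_pseudobulk(expr, cell_types, proportions, n_cells=500, rng=None):
    """Build one synthetic bulk sample by sampling single cells per type.

    expr:        (cells, genes) count matrix from a single-cell experiment
    cell_types:  one type label per cell (same length as expr rows)
    proportions: dict mapping cell type -> target fraction of the sample
    """
    if rng is None:
        rng = np.random.default_rng(0)
    cell_types = np.asarray(cell_types)
    bulk = np.zeros(expr.shape[1])
    for ct, frac in proportions.items():
        # Standard pipeline: sample cells of this type uniformly at random,
        # ignoring donor/cluster structure (the source of missing variance).
        idx = np.flatnonzero(cell_types == ct)
        take = rng.choice(idx, size=int(round(frac * n_cells)), replace=True)
        bulk += expr[take].sum(axis=0)
    return bulk
```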
  2. While space-borne optical and near-infrared facilities have succeeded in delivering a precise and spatially resolved picture of our Universe, their small survey area is known to underrepresent the true diversity of galaxy populations. Ground-based surveys have reached comparable depths but at lower spatial resolution, resulting in source confusion that hampers accurate photometry extraction. What once was limited to the infrared regime has now begun to challenge ground-based ultradeep surveys, affecting detection and photometry alike. Failing to address these challenges will mean forfeiting a representative view into the distant Universe. We introduce The Farmer: an automated, reproducible profile-fitting photometry package that pairs a library of smooth parametric models from The Tractor with a decision tree that determines the best-fit model in concert with neighboring sources. Photometry is measured by fitting the models on other bands, leaving brightness free to vary. The resulting photometric measurements are naturally total, and no aperture corrections are required. Supporting diagnostics (e.g., χ²) enable measurement validation. As fitting models is relatively time intensive, The Farmer is built with high-performance computing routines. We benchmark The Farmer on a set of realistic COSMOS-like images and find accurate photometry, number counts, and galaxy shapes. The Farmer is already being utilized to produce catalogs for several large-area deep extragalactic surveys, where it has been shown to tackle some of the most challenging optical and near-infrared data available, with the promise of extending to other ultradeep surveys expected in the near future. The Farmer is available to download from GitHub (https://github.com/astroweaver/the_farmer) and Zenodo (https://doi.org/10.5281/zenodo.8205817).
  3. Motivation: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra but may face challenges due to experimental variation between replicated spectra and similar fragmentation patterns among distinct peptides. We present SpecEncoder, a deep metric learning approach that addresses these challenges by transforming MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification. Results: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%–2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%–15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%–12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder also identifies more peptides than deep-learning-enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification in proteomic data analyses. Availability and Implementation: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.
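Once spectra are embedded, identification reduces to a nearest-neighbour lookup in the latent space. The sketch below shows one common way to do that lookup, using cosine similarity; it is a generic illustration, not the index or similarity measure SpecEncoder actually uses.

```python
import numpy as np

def embed_search(query_vecs, library_vecs, top_k=1):
    """Return the indices of the top_k most similar library embeddings
    for each query embedding, ranked by cosine similarity.

    query_vecs:   (n_queries, dim) array of query spectrum embeddings
    library_vecs: (n_library, dim) array of library spectrum embeddings
    """
    # Normalize rows so a dot product equals cosine similarity.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    sims = q @ lib.T                      # (n_queries, n_library) similarities
    return np.argsort(-sims, axis=1)[:, :top_k]
```

In a hybrid setup, `library_vecs` could hold embeddings of both experimental and predicted spectra, so one lookup covers both the spectral-library and database-search sides.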
  4. We present griz photometric light curves for the full 5 yr of the Dark Energy Survey Supernova (DES-SN) program, obtained with both forced point-spread-function photometry on difference images (DiffImg) performed during survey operations, and scene modelling photometry (SMP) on search images processed after the survey. This release contains 31,636 DiffImg and 19,706 high-quality SMP light curves, the latter of which contain 1635 photometrically classified SNe that pass cosmology quality cuts. This sample spans the largest redshift (z) range ever covered by a single SN survey (0.1 < z < 1.13) and is the largest sample of SNe from a single instrument ever used for cosmological constraints. We describe in detail the improvements made to obtain the final DES-SN photometry and provide a comparison to what was used in the 3 yr DES-SN spectroscopically confirmed Type Ia SN sample. We also include a comparative analysis of the performance of the SMP photometry with respect to the real-time DiffImg forced photometry and find that SMP photometry is more precise, more accurate, and less sensitive to the host-galaxy surface-brightness anomaly. The public release of the light curves and ancillary data can be found at github.com/des-science/DES-SN5YR and doi:10.5281/zenodo.12720777.
  5. Observations of core-collapse supernovae (CCSNe) reveal a wealth of information about the dynamics of the supernova ejecta and its composition but very little direct information about the progenitor. Constraining properties of the progenitor and the explosion requires coupling the observations with a theoretical model of the explosion. Here we begin with the CCSN simulations of Couch et al. (STIR), which use a nonparametric treatment of the neutrino transport while also accounting for turbulence and convection. In this work we use the SuperNova Explosion Code to evolve the CCSN hydrodynamics to later times and compute bolometric light curves. Focusing on Type IIP SNe (SNe IIP), we then (1) directly compare the theoretical STIR explosions to observations and (2) assess how properties of the progenitor's core can be estimated from optical photometry in the plateau phase alone. First, the distribution of plateau luminosities (L50) and ejecta velocities achieved by our simulations is similar to the observed distributions. Second, we fit our models to the light curves and velocity evolution of some well-observed SNe. Third, we recover well-known correlations, as well as the difficulty of connecting any one SN property to zero-age main-sequence mass. Finally, we show that there is a usable, linear correlation between iron core mass and L50, such that optical photometry alone of SNe IIP can give us insight into the cores of massive stars. Illustrating this by application to a few SNe, we find iron core masses of 1.3–1.5 M⊙ with typical errors of 0.05 M⊙. Data are publicly available online on Zenodo: doi:10.5281/zenodo.6631964.
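The linear L50-to-iron-core-mass calibration described above amounts to fitting a straight line to simulated (luminosity, core mass) pairs and then inverting it for an observed supernova. The sketch below shows that procedure with invented data points; the actual calibration coefficients come from the paper's simulation grid, not from these numbers.

```python
import numpy as np

# Hypothetical (log10 plateau luminosity, iron core mass) pairs standing in
# for a grid of simulated explosions; the real values are in the paper.
log_l50 = np.array([41.8, 42.0, 42.2, 42.4])   # log10(L50 / erg s^-1)
m_fe    = np.array([1.30, 1.36, 1.43, 1.50])   # iron core mass in solar masses

# Least-squares straight-line fit: M_Fe = a * log10(L50) + b
a, b = np.polyfit(log_l50, m_fe, 1)

def iron_core_mass(log_l50_obs):
    """Estimate the iron core mass from an observed plateau luminosity."""
    return a * log_l50_obs + b
```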