NSF PAR Search | NSF Public Access Repository

DTI-LM: language model powered drug–target interaction prediction

https://doi.org/10.1093/bioinformatics/btae533

Ahmed, Khandakar Tanvir; Ansari, Md Istiaq; Zhang, Wei; Martelli, Pier Luigi (ed.) (September 2024, Bioinformatics)

Abstract

Motivation: The identification and understanding of drug–target interactions (DTIs) play a pivotal role in the drug discovery and development process. Sequence representations of drugs and proteins in computational models offer advantages such as widespread availability, easier input quality control, and reduced computational resource requirements. These advantages make them efficient and accessible tools for various computational biology and drug discovery applications. Many sequence-based DTI prediction methods have been developed over the years. Despite these methodological advances, cold start DTI prediction involving an unknown drug or protein remains a challenging task, particularly for sequence-based models. We introduce DTI-LM, a novel framework that leverages advanced pretrained language models, harnessing their context-capturing abilities along with neighborhood information to predict DTIs. DTI-LM is specifically designed to rely solely on sequence representations for drugs and proteins, aiming to bridge the gap between warm start and cold start predictions.

Results: Large-scale experiments on four datasets show that DTI-LM can achieve state-of-the-art performance on DTI predictions. Notably, it overcomes the common challenges faced by sequence-based models in cold start predictions for proteins. The incorporation of neighborhood information through a graph attention network further enhances prediction accuracy. Nevertheless, a disparity persists between cold start predictions for proteins and drugs. A detailed examination of DTI-LM reveals that language models exhibit contrasting capabilities in capturing similarities between drugs and proteins.

Availability and implementation: Source code is available at https://github.com/compbiolabucf/DTI-LM.
Galaxy Helm chart: a standardized method for deploying production Galaxy servers

https://doi.org/10.1093/bioinformatics/btae486

Goonasekera, Nuwan; Mahmoud, Alexandru; Suderman, Keith; Afgan, Enis; Martelli, Pier Luigi (ed.) (August 2024, Bioinformatics)

Abstract

Motivation: The Galaxy application is a popular open-source framework for data-intensive sciences, counting thousands of monthly users across more than 100 public servers. To support a growing number of users and a greater variety of use cases, the complexity of a production-grade Galaxy installation has also grown, requiring more administration effort. There is a need for a rapid and reproducible Galaxy deployment method that can be kept at high availability with minimal maintenance.

Results: We describe the Galaxy Helm chart, which codifies all elements of a production-grade Galaxy installation into a single package. Deployable on Kubernetes clusters, the chart encapsulates supporting software services and implements the best-practices model for running Galaxy. It is also the most rapid method available for deploying a scalable, production-grade Galaxy instance on one’s own infrastructure. The chart is highly configurable, allowing systems administrators to swap dependent services if desired. Notable uses of the chart include on-demand, fully automated deployments on AnVIL, training infrastructure for the Bioconductor project, and the AWS-recommended solution for running Galaxy on the Amazon cloud.

Availability and implementation: The source code for Galaxy Helm is available at https://github.com/galaxyproject/galaxy-helm, the corresponding Helm package at https://github.com/CloudVE/helm-charts, and the required Galaxy container image at https://github.com/galaxyproject/galaxy-docker-k8s.
SEraster: a rasterization preprocessing framework for scalable spatial omics data analysis

https://doi.org/10.1093/bioinformatics/btae412

Aihara, Gohta; Clifton, Kalen; Chen, Mayling; Li, Zhuoyan; Atta, Lyla; Miller, Brendan F; Satija, Rahul; Hickey, John W; Fan, Jean; Martelli, Pier Luigi (ed.) (June 2024, Bioinformatics)

Abstract

Motivation: Spatial omics data demand computational analysis, but many analysis tools have computational resource requirements that increase with the number of cells analyzed. This presents scalability challenges as researchers use spatial omics technologies to profile millions of cells.

Results: To enhance the scalability of spatial omics data analysis, we developed a rasterization preprocessing framework called SEraster that aggregates cellular information into spatial pixels. We apply SEraster to both real and simulated spatial omics data prior to spatially variable gene expression analysis to demonstrate that such preprocessing can reduce computational resource requirements while maintaining high performance, including in comparison with other down-sampling approaches. We further integrate SEraster with existing analysis tools to characterize cell-type spatial co-enrichment across length scales. Finally, we apply SEraster to enable analysis of a mouse pup spatial omics dataset with over a million cells to identify tissue-level and cell-type-specific spatially variable genes as well as spatially co-enriched cell types that recapitulate expected organ structures.

Availability and implementation: SEraster is implemented as an R package on GitHub (https://github.com/JEFworks-Lab/SEraster) with additional tutorials at https://JEF.works/SEraster.
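The idea of aggregating cellular information into spatial pixels can be illustrated with a minimal Python sketch. This is not the SEraster R package or its API: the function name rasterize_mean, the square-pixel grid, and the choice of a per-pixel mean are all assumptions made here for illustration.

```python
import numpy as np

def rasterize_mean(coords, values, resolution):
    """Average per-cell values within square pixels (hypothetical helper).

    coords: (n_cells, 2) array of x, y positions
    values: (n_cells, n_features) array of per-cell measurements
    resolution: pixel side length, in the same units as coords
    Returns the integer grid index of each occupied pixel and the
    mean of the values of all cells that fall inside it.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Integer grid index of the pixel that contains each cell.
    idx = np.floor((coords - coords.min(axis=0)) / resolution).astype(int)
    # Group cells by pixel; `inverse` maps each cell to its pixel row.
    pixels, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    # Sum the values per pixel, then divide by the number of cells.
    sums = np.zeros((len(pixels), values.shape[1]))
    np.add.at(sums, inverse, values)
    counts = np.bincount(inverse, minlength=len(pixels))
    return pixels, sums / counts[:, None]
```

Downstream analysis then runs on one row per occupied pixel instead of one row per cell, which is where the reduction in computational cost would come from in a scheme like this.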
Inferring delays in partially observed gene regulation processes

https://doi.org/10.1093/bioinformatics/btad670

Hong, Hyukpyo; Cortez, Mark Jayson; Cheng, Yu-Yu; Kim, Hang Joon; Choi, Boseung; Josić, Krešimir; Kim, Jae Kyoung; Martelli, Pier Luigi (ed.) (November 2023, Bioinformatics)

Abstract

Motivation: Cell function is regulated by gene regulatory networks (GRNs) defined by protein-mediated interactions between constituent genes. Despite advances in experimental techniques, we can still measure only a fraction of the processes that govern GRN dynamics. To infer the properties of GRNs from partial observations, unobserved sequential processes can be replaced with distributed time delays, yielding non-Markovian models. Inference methods based on the resulting models suffer from the curse of dimensionality.

Results: We develop a simulation-based Bayesian MCMC method employing an approximate likelihood for the efficient and accurate inference of GRN parameters when only some of their products are observed. We illustrate our approach using a two-step activation model: an activation signal leads to the accumulation of an unobserved regulatory protein, which triggers the expression of observed fluorescent proteins. With prior information about observed fluorescent protein synthesis, our method successfully infers the dynamics of the unobserved regulatory protein. We can estimate the delay and kinetic parameters characterizing target regulation, including transcription, translation, and target searching of an unobserved protein, from experimental measurements of the products of its target gene. Our method is scalable and can be used to analyze non-Markovian models with hidden components.

Availability and implementation: Our code is implemented in R and is freely available with a simple example dataset at https://github.com/Mathbiomed/SimMCMC.
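A small Python sketch can illustrate why a chain of unobserved sequential steps may be summarized by a single distributed delay. It is not the inference method from the paper. It assumes, purely for illustration, that the hidden chain consists of n_steps identical exponential reactions with the same rate, in which case the total waiting time follows a gamma (Erlang) distribution with shape n_steps and scale 1/rate. The function names and parameter values are hypothetical.

```python
import numpy as np

def sequential_delay_samples(n_steps, rate, n_samples, seed=0):
    """Total waiting time through n_steps hidden exponential reactions."""
    rng = np.random.default_rng(seed)
    # Each row holds the waiting times of the individual hidden steps.
    steps = rng.exponential(scale=1.0 / rate, size=(n_samples, n_steps))
    return steps.sum(axis=1)

def gamma_delay_samples(n_steps, rate, n_samples, seed=1):
    """Draw the same total delay directly from one gamma distribution."""
    rng = np.random.default_rng(seed)
    return rng.gamma(shape=n_steps, scale=1.0 / rate, size=n_samples)

if __name__ == "__main__":
    explicit = sequential_delay_samples(5, 2.0, 20000)
    lumped = gamma_delay_samples(5, 2.0, 20000)
    # Both should have mean near n_steps / rate and variance near n_steps / rate**2.
    print(explicit.mean(), lumped.mean())
    print(explicit.var(), lumped.var())
```

Under this assumption, a model only needs the delay distribution's parameters rather than the state of every hidden intermediate, at the cost of no longer being Markovian.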
Accurately modeling biased random walks on weighted networks using node2vec+

https://doi.org/10.1093/bioinformatics/btad047

Liu, Renming; Hirn, Matthew; Krishnan, Arjun; Martelli, Pier Luigi (ed.) (January 2023, Bioinformatics)

Abstract

Motivation: Accurately representing biological networks in a low-dimensional space, also known as network embedding, is a critical step in network-based machine learning and is carried out widely using node2vec, an unsupervised method based on biased random walks. However, while many networks, including functional gene interaction networks, are dense, weighted graphs, node2vec is fundamentally limited in its ability to use edge weights during the biased random walk generation process, and therefore underuses the information in the network.

Results: Here, we present node2vec+, a natural extension of node2vec that accounts for edge weights when calculating walk biases and reduces to node2vec in the cases of unweighted graphs or unbiased walks. Using two synthetic datasets, we empirically show that node2vec+ is more robust to additive noise than node2vec in weighted graphs. Then, using genome-scale functional gene networks to solve a wide range of gene function and disease prediction tasks, we demonstrate the superior performance of node2vec+ over node2vec in the case of weighted graphs. Notably, due to the limited amount of training data in the gene classification tasks, graph neural networks such as GCN and GraphSAGE are outperformed by both node2vec and node2vec+.

Availability and implementation: The data and code are available on GitHub at https://github.com/krishnanlab/node2vecplus_benchmarks. All additional data underlying this article are available on Zenodo at https://doi.org/10.5281/zenodo.7007164.

Supplementary information: Supplementary data are available at Bioinformatics online.
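For context, the biased random walk of the original node2vec can be sketched in Python as follows. This is the baseline scheme, not node2vec+: the bias (1/p, 1, or 1/q) depends only on whether the candidate node is the previous node, a neighbor of the previous node, or neither, and the edge weight enters only as a multiplier. The dict-of-dicts graph layout and the function names are assumptions made for this sketch.

```python
import random

def transition_weights(graph, prev, cur, p, q):
    """Unnormalized node2vec transition weights out of `cur`, given `prev`.

    graph: dict mapping node -> {neighbor: edge weight} (undirected).
    """
    weights = {}
    for nxt, w in graph[cur].items():
        if nxt == prev:
            bias = 1.0 / p      # return to the previous node
        elif nxt in graph[prev]:
            bias = 1.0          # candidate is also adjacent to prev
        else:
            bias = 1.0 / q      # move away from prev
        weights[nxt] = w * bias
    return weights

def node2vec_walk(graph, start, length, p=1.0, q=1.0, seed=0):
    """Generate one walk of at most `length` nodes starting at `start`."""
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        if not graph[cur]:
            break  # dead end: no outgoing edges
        if len(walk) == 1:
            # First step has no previous node: sample by edge weight only.
            weights = dict(graph[cur])
        else:
            weights = transition_weights(graph, walk[-2], cur, p, q)
        nodes = list(weights)
        walk.append(rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0])
    return walk
```

Note that the middle branch tests only for the existence of an edge between the candidate and the previous node, whatever its weight. In a dense weighted graph, where almost every pair of nodes is connected, that test is nearly always true, which is the limitation the abstract describes.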
