
Title: DeepPurpose: a deep learning library for drug–target interaction prediction
Abstract Summary Accurate prediction of drug–target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models have shown promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. Availability and implementation https://github.com/kexinhuang12345/DeepPurpose. Supplementary information Supplementary data are available at Bioinformatics online.
Authors:
Editors:
Wren, Jonathan
Award ID(s):
2030459
Publication Date:
NSF-PAR ID:
10293085
Journal Name:
Bioinformatics
Volume:
36
Issue:
22-23
Page Range or eLocation-ID:
5545 to 5547
ISSN:
1367-4803
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation Anti-cancer drug sensitivity prediction using deep learning models for individual cell lines is a significant challenge in personalized medicine. Recently developed REFINED (REpresentation of Features as Images with NEighborhood Dependencies) CNN (Convolutional Neural Network)-based models have shown promising results in improving drug sensitivity prediction. The primary idea behind REFINED-CNN is representing high dimensional vectors as compact images with spatial correlations that can benefit from CNN architectures. However, the mapping from a high dimensional vector to a compact 2D image depends on the a priori choice of the distance metric and projection scheme, with limited empirical procedures guiding these choices. Results In this article, we consider an ensemble of REFINED-CNN models built under different choices of distance metrics and/or projection schemes that can improve upon a single-projection REFINED-CNN model. Results, illustrated using the NCI60 and NCI-ALMANAC databases, demonstrate that the ensemble approaches can provide significant improvement in prediction performance compared to individual models. We also develop the theoretical framework for combining different distance metrics to arrive at a single 2D mapping. Results demonstrate that the distance-averaged REFINED-CNN produces performance comparable to that of the stacking REFINED-CNN ensemble, but at significantly lower computational cost. Availability and implementation The source code, scripts, and data used in the paper have been deposited in GitHub (https://github.com/omidbazgirTTU/IntegratedREFINED). Supplementary information Supplementary data are available at Bioinformatics online.
  2. Abstract Motivation Graph embedding learning, which aims to automatically learn low-dimensional node representations, has drawn increasing attention in recent years. To date, most recent graph embedding methods have been evaluated on social and information networks and have not been comprehensively studied on biomedical networks under systematic experiments and analyses. On the other hand, for a variety of biomedical network analysis tasks, traditional techniques such as matrix factorization (which can be seen as a type of graph embedding method) have shown promising results, and hence there is a need to systematically evaluate the more recent graph embedding methods (e.g. random walk-based and neural network-based) in terms of their usability and potential to further the state-of-the-art. Results We select 11 representative graph embedding methods and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug–drug interaction (DDI) prediction, protein–protein interaction (PPI) prediction; and 2 node classification tasks: medical term semantic type classification, protein function prediction. Our experimental results demonstrate that the recent graph embedding methods achieve promising results and deserve more attention in future biomedical graph analysis. Compared with three state-of-the-art methods for DDA, DDI and protein function prediction, the recent graph embedding methods achieve competitive performance without using any biological features, and the learned embeddings can be treated as complementary representations for the biological features. By summarizing the experimental results, we provide general guidelines for properly selecting graph embedding methods and setting their hyper-parameters for different biomedical tasks. Availability and implementation As part of our contributions in the paper, we develop an easy-to-use Python package with detailed instructions, BioNEV, available at: https://github.com/xiangyue9607/BioNEV, including all source code and datasets, to facilitate studying various graph embedding methods on biomedical tasks. Supplementary information Supplementary data are available at Bioinformatics online.
  3. Abstract Motivation

    Properties of molecules are indicative of their functions and thus are useful in many applications. With the advances of deep-learning methods, computational approaches for predicting molecular properties are gaining increasing momentum. However, customized, advanced methods and comprehensive tools for this task are currently lacking.

    Results

    Here, we develop a suite of comprehensive machine-learning methods and tools spanning different computational models, molecular representations and loss functions for molecular property prediction and drug discovery. Specifically, we represent molecules as both graphs and sequences. Built on these representations, we develop novel deep models for learning from molecular graphs and sequences. In order to learn effectively from highly imbalanced datasets, we develop advanced loss functions that optimize areas under precision–recall curves (PRCs) and receiver operating characteristic (ROC) curves. Altogether, our work not only serves as a comprehensive tool, but also contributes toward developing novel and advanced graph and sequence-learning methodologies. Results on both online and offline antibiotics discovery and molecular property prediction tasks show that our methods achieve consistent improvements over prior methods. In particular, our methods achieve #1 ranking in terms of both ROC-AUC (area under curve) and PRC-AUC on the AI Cures open challenge for drug discovery related to COVID-19.

    Availability and implementation

    Our source code is released as part of the MoleculeX library (https://github.com/divelab/MoleculeX) under AdvProp.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  4. Abstract Motivation

    Mitochondria are essential organelles in most eukaryotes. They not only play an important role in energy metabolism but also take part in many critical cytopathological processes. Abnormal mitochondria can trigger a series of human diseases, such as Parkinson's disease, multifactor disorder and Type-II diabetes. Protein submitochondrial localization enables the understanding of protein function in studying disease pathogenesis and drug design.

    Results

    We proposed a new method, SubMito-XGBoost, for protein submitochondrial localization prediction. Three steps are included: (i) the g-gap dipeptide composition (g-gap DC), pseudo-amino acid composition (PseAAC), auto-correlation function (ACF) and Bi-gram position-specific scoring matrix (Bi-gram PSSM) are employed to extract protein sequence features, (ii) Synthetic Minority Oversampling Technique (SMOTE) is used to balance samples, and the ReliefF algorithm is applied for feature selection and (iii) the obtained feature vectors are fed into XGBoost to predict protein submitochondrial locations. Under leave-one-out cross-validation (LOOCV), SubMito-XGBoost obtained satisfactory prediction results compared with existing methods. The prediction accuracies of the SubMito-XGBoost method on the two training datasets M317 and M983 were 97.7% and 98.9%, which are 2.8–12.5% and 3.8–9.9% higher than other methods, respectively. The prediction accuracy of the independent test set M495 was 94.8%, which is significantly better than the existing studies. The proposed method also achieves satisfactory predictive performance on plant and non-plant protein submitochondrial datasets. SubMito-XGBoost also plays an important role in new drug design for the treatment of related diseases.

    Availability and implementation

    The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/SubMito-XGBoost/.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  5. Abstract Motivation

    Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here, we propose NetTIME, a multitask learning framework for predicting cell-type-specific TF binding sites with base-pair resolution.

    Results

    We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method’s predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings.

    Availability and implementation

    NetTIME is freely available at https://github.com/ryi06/NetTIME and the code is also archived at https://doi.org/10.5281/zenodo.6994897.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

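As an illustration of the g-gap dipeptide composition (g-gap DC) features named in item 4 above, the following Python sketch counts residue pairs separated by g positions and normalizes the counts into a 400-entry frequency vector. It is a minimal sketch, not the SubMito-XGBoost code: the function name `g_gap_dipeptide_composition`, the choice to skip pairs that contain non-standard residues, and the normalization by the number of counted pairs are all assumptions made for illustration.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence, g=0):
    """Return a dict mapping each of the 400 residue pairs to its frequency.

    A g-gap dipeptide is the pair (sequence[i], sequence[i + g + 1]),
    i.e. two residues with g other residues between them.
    Hypothetical helper: name and normalization are illustrative assumptions.
    """
    sequence = sequence.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # assumption: ignore pairs with non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short (or no valid pairs): return an all-zero vector.
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    features = g_gap_dipeptide_composition("ACAC", g=1)
    # Print only the pairs that actually occur in the toy sequence.
    print({pair: freq for pair, freq in features.items() if freq > 0})
```

Feature vectors of this kind, computed for several values of g, could then be concatenated with other descriptors before class balancing, feature selection and classification, as the three-step outline in item 4 describes.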