Title: Leveraging Large Language Models for Predicting Microbial Virulence from Protein Structure and Sequence
In the aftermath of COVID-19, screening for pathogens has never been a more relevant problem. However, computational screening for pathogens is challenging due to a variety of factors, including (i) the complexity and role of the host, (ii) virulence factor divergence and dynamics, and (iii) population and community-level dynamics. Considering a potential pathogen's molecular interactions, specifically individual proteins and protein interactions, can help pinpoint which proteins of a given microbe may cause disease. However, existing tools for pathogen screening rely on existing annotations (KEGG, GO, etc.), making the assessment of novel and unannotated proteins more challenging. Here, we present an LLM-inspired approach that considers protein sequence and structure to predict protein virulence. We present a two-stage model incorporating evolutionary features captured from the DistilProtBert language model and protein structure in a graph convolutional network. Our model predicts virulence function better than a sequence-only approach when high-quality structures are available, thus representing a path forward for virulence prediction of novel and unannotated proteins.
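The two-stage idea in the abstract (per-residue language-model features fed into a graph convolutional network over the protein structure) can be illustrated with a minimal sketch. This is not the authors' implementation: the 8 Å contact cutoff, the single graph-convolution layer, mean pooling, the logistic output, and all function names are assumptions made for illustration, and random vectors stand in for DistilProtBert embeddings. It is written in plain Python to stay dependency-free.

```python
import math
import random

def contact_adjacency(coords, cutoff=8.0):
    """Binary residue contact map from 3D coordinates.
    The cutoff (in angstroms) is an assumed value; the diagonal is 1,
    so every residue carries a self-loop."""
    return [[1.0 if math.dist(p, q) < cutoff else 0.0 for q in coords]
            for p in coords]

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in adj]
    n = len(adj)
    return [[adj[i][j] * inv_sqrt[i] * inv_sqrt[j] for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def gcn_layer(a_hat, h, w):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(v, 0.0) for v in row] for row in z]

def predict_score(coords, embeddings, w1, w2):
    """Stage 1 output (per-residue embeddings) goes through stage 2
    (graph convolution over the contact map), then mean pooling and a
    logistic output give a score in (0, 1)."""
    a_hat = normalize_adjacency(contact_adjacency(coords))
    h = gcn_layer(a_hat, embeddings, w1)
    pooled = [sum(col) / len(h) for col in zip(*h)]
    logit = sum(p * w for p, w in zip(pooled, w2))
    return 1.0 / (1.0 + math.exp(-logit))

# Toy inputs: random coordinates, random stand-in embeddings, untrained weights.
random.seed(0)
n_res, emb_dim, hidden = 20, 8, 4
coords = [[random.uniform(0, 20) for _ in range(3)] for _ in range(n_res)]
embeddings = [[random.gauss(0, 1) for _ in range(emb_dim)] for _ in range(n_res)]
w1 = [[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(emb_dim)]
w2 = [random.gauss(0, 0.1) for _ in range(hidden)]
score = predict_score(coords, embeddings, w1, w2)
print(f"toy virulence score: {score:.3f}")
```

With untrained weights the score is meaningless; the sketch only shows how sequence-derived features and structure-derived connectivity could be combined in one forward pass.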
Award ID(s):
2239114
PAR ID:
10502775
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
BCB '23: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
ISBN:
9798400701269
Page Range / eLocation ID:
1 to 6
Format(s):
Medium: X
Location:
Houston TX USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Designing protein-binding proteins is critical for drug discovery. However, artificial-intelligence-based design of such proteins is challenging due to the complexity of protein–ligand interactions, the flexibility of ligand molecules and amino acid side chains, and sequence–structure dependencies. We introduce PocketGen, a deep generative model that produces residue sequence and atomic structure of the protein regions in which ligand interactions occur. PocketGen promotes consistency between protein sequence and structure by using a graph transformer for structural encoding and a sequence refinement module based on a protein language model. The graph transformer captures interactions at multiple scales, including atom, residue and ligand levels. For sequence refinement, PocketGen integrates a structural adapter into the protein language model, ensuring that structure-based predictions align with sequence-based predictions. PocketGen can generate high-fidelity protein pockets with enhanced binding affinity and structural validity. It operates ten times faster than physics-based methods and achieves a 97% success rate, defined as the percentage of generated pockets with higher binding affinity than reference pockets. Additionally, it attains an amino acid recovery rate exceeding 63%.
  2. Despite their lack of a rigid structure, intrinsically disordered regions (IDRs) in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate IDRs with high accuracy. In this study, we present Disordered Region prediction using Bidirectional Encoder Representations from Transformers (DR-BERT), a compact protein language model. Unlike most popular tools, DR-BERT is pretrained on unannotated proteins and trained to predict IDRs without relying on explicit evolutionary or biophysical data. Despite this, DR-BERT demonstrates significant improvement over existing methods on the Critical Assessment of protein Intrinsic Disorder (CAID) evaluation dataset and outperforms competitors on two out of four test cases in the CAID 2 dataset, while maintaining competitiveness in the others. This performance is due to the information learned during pretraining and DR-BERT’s ability to use contextual information.
  3. The increasing threat of antibiotic resistance underscores the urgent need for innovative strategies to combat infectious diseases, including the development of antivirulants. Microbial pathogens rely on their virulence factors to initiate and sustain infections. Antivirulants are small molecules designed to target virulence factors, thereby attenuating the virulence of infectious microbes. The bacterial type IV pilus (T4P), an extracellular protein filament that depends on the T4P machinery (T4PM) for its biogenesis, dynamics and function, is a key virulence factor in many significant bacterial pathogens. While the T4PM presents a promising antivirulence target, the systematic identification of inhibitors for its multiple protein constituents remains a considerable challenge. Here we report a novel high‐throughput screening (HTS) approach for discovering T4P inhibitors. It uses Pseudomonas aeruginosa, a high‐priority pathogen, in combination with its T4P‐targeting phage, φKMV. Screening of a library of 2168 compounds using an optimised protocol led to the identification of tuspetinib, based on its deterrence of the lysis of P. aeruginosa by φKMV. Our findings show that tuspetinib also inhibits two additional T4P‐targeting phages, while having no effect on a phage that recognises lipopolysaccharides as its receptor. Additionally, tuspetinib impedes T4P‐mediated motility in P. aeruginosa and Acinetobacter species without impacting growth or flagellar motility. This bacterium‐phage pairing approach is applicable to a broad range of virulence factors that are required for phage infection, paving the way for the development of advanced chemotherapeutics against antibiotic‐resistant infections.
  4. Proteins play a central role in biology from immune recognition to brain activity. While major advances in machine learning have improved our ability to predict protein structure from sequence, determining protein function from its sequence or structure remains a major challenge. Here, we introduce holographic convolutional neural network (H-CNN) for proteins, which is a physically motivated machine learning approach to model amino acid preferences in protein structures. H-CNN reflects physical interactions in a protein structure and recapitulates the functional information stored in evolutionary data. H-CNN accurately predicts the impact of mutations on protein stability and binding of protein complexes. Our interpretable computational model for protein structure–function maps could guide design of novel proteins with desired function. 
  5. Skolnick, Jeffrey (Ed.)
    Systematically discovering protein-ligand interactions across the entire human and pathogen genomes is critical in chemical genomics, protein function prediction, drug discovery, and many other areas. However, more than 90% of gene families remain “dark”—i.e., their small-molecule ligands are undiscovered due to experimental limitations or human/historical biases. Existing computational approaches typically fail when the dark protein differs from those with known ligands. To address this challenge, we have developed a deep learning framework, called PortalCG, which consists of four novel components: (i) a 3-dimensional ligand binding site enhanced sequence pre-training strategy to encode the evolutionary links between ligand-binding sites across gene families; (ii) an end-to-end pretraining-fine-tuning strategy to reduce the impact of inaccuracy of predicted structures on function predictions by recognizing the sequence-structure-function paradigm; (iii) a new out-of-cluster meta-learning algorithm that extracts and accumulates information learned from predicting ligands of distinct gene families (meta-data) and applies the meta-data to a dark gene family; and (iv) a stress model selection step, using different gene families in the test data from those in the training and development data sets to facilitate model deployment in a real-world scenario. In extensive and rigorous benchmark experiments, PortalCG considerably outperformed state-of-the-art techniques of machine learning and protein-ligand docking when applied to dark gene families, and demonstrated its generalization power for target identifications and compound screenings under out-of-distribution (OOD) scenarios. Furthermore, in an external validation for the multi-target compound screening, the performance of PortalCG surpassed the rational design from medicinal chemists. 
Our results also suggest that a differentiable sequence-structure-function deep learning framework, where protein structural information serves as an intermediate layer, could be superior to conventional methodology where predicted protein structures were used for the compound screening. We applied PortalCG to two case studies to exemplify its potential in drug discovery: designing selective dual-antagonists of dopamine receptors for the treatment of opioid use disorder (OUD), and illuminating the understudied human genome for target diseases that do not yet have effective and safe therapeutics. Our results suggested that PortalCG is a viable solution to the OOD problem in exploring understudied regions of protein functional space. 