Abstract Variational autoencoders are unsupervised learning models with generative capabilities. When applied to protein data, they classify sequences by phylogeny and generate de novo sequences that preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings and the functional and fitness properties of several systems, including globins, β-lactamases, ion channels, and transcription factors. We show how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights into directed and natural protein evolution. We propose that combining the generative properties and functional predictive power of variational autoencoders with coevolutionary analysis could be beneficial in applications for protein engineering and design.
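The Potts Hamiltonian referred to above assigns an energy to a sequence from single-site fields and pairwise couplings inferred by direct coupling analysis. As a minimal sketch (the field and coupling tensors here are toy placeholders, not fitted parameters):

```python
import numpy as np

def potts_energy(seq, h, J):
    """Potts Hamiltonian H(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    seq: integer-encoded sequence of length L (states in 0..q-1)
    h:   single-site fields, shape (L, q)
    J:   pairwise couplings, shape (L, L, q, q)
    """
    L = len(seq)
    e = -sum(h[i, seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            e -= J[i, j, seq[i], seq[j]]
    return e

# Toy example: 3 positions, a 2-letter alphabet, and zero couplings
L, q = 3, 2
h = np.zeros((L, q)); h[:, 1] = 1.0    # fields favoring state 1 at every site
J = np.zeros((L, L, q, q))
print(potts_energy([1, 1, 1], h, J))   # -3.0
```

In a latent generative landscape, this energy is evaluated on sequences decoded from points in the VAE latent space, so low-energy regions of the landscape correspond to statistically favorable sequences.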
Acceleration in Low-Rank Tensor Completion
Generative AI is generating much enthusiasm about potentially advancing biological design in computational biology. In this paper we take a somewhat contrarian view, arguing that a broader and deeper understanding of existing biological sequences is essential before undertaking the design of novel ones. We draw attention, for instance, to current protein function prediction methods, which face significant limitations due to incomplete data and inherent challenges in defining and measuring function. We propose a “blue sky” vision centered on both comprehensive and precise annotation of existing protein and DNA sequences, aiming to develop a more complete and precise understanding of biological function. By contrasting recent studies that leverage generative AI for biological design with the pressing need for enhanced data annotation, we underscore the importance of prioritizing robust predictive models over premature generative efforts. We advocate for a strategic shift toward thorough sequence annotation and predictive understanding, laying a solid foundation for future advances in biological design.
- Award ID(s):
- 2310113
- PAR ID:
- 10612837
- Publisher / Repository:
- Society for Industrial and Applied Mathematics
- Date Published:
- ISBN:
- 978-1-61197-852-0
- Page Range / eLocation ID:
- 31 to 40
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract Protein‐ligand binding is a fundamental biological process that is paramount to many other biological processes, such as signal transduction, metabolic pathways, enzyme construction, cell secretion, and gene expression. Accurate prediction of protein‐ligand binding affinities is vital to rational drug design and the understanding of protein‐ligand binding and binding-induced function. Existing binding affinity prediction methods are inundated with geometric detail and involve excessively high dimensions, which undermines their predictive power for massive binding data. Topology provides the ultimate level of abstraction and thus incurs too much reduction in geometric information. Persistent homology embeds geometric information into topological invariants and bridges the gap between complex geometry and abstract topology. However, it oversimplifies biological information. This work introduces element-specific persistent homology (ESPH), or multicomponent persistent homology, to retain crucial biological information during topological simplification. The combination of ESPH and machine learning gives rise to a powerful paradigm for macromolecular analysis. Tests on two large data sets indicate that the proposed topology-based machine-learning paradigm outperforms other existing methods in protein‐ligand binding affinity prediction. ESPH reveals protein‐ligand binding mechanisms that cannot be attained from other conventional techniques. The present approach reveals that protein‐ligand hydrophobic interactions extend to 40 Å away from the binding site, which has significant ramifications for drug and protein design.
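The core idea of ESPH is to compute persistence separately for point clouds restricted to chosen element-type pairs (e.g., protein carbons against ligand oxygens). As an illustrative sketch using only 0-dimensional persistence, whose bars correspond to component merge distances along a minimum spanning tree (the coordinates below are toy values, not real atomic positions):

```python
import numpy as np
from itertools import combinations

def h0_persistence(points):
    """0-dimensional persistence of a Euclidean point cloud via a
    Kruskal-style minimum spanning tree: each MST edge weight is the
    'death' of one connected component born at filtration value 0."""
    n = len(points)
    edges = sorted(
        (np.linalg.norm(points[i] - points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    parent = list(range(n))
    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(float(w))
    return deaths                      # n-1 merge events; one component persists

# Element-specific restriction: one atom-type pair per point cloud
protein_c = np.array([[0.0, 0.0], [1.0, 0.0]])   # hypothetical protein carbons
ligand_o  = np.array([[0.0, 2.0]])               # hypothetical ligand oxygen
bars = h0_persistence(np.vstack([protein_c, ligand_o]))
print(bars)   # [1.0, 2.0]
```

In an ESPH pipeline, barcodes from many such element-pair clouds (and higher homology dimensions, usually computed with a library such as GUDHI) are vectorized into features for the downstream machine-learning model.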
-
It is critical for biological studies to annotate amino acid sequences and understand how proteins function. Protein function is important to medical research in the health industry (e.g., drug discovery). With the advancement of deep learning, accurate protein annotation models have been developed for alignment-free protein annotation. In this paper, we develop a deep learning model with an attention mechanism that can predict Gene Ontology labels given a protein sequence input. We believe this model can produce accurate predictions while maintaining good interpretability. We further show how the model can be interpreted by examining and visualizing the intermediate layer output of our deep neural network.
-
Abstract Motivation: As fewer than 1% of proteins have protein function information determined experimentally, computationally predicting the function of proteins is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made in protein function prediction by the community in the last decade, the general accuracy of protein function prediction is still not high, particularly for rare function terms associated with few proteins in protein function annotation databases such as UniProt. Results: We introduce TransFew, a new transformer model, to learn the representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences, and uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definitions and hierarchical relationships, which are combined to predict protein function via cross-attention. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy, but also delivers robust performance on rare function terms with limited annotations by facilitating annotation transfer between GO terms. Availability and implementation: https://github.com/BioinfoMachineLearning/TransFew.
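The cross-attention step described above lets a protein representation attend over GO-term embeddings. TransFew's actual architecture is in its repository; the following is only a minimal single-head sketch with invented dimensions (64-dimensional embeddings, 100 GO terms) and random placeholder vectors in place of real ESM2/BioBert outputs:

```python
import numpy as np

def cross_attention(query, keys, values):
    """Scaled dot-product cross-attention: the protein representation
    (query) attends over the GO-term embeddings (keys/values)."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)     # (1, n_terms) similarity scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # attention-weighted label context

rng = np.random.default_rng(0)
protein_repr = rng.normal(size=(1, 64))     # stand-in for a pooled ESM2 embedding
go_embed     = rng.normal(size=(100, 64))   # stand-in for BioBert/GCN GO embeddings
fused = cross_attention(protein_repr, go_embed, go_embed)
print(fused.shape)   # (1, 64)
```

Because the attention weights are shared across the GO vocabulary, terms with similar semantic embeddings receive similar weight, which is one way annotation signal can transfer from common to rare terms.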
-
Seppänen, O; Koskela, L; Murata, K (Ed.)Generative Artificial Intelligence (AI), specifically Large Language Models (LLMs), has the potential to reshape construction management practices. These novel solutions can transform lean construction by enabling real-time data analysis, streamlined communication, and automated decision-making across project teams. They can facilitate enhanced collaboration by generating insights from vast construction data sets, improving workflow efficiency, and reducing waste. Additionally, LLMs can support predictive modeling, proactive risk management, and knowledge sharing, aligning with lean principles of maximizing value and minimizing inefficiencies. Given the recent advancements in generative AI, it is critical to systematically shape future research directions by building on the existing body of knowledge and addressing key knowledge gaps. The first step toward identifying those gaps and uncovering critical areas that remain underexplored is to systematically analyze the existing body of knowledge. Therefore, this study conducts a scoping review to synthesize the extent, range, and nature of existing studies that have proposed novel solutions using generative AI and LLMs for various aspects of construction management. The outcomes of this systematic scoping review will help identify potential research directions for future studies in this domain.