Generative AI has generated much enthusiasm for its potential to advance biological design in computational biology. In this paper we take a somewhat contrarian view, arguing that a broader and deeper understanding of existing biological sequences is essential before undertaking the design of novel ones. We draw attention, for instance, to protein function prediction methods, which currently face significant limitations due to incomplete data and inherent challenges in defining and measuring function. We propose a "blue sky" vision centered on comprehensive and precise annotation of existing protein and DNA sequences, aiming to develop a more complete understanding of biological function. By contrasting recent studies that leverage generative AI for biological design with the pressing need for enhanced data annotation, we underscore the importance of prioritizing robust predictive models over premature generative efforts. We advocate a strategic shift toward thorough sequence annotation and predictive understanding, laying a solid foundation for future advances in biological design.
Foundation Models for AI-enabled Biological Design
This paper surveys foundation models for AI-enabled biological design, focusing on recent developments in applying large-scale, self-supervised models to tasks such as protein engineering, small-molecule design, and genomic sequence design. Although the domain is evolving rapidly, the survey presents and discusses a taxonomy of current models and methods, emphasizing challenges and solutions in adapting these models for biological applications, including biological sequence modeling architectures, controllability in generation, and multi-modal integration. The survey concludes with a discussion of open problems and future directions, offering concrete next steps to improve the quality of biological sequence generation.
- Award ID(s): 2310113
- PAR ID: 10612836
- Publisher / Repository: AAAI 2025 FMs4Bio Workshop
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Gaglia, Marta M. (Ed.) Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has evolved rapidly, undergoing complex genomic changes that challenge current approaches to classifying SARS-CoV-2 variants. Deep sequence learning could be a powerful way to build complex sequence-to-phenotype models. Unfortunately, while deep learning models can be predictive, they are typically "black boxes" that cannot directly provide biological and clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting deep sequence models. Finally, researchers should address important data limitations, including (i) global sequencing disparities, (ii) insufficient sequence metadata, and (iii) screening artifacts due to poor sequence quality control.
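The interpretation methods this abstract calls for can be illustrated with a minimal occlusion-style sketch: mask each position of a sequence and measure how much a black-box model's score drops. Everything below is an illustrative assumption; `score_sequence` is a toy stand-in, not a real SARS-CoV-2 model.

```python
# Hypothetical sketch of occlusion-based interpretation for a black-box
# sequence model. score_sequence is a toy scorer standing in for a
# trained deep model.

def score_sequence(seq: str) -> float:
    """Toy stand-in for a deep model's phenotype score."""
    # Pretend the motif "GAT" drives the prediction.
    return float(seq.count("GAT"))

def occlusion_importance(seq: str, mask: str = "N"):
    """Score drop when each position is masked, one at a time."""
    base = score_sequence(seq)
    importances = []
    for i in range(len(seq)):
        occluded = seq[:i] + mask + seq[i + 1:]
        importances.append(base - score_sequence(occluded))
    return importances

imp = occlusion_importance("AAGATCC")
# Positions inside the "GAT" motif get positive importance scores.
```

The same masking loop works for any scoring function, which is what makes occlusion a convenient first interpretability baseline before gradient-based attribution.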
- Computational design of functional proteins is of both fundamental and applied interest. This study introduces a generative framework for co-designing protein sequence and structure in a unified process by modeling their joint distribution, with the goal of enabling cross-modality interactions toward coherent and functional designs. Each residue is represented by three distinct modalities (type, position, and orientation) and modeled using dedicated diffusion processes: multinomial for types, Cartesian for positions, and special orthogonal group SO(3) for orientations. To couple these modalities, we propose a unified architecture, ReverseNet, which employs a shared graph attention encoder to integrate multimodal information and separate projectors to predict each modality. We benchmark our models, JointDiff and JointDiff-x, on unconditional monomer design and conditional motif scaffolding tasks. Compared to two-stage design models that generate sequence and structure separately, our models produce monomer structures with comparable or better designability, while currently lagging in sequence quality and motif scaffolding performance based on computational metrics. However, they are 1-2 orders of magnitude faster and support rapid iterative improvements through classifier-guided sampling. To complement computational evaluations, we experimentally validate our approach through a case study on green fluorescent protein (GFP) design. Several novel, evolutionarily distant variants generated by our models exhibit measurable fluorescence, confirming functional activity. These results demonstrate the feasibility of joint sequence-structure generation and establish a foundation to accelerate functional protein design in future applications. Codes, data, and trained models are accessible at https://github.com/Shen-Lab/JointDiff.
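As a rough illustration of the three-modality residue representation, the sketch below applies one forward (noising) diffusion step to a residue's type (multinomial chain) and position (Cartesian chain). All names and noise schedules here are assumptions for illustration; the SO(3) orientation chain and the actual ReverseNet denoiser are omitted entirely.

```python
# Hypothetical sketch of per-residue multimodal diffusion noising.
# Illustrative only: the real JointDiff model couples type, position,
# and SO(3) orientation chains; the orientation chain is omitted here.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def noise_type(aa: str, beta: float, rng: random.Random) -> str:
    """Multinomial diffusion step: resample the type with probability beta."""
    return rng.choice(AMINO_ACIDS) if rng.random() < beta else aa

def noise_position(xyz, beta: float, rng: random.Random):
    """Cartesian diffusion step: shrink toward the origin, add Gaussian noise."""
    keep = (1.0 - beta) ** 0.5
    return tuple(keep * c + rng.gauss(0.0, beta ** 0.5) for c in xyz)

rng = random.Random(0)
residue = {"type": "A", "pos": (1.0, 2.0, 3.0)}
noised = {
    "type": noise_type(residue["type"], beta=0.1, rng=rng),
    "pos": noise_position(residue["pos"], beta=0.1, rng=rng),
}
```

Running separate per-modality chains like this is what lets a shared encoder see a partially noised residue and predict each modality with its own projector head.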
- In recent years, sequence-to-sequence (seq2seq) models have gained great popularity and provide state-of-the-art performance in a wide variety of tasks, such as machine translation, headline generation, text summarization, speech-to-text conversion, and image caption generation. The underlying framework for all these models is usually a deep neural network comprising an encoder and a decoder. Although simple encoder–decoder models produce competitive results, many researchers have proposed additional improvements over these seq2seq models, e.g., using an attention-based model over the input, pointer-generation models, and self-attention models. However, such seq2seq models suffer from two common problems: 1) exposure bias and 2) inconsistency between training and test measurement. Recently, a completely novel point of view has emerged in addressing these two problems in seq2seq models, leveraging methods from reinforcement learning (RL). In this survey, we consider seq2seq problems from the RL point of view and provide a formulation that combines the decision-making power of RL methods with seq2seq models' ability to retain long-term memory. We present some of the most recent frameworks that combine concepts from RL and deep neural networks. Our work aims to provide insights into some of the problems that inherently arise with current approaches and how we can address them with better RL models. We also provide the source code for implementing most of the RL models discussed in this paper to support the complex task of abstractive text summarization and provide some targeted experiments for these RL models, both in terms of performance and training time.
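The RL formulation described here can be pictured with a minimal REINFORCE sketch: sample a sequence from the model, score it with a non-differentiable sequence-level reward, and weight the log-probability gradient by that reward. Because training uses the model's own samples, this sidesteps exposure bias. The toy "generator" below is a single categorical distribution over tokens, an assumption for brevity, not one of the surveyed neural architectures.

```python
# Hypothetical REINFORCE sketch for sequence-level reward training.
# The "policy" is one shared categorical distribution over a toy
# vocabulary; the reward is positional overlap with a target sequence.
import math
import random

VOCAB = ["a", "b", "c"]
TARGET = ["a", "b"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sample_seq(logits, length, rng):
    probs = softmax(logits)
    return [rng.choices(VOCAB, weights=probs)[0] for _ in range(length)]

def reward(seq):
    """Sequence-level reward: fraction of positions matching TARGET."""
    return sum(s == t for s, t in zip(seq, TARGET)) / len(TARGET)

def reinforce_step(logits, lr, rng):
    seq = sample_seq(logits, len(TARGET), rng)
    r = reward(seq)
    probs = softmax(logits)
    grad = [0.0] * len(VOCAB)
    for tok in seq:  # grad of log pi(tok) is one-hot(tok) - probs
        for i, v in enumerate(VOCAB):
            grad[i] += (1.0 if v == tok else 0.0) - probs[i]
    return [l + lr * r * g for l, g in zip(logits, grad)]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(1000):
    logits = reinforce_step(logits, lr=0.3, rng=rng)
# Tokens "a" and "b" are reinforced; "c" never earns reward and fades.
```

The same reward-weighted gradient is what lets a real seq2seq model train directly against ROUGE or BLEU, the very metrics it is tested on.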
- Aligning language models (LMs) with preferences is an important problem in natural language generation. A key challenge is that preferences are typically provided at the sequence level, while LM training and generation both occur at the token level. There is, therefore, a granularity mismatch between the preference and the LM training losses, which may complicate the learning problem. In this paper, we address this issue by developing an alternate training process, where we iterate between grounding the sequence-level preference into token-level training guidance, and improving the LM with the learned guidance. For guidance learning, we design a framework that extends the pairwise-preference learning in imitation learning to both variable-length LM generation and the utilization of the preference among multiple generations. For LM training, based on the amount of supervised data, we present two minimalist learning objectives that utilize the learned guidance. In experiments, our method performs competitively on two distinct representative LM tasks: discrete-prompt generation and text summarization. Source code is released at https://github.com/Shentao-YANG/Preference_Grounded_Guidance.
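One way to picture grounding a sequence-level preference into token-level guidance is a Bradley–Terry objective over sums of token scores: fitting pairwise preferences then distributes credit to individual tokens. The lookup-table scores and function names below are illustrative assumptions, not the paper's actual guidance model.

```python
# Hypothetical sketch: distribute a sequence-level pairwise preference
# into token-level scores with a Bradley-Terry objective. Token scores
# are a plain lookup table here, purely for illustration.
import math

def seq_score(seq, token_score):
    """Sequence score as the sum of its tokens' learned scores."""
    return sum(token_score[t] for t in seq)

def bt_step(preferred, rejected, token_score, lr=0.1):
    """One gradient step on -log sigmoid(score(pref) - score(rej))."""
    margin = seq_score(preferred, token_score) - seq_score(rejected, token_score)
    g = 1.0 / (1.0 + math.exp(margin))  # gradient magnitude on the margin
    for t in preferred:
        token_score[t] += lr * g
    for t in rejected:
        token_score[t] -= lr * g
    return token_score

token_score = {"good": 0.0, "bad": 0.0, "the": 0.0}
for _ in range(100):
    token_score = bt_step(["the", "good"], ["the", "bad"], token_score)
# "good" rises above "bad"; the shared token "the" cancels out at zero.
```

Tokens shared by both sequences receive no net credit, which is exactly the property that turns a coarse sequence-level preference into a per-token training signal.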