Abstract Background: Crop improvement through cross-population genomic prediction and genome editing requires identification of causal variants at high resolution, within fewer than hundreds of base pairs. Most genetic mapping studies have generally lacked such resolution. In contrast, evolutionary approaches can detect genetic effects at high resolution, but they are limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Here we use genomic annotations to accurately predict nucleotide conservation across angiosperms, as a proxy for fitness effect of mutations. Results: Using only sequence analysis, we annotate nonsynonymous mutations in 25,824 maize gene models, with information from bioinformatics and deep learning. Our predictions are validated by experimental information: within-species conservation, chromatin accessibility, and gene expression. According to gene ontology and pathway enrichment analyses, predicted nucleotide conservation points to genes in central carbon metabolism. Importantly, it improves genomic prediction for fitness-related traits such as grain yield, in elite maize panels, by stringent prioritization of fewer than 1% of single-site variants. Conclusions: Our results suggest that predicting nucleotide conservation across angiosperms may effectively prioritize sites most likely to impact fitness-related traits in crops, without being limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Our approach, Prediction of mutation Impact by Calibrated Nucleotide Conservation (PICNC), could be useful to select polymorphisms for accurate genomic prediction, and candidate mutations for efficient base editing. The trained PICNC models and predicted nucleotide conservation at protein-coding SNPs in maize are publicly available in CyVerse (https://doi.org/10.25739/hybz-2957).
This content will become publicly available on November 23, 2026
Predicting Variant Fitness of SARS-COV-2 from Full Viral Genome Sequences
Accurate prediction of the transmission fitness of emerging SARS-CoV-2 variants is vital for timely public health responses. In this study, we present a deep learning framework that predicts variant fitness from raw genomic sequences using a convolutional neural network (CNN) trained to regress Differential Population Growth Rate (DPGR) values. Our approach achieves high predictive accuracy, with an R-squared value of 0.92 on genomic sequences sampled from the USA and Europe. To interpret the model’s predictions, we apply SHapley Additive exPlanations (SHAP) to identify nucleotide-level contributions to predicted fitness. Our analysis highlights key mutations in ORF9 (nucleocapsid), ORF2 (spike), ORF5 (membrane), and ORF8 that either enhance or reduce predicted DPGR. Notably, we identify amino acid–altering mutations such as D3L, E484K, N501Y, and V97I as strong positive contributors to fitness, while synonymous or non-coding mutations had more subtle or regulatory effects. These findings validate the potential of sequence-based modeling and interpretable AI to support early detection and prioritization of high-risk variants.
- PAR ID: 10649635
- Publisher / Repository: Proceedings of the 2025 AAAI Fall Symposium Series
- Date Published:
- Journal Name: Proceedings of the AAAI Symposium Series
- Volume: 7
- Issue: 1
- ISSN: 2994-4317
- Page Range / eLocation ID: 428 to 437
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY’s potential in solving challenging enzyme engineering problems beyond the reach of classic directed evolution.
-
Aye-ayes (Daubentonia madagascariensis) are one of the 25 most critically endangered primate species in the world. Endemic to Madagascar, their small and highly fragmented populations make them particularly vulnerable to both genetic disease and anthropogenic environmental changes. Over the past decade, conservation genomic efforts have largely focused on inferring and monitoring population structure based on single nucleotide variants to identify and protect critical areas of genetic diversity. However, the recent release of a highly contiguous genome assembly allows, for the first time, for the study of structural genomic variation (deletions, duplications, insertions, and inversions) which are likely to impact a substantial proportion of the species’ genome. Based on whole-genome, short-read sequencing data from 14 individuals, >1,000 high-confidence autosomal structural variants were detected, affecting ∼240 kb of the aye-aye genome. The majority of these variants (>85%) were deletions shorter than 200 bp, consistent with the notion that longer structural mutations are often associated with strongly deleterious fitness effects. For example, two deletions longer than 850 bp located within disease-linked genes were predicted to impose substantial fitness deficits owing to a resulting frameshift and gene fusion, respectively; whereas several other major effect variants outside of coding regions are likely to impact gene regulatory landscapes. Taken together, this first glimpse into the landscape of structural variation in aye-ayes will enable future opportunities to advance our understanding of the traits impacting the fitness of this endangered species, as well as allow for enhanced evolutionary comparisons across the full primate clade.
-
BACKGROUND: Genomic surveillance allows identification of circulating SARS-CoV-2 variants. We provide an update on the evolution of SARS-CoV-2 in Rhode Island (RI). METHODS: All publicly available SARS-CoV-2 RI sequences were retrieved from https://www.gisaid.org. Genomic analyses were conducted to identify variants of concern (VOC), variants being monitored (VBM), or non-VOC/non-VBM, and investigate their evolution. RESULTS: Overall, 17,340 SARS-CoV-2 RI sequences were available between 2/2020–5/2022 across five (globally recognized) major waves, including 1,462 (8%) sequences from 36 non-VOC/non-VBM until 5/2021; 10,565 (61%) sequences from 8 VBM between 5/2021–12/2021, most commonly Delta; and 5,313 (31%) sequences from the VOC Omicron from 12/2021 onwards. Genomic analyses demonstrated 71 Delta and 44 Omicron sub-lineages, with occurrence of variant-defining mutations in other variants. CONCLUSION: Statewide SARS-CoV-2 genomic surveillance allows for continued characterization of circulating variants and monitoring of viral evolution, which inform the local health force and guide public health on mitigation efforts against COVID-19. KEYWORDS: COVID-19, SARS-CoV-2, variants, genomic sequencing, Rhode Island
-
Knowledge of the distribution of fitness effects (DFE) of mutations is critical to the understanding of protein evolution. Here, we describe methods for large-scale, systematic measurements of the DFE using growth competition and deep mutational scanning. We discuss techniques for producing comprehensive libraries of gene variants as well as provide necessary considerations for designing these experiments. Using these methods, we have constructed libraries containing over 18,000 variants, measured fitness effects of these mutations by deep mutational scanning, and verified the presence of fitness effects in individual variants. Our methods provide a high-throughput protocol for measuring biological fitness effects of mutations and the dependence of fitness effects on the environment.
