This content will become publicly available on June 17, 2026

Title: Cross-species modeling of plant genomes at single-nucleotide resolution using a pretrained DNA language model
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning on limited labeled data. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeled Arabidopsis data for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged 160 Mya. The model outperformed the best existing DNA language model by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparable to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies compared to those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Award ID(s):
2145577
PAR ID:
10618170
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Publisher / Repository:
PNAS
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
122
Issue:
24
ISSN:
0027-8424
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Adaptation to novel environments requires genetic variation, but whether adaptation typically acts upon preexisting genetic variation or must wait for new mutations remains a fundamental question in evolutionary biology. Selection during domestication has been long used as a model to understand evolutionary processes, providing information not only on the phenotypes selected but also, in many cases, an understanding of the causal loci. For each of the causal loci that have been identified in maize, the selected allele can be found segregating in natural populations, consistent with their origin as standing genetic variation. The sole exception to this pattern is the well-characterized domestication locus tga1 (teosinte glume architecture1), which has long been thought to be an example of selection on a de novo mutation. Here, we use a large dataset of maize and teosinte genomes to reconstruct the origin and evolutionary history of tga1. We first estimated the age of tga1-maize using a genealogy-based method, finding that the allele arose approximately 42,000 to 49,000 y ago, predating the beginning of maize domestication. We also identify tga1-maize in teosinte populations, indicating that the allele can survive in the wild. Finally, we compare observed patterns of haplotype structure and mutational age distributions near tga1 with simulations, finding that patterns near tga1 in maize better resemble those generated under simulated selective sweeps on standing variation. These multiple lines of evidence suggest that maize domestication likely drew upon standing genetic variation at tga1 and cement the importance of standing variation in driving adaptation during domestication.
  2. The evolutionary histories of many polyploid plant species are difficult to resolve due to a complex interplay of hybridization, incomplete lineage sorting, and missing diploid progenitors. In the case of octoploid strawberry with four subgenomes designated ABCD, the identities of the diploid progenitors for subgenomes C and D have been subject to much debate. By integrating new sequencing data from North American diploids with reticulate phylogeny and admixture analyses, we uncovered introgression from an extinct or unsampled species in the clade of Fragaria viridis, Fragaria nipponica, and Fragaria nilgerrensis into the donor of subgenome A of octoploid Fragaria prior to its divergence from F. vesca subsp. bracteata. We also detected an introgression event from F. iinumae into an ancestor of F. nipponica and F. nilgerrensis. Using an LTR-age-distribution-based approach, we estimate that the octoploid and its intermediate hexaploid and tetraploid ancestors emerged approximately 0.8, 2, and 3 million years ago, respectively. These results provide an explanation for previous reports of F. viridis and F. nipponica as donors of the C and D subgenomes and suggest a greater role than previously thought for homoploid hybridization in the diploid progenitors of octoploid strawberry. The integrated set of approaches used here can help advance polyploid genome analysis in other species where hybridization and incomplete lineage sorting obscure evolutionary relationships.
  3. Komeili, Arash (Ed.)
    ABSTRACT Multipartite bacterial genome organization can confer advantages, including coordinated gene regulation and faster genome replication, but is challenging to maintain. Agrobacterium tumefaciens lineages often contain a circular chromosome (Ch1), a linear chromosome (Ch2), and multiple plasmids. We previously observed that in some stocks of the C58 lab model, Ch1 and Ch2 were fused into a linear dicentric chromosome. Here we analyzed Agrobacterium natural isolates from the French Collection for Plant-Associated Bacteria and identified two strains distinct from C58 with fused chromosomes. Chromosome conformation capture identified integration junctions that were different from the C58 fusion strain. Genome-wide DNA replication profiling showed that both replication origins remained active. Transposon sequencing revealed that partitioning systems of both chromosome centromeres were essential. Importantly, the site-specific recombinase XerCD is required for the survival of the strains containing the fusion chromosome. Our findings show that replicon fusion occurs in natural environments and that balanced replication arm sizes and proper resolution systems enable the survival of such strains. IMPORTANCE Most bacterial genomes are monopartite with a single, circular chromosome. However, some species, like Agrobacterium tumefaciens, carry multiple chromosomes. Emergence of multipartite genomes is often related to adaptation to specific niches, including pathogenesis or symbiosis. Multipartite genomes confer certain advantages; however, maintaining this complex structure can present significant challenges. We previously reported a laboratory-propagated lineage of A. tumefaciens strain C58 in which the circular and linear chromosomes fused to form a single dicentric chromosome. Here we discovered two geographically separated environmental isolates of A. tumefaciens containing fused chromosomes with integration junctions different from the C58 fusion chromosome, revealing the constraints and diversification of this process. We found that balanced replication arm sizes and the repurposing of multimer resolution systems enable the survival and stable maintenance of dicentric chromosomes. These findings reveal how multipartite genomes function across different bacterial species and the role of genomic plasticity in bacterial genetic diversification.
  4. Many parasitic insects, including lice, form close relationships with endosymbiotic bacteria that are crucial for their survival. In this study, we used genomic sequencing to investigate the distribution and evolutionary history of the bacterial genus Sodalis across a broad range of feather louse species spanning 140 genera. Phylogenomic analysis revealed significant diversity among Sodalis lineages in feather lice and robust evidence for their independent and repeated acquisition by different louse clades throughout their radiation. Among the 1020 louse genomes analysed, at least 22% contained Sodalis, distributed across 57 louse genera. Cophylogenetic analyses between the Sodalis and feather louse phylogenies indicated considerable mismatch. This phylogenetic incongruence between lice and Sodalis, along with the presence of distantly related Sodalis lineages in otherwise closely related louse species, strongly indicates repeated independent acquisition of this endosymbiont. Additionally, evidence of cospeciation among a few closely related louse species, coupled with frequent acquisition of these endosymbionts from free-living bacteria, further highlights the diverse evolutionary processes shaping Sodalis endosymbiosis in feather lice.
  5. Ciliates are a model lineage for studies of genome architecture given their unusual genome structures. All ciliates have both somatic macronuclei (MAC) and germline micronuclei (MIC), both of which develop from a zygotic nucleus following sex (i.e., conjugation). Nuclear developmental stages are not well documented among non-model ciliates, including Chilodonella uncinata (class Phyllopharyngea), the focus of our work. Here, we characterize nuclear architecture and genome dynamics in C. uncinata by combining 4′,6-diamidino-2-phenylindole (DAPI) staining and fluorescence in situ hybridization (FISH) techniques with confocal microscopy. We developed a telomere probe for staining, which alongside DAPI allows for the identification of fragmented somatic chromosomes among the total DNA in the nuclei. We quantify both total DNA and telomere-bound signals from more than 250 nuclei sampled from 116 individual cells, and analyze changes in DNA content and nuclear architecture across Chilodonella’s nuclear life cycle. Specifically, we find that MAC developmental stages in the ciliate C. uncinata are different from those reported from other ciliate species. These data provide insights into nuclear dynamics during development and enrich our understanding of genome evolution in non-model ciliates. IMPORTANCE Ciliates are a clade of diverse single-celled eukaryotic microorganisms that contain at least one somatic macronucleus (MAC) and germline micronucleus (MIC) within each cell/organism. Ciliates rely on complex genome rearrangements to generate somatic genomes from a zygotic nucleus. However, the development of somatic nuclei has only been documented for a few model ciliate genera, including Paramecium, Tetrahymena, and Oxytricha. Here, we study the MAC developmental process in the non-model ciliate, C. uncinata. We analyze both total DNA and the generation of gene-sized somatic chromosomes using a laser scanning confocal microscope to describe C. uncinata’s nuclear life cycle. We show that DNA content changes dramatically during their life cycle and in a manner that differs from previous studies on model ciliates. Our study expands knowledge of genome dynamics in ciliates and among eukaryotes more broadly.