skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on March 14, 2026

Title: seqLens : optimizing language models for genomic predictions
Understanding genomic sequences through the lens of language modeling has the potential to revolutionize biological research, yet challenges in tokenization, model architecture, and adaptation to diverse genomic contexts remain. In this study, we investigated key innovations in DNA sequence modeling, treating DNA as a language and applying language models to genomic data. We gathered two diverse pretraining datasets: one consisting of 19,551 reference genomes, including over 18,000 prokaryotic genomes (115B nucleotides), and another more balanced dataset with 1,354 genomes, including 1,166 prokaryotic and 188 eukaryotic reference genomes (180B nucleotides). We trained five byte-pair encoding tokenizers and pretrained 52 DNA language models, systematically comparing different architectures, hyperparameters, and classification heads. We introduceseqLens, a family of models based on disentangled attention with relative positional encoding, which outperforms state-of-the-art models in 13 of 19 benchmarking phenotypic predictions. We further explore continual pretraining, domain adaptation, and parameter-efficient fine-tuning methods to assess trade-offs between computational efficiency and accuracy. Our findings demonstrate that relevant pretraining data significantly boosts performance, alternative pooling techniques enhance classification, and larger tokenizers negatively impact generalization. These insights provide a foundation for optimizing DNA language models and improving genome annotations.  more » « less
Award ID(s):
2109688
PAR ID:
10627421
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
bioRxiv
Date Published:
Format(s):
Medium: X
Institution:
bioRxiv
Sponsoring Org:
National Science Foundation
More Like this
  1. Alanio, Alexandre (Ed.)
    ABSTRACT Modern taxonomic classification is often based on phylogenetic analyses of a few molecular markers, although single-gene studies are still common. Here, we leverage genome-scale molecular phylogenetics (phylogenomics) of species and populations to reconstruct evolutionary relationships in a dense data set of 710 fungal genomes from the biomedically and technologically important genusAspergillus. To do so, we generated a novel set of 1,362 high-quality molecular markers specific forAspergillusand provided profile Hidden Markov Models for each, facilitating their use by others. Examining the resulting phylogeny helped resolve ongoing taxonomic controversies, identified new ones, and revealed extensive strain misidentification (7.59% of strains were previously misidentified), underscoring the importance of population-level sampling in species classification. These findings were corroborated using the current standard, taxonomically informative loci. These findings suggest that phylogenomics of species and populations can facilitate accurate taxonomic classifications and reconstructions of the Tree of Life.IMPORTANCEIdentification of fungal species relies on the use of molecular markers. Advances in genomic technologies have made it possible to sequence the genome of any fungal strain, making it possible to use genomic data for the accurate assignment of strains to fungal species (and for the discovery of new ones). We examined the usefulness and current limitations of genomic data using a large data set of 710 publicly available genomes from multiple strains and species of the biomedically, agriculturally, and industrially important genusAspergillus. Our evolutionary genomic analyses revealed that nearly 8% of publicly availableAspergillusgenomes are misidentified. Our work highlights the usefulness of genomic data for fungal systematic biology and suggests that systematic genome sequencing of multiple strains, including reference strains (e.g., type strains), of fungal species will be required to reduce misidentification errors in public databases. 
    more » « less
  2. Malware poses an increasing threat to critical computing infrastructure, driving demand for more advanced detection and analysis methods. Although raw-binary malware classifiers show promise, they are limited in their capabilities and struggle with the challenges of modeling long sequences. Meanwhile, the rise of large language models (LLMs) in natural language processing showcases the power of massive, self-supervised models trained on heterogeneous datasets, offering flexible representations for numerous downstream tasks. The success behind these models is rooted in the size and quality of their training data, the expressiveness and scalability of their neural architecture, and their ability to learn from unlabeled data in a self-supervised manner. In this work, we take the first steps toward developing large malware language models (LMLMs), the malware analog to LLMs. We tackle the core aspects of this objective, namely, questions about data, models, pretraining, and finetuning. By pretraining a malware classification model with language modeling objectives, we were able to improve downstream performance on diverse practical malware classification tasks on average by 1.1% and up to 28.6%, indicating that these models could serve to succeed raw-binary malware classifiers. 
    more » « less
  3. Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pretrained on large-scale biological sequences can capture evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning limited labeled data. We introduce PlantCaduceus, a plant DNA LM that learns evolutionary conservation patterns in 16 angiosperm genomes by modeling both DNA strands simultaneously. When fine-tuned on a small set of labeledArabidopsisdata for tasks such as predicting translation initiation/termination sites and splice donor/acceptor sites, PlantCaduceus demonstrated remarkable transferability to maize, which diverged 160 Mya. The model outperformed the best existing DNA language model by 1.45-fold in maize splice donor prediction and 7.23-fold in maize translation initiation site prediction. In variant effect prediction, PlantCaduceus showed performance comparative to state-of-the-art protein LMs. Mutations predicted to be deleterious by PlantCaduceus showed threefold lower average minor allele frequencies compared to those identified by multiple sequence alignment-based methods. Additionally, PlantCaduceus successfully identifies well-known causal variants in bothArabidopsisand maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications. 
    more » « less
  4. Abstract Viruses of the phylumNucleocytoviricota, often referred to as “giant viruses,” are prevalent in various environments around the globe and play significant roles in shaping eukaryotic diversity and activities in global ecosystems. Given the extensive phylogenetic diversity within this viral group and the highly complex composition of their genomes, taxonomic classification of giant viruses, particularly incomplete metagenome-assembled genomes (MAGs) can present a considerable challenge. Here we developed TIGTOG (TaxonomicInformation ofGiant viruses usingTrademarkOrthologousGroups), a machine learning-based approach to predict the taxonomic classification of novel giant virus MAGs based on profiles of protein family content. We applied a random forest algorithm to a training set of 1531 quality-checked, phylogenetically diverseNucleocytoviricotagenomes using pre-selected sets of giant virus orthologous groups (GVOGs). The classification models were predictive of viral taxonomic assignments with a cross-validation accuracy of 99.6% at the order level and 97.3% at the family level. We found that no individual GVOGs or genome features significantly influenced the algorithm’s performance or the models’ predictions, indicating that classification predictions were based on a comprehensive genomic signature, which reduced the necessity of a fixed set of marker genes for taxonomic assigning purposes. Our classification models were validated with an independent test set of 823 giant virus genomes with varied genomic completeness and taxonomy and demonstrated an accuracy of 98.6% and 95.9% at the order and family level, respectively. Our results indicate that protein family profiles can be used to accurately classify large DNA viruses at different taxonomic levels and provide a fast and accurate method for the classification of giant viruses. This approach could easily be adapted to other viral groups. 
    more » « less
  5. Abstract Rice is a vital staple crop, sustaining over half of the global population, and is a key model for genetic research. To support the growing need for comprehensive and accessible rice genomic data, GrameneOryza (https://oryza.gramene.org) was developed as an online resource adhering to FAIR (Findable, Accessible, Interoperable, and Reusable) principles of data management. It distinguishes itself through its comprehensive multispecies focus, encompassing a wide variety of Oryza genomes and related species, and its integration with FAIR principles to ensure data accessibility and usability. It offers a community curated selection of high-quality Oryza genomes, genetic variation, gene function, and trait data. The latest release, version 8, includes 28 Oryza genomes, covering wild rice and domesticated cultivars. These genomes, along with Leersia perrieri and seven additional outgroup species, form the basis for 38 K protein-coding gene family trees, essential for identifying orthologs, paralogs, and developing pan-gene sets. GrameneOryza’s genetic variation data features 66 million single-nucleotide variants (SNVs) anchored to the Os-Nipponbare-Reference-IRGSP-1.0 genome, derived from various studies, including the Rice Genome 3 K (RG3K) project. The RG3K sequence reads were also mapped to seven additional platinum-quality Asian rice genomes, resulting in 19 million SNVs for each genome, significantly expanding the coverage of genetic variation beyond the Nipponbare reference. Of the 66 million SNVs on IRGSP-1.0, 27 million acquired standardized reference SNP cluster identifiers (rsIDs) from the European Variation Archive release v5. Additionally, 1200 distinct phenotypes provide a comprehensive overview of quantitative trait loci (QTL) features. The newly introduced Oryza CLIMtools portal offers insights into environmental impacts on genome adaptation. The platform’s integrated search interface, along with a BLAST server and curation tools, facilitates user access to genomic, phylogenetic, gene function, and QTL data, supporting broad research applications. Database URL: https://oryza.gramene.org 
    more » « less