Title: A haplotype-resolved, chromosome-scale genome assembly and annotation for Carya glabra (pignut hickory; Juglandaceae)
Abstract Carya glabra (2n = 4x = 64), also known as pignut hickory, is a widely distributed species in the walnut family (Juglandaceae). Native to the central and eastern United States and southeastern Canada, C. glabra plays an important ecological role as a common upland forest species; it is closely related to several economically valuable nut trees, including C. illinoinensis (pecan). A deeper understanding of the genetics of C. glabra is essential for studying its evolutionary history and biology, with potential implications for agricultural improvement of pecan. Here, we present the first nuclear genome assembly and annotation of C. glabra. The assembly is chromosome-level and phased, representing the first assembled polyploid genome in the genus Carya. A total of 64 pseudochromosomes were assembled and phased into four haplotypes. The haplotype A assembly spans 600.4 Mb, comprises 55.0% repetitive sequences, and contains 30,947 protein-coding genes, with a BUSCO completeness score of 97.7%. Functional annotation assigned 94.3% of haplotype A genes to gene families, and 79.7% and 86.3% of genes were annotated with Gene Ontology terms and protein domains, respectively; 635 putative plant disease resistance genes were found in haplotype A. The other three haplotypes exhibited similarly high-quality annotation metrics. Our genomic analyses also suggest that C. glabra is an autotetraploid. Comparative genomic analyses revealed high collinearity among the four haplotypes of C. glabra and the published genomes of three other Carya species, although structural variation among the genomes of these species was identified. In addition, we provide an improved chloroplast genome assembly and the first mitochondrial genome for C. glabra. Importantly, most members of the research team are undergraduate students; the sequenced individual is located in McCarty Woods, a Conservation Area on the University of Florida campus. This work highlights the value of genome assembly efforts as powerful tools for teaching genomics and supporting conservation initiatives. This first high-quality reference genome for C. glabra provides a valuable resource for studying Carya, a genus of significant ecological and economic importance.
Article summary Carya glabra (pignut hickory) is a common upland forest species in North America. This species is a member of the walnut family (Juglandaceae), which includes many economically important nut trees. Here, we present the first nuclear genome assembly and annotation of C. glabra. The assembly is chromosome-level and phased. The haplotype A assembly contains 30,947 protein-coding genes, with a BUSCO completeness score of 97.7%. Our genomic analyses suggest that C. glabra is an autopolyploid. We also provide chloroplast and mitochondrial genome assemblies. This nuclear genome provides a valuable resource for studying Carya, a genus of significant ecological and economic importance.
Award ID(s):
1923234
PAR ID:
10662355
Publisher / Repository:
bioRxiv
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ingvarsson, P (Ed.)
    Abstract Eucalyptus grandis is a hardwood tree used worldwide as pure species or hybrid partner to breed fast-growing plantation forestry crops that serve as feedstocks of timber and lignocellulosic biomass for pulp, paper, biomaterials, and biorefinery products. The current v2.0 genome reference for the species served as the first reference for the genus and has helped drive the development of molecular breeding tools for eucalypts. Using PacBio HiFi long reads and Omni-C proximity ligation sequencing, we produced an improved, haplotype-phased assembly (v4.0) for TAG0014, an early-generation selection of E. grandis. The 2 haplotypes are 571 Mbp (HAP1) and 552 Mbp (HAP2) in size and consist of 37 and 46 contigs scaffolded onto 11 chromosomes (contig N50 of 28.9 and 16.7 Mbp), respectively. These haplotype assemblies are 70–90 Mbp smaller than the diploid v2.0 assembly but capture all except one of the 22 telomeres, suggesting that substantial redundant sequence was included in the previous assembly. A total of 35,929 (HAP1) and 35,583 (HAP2) gene models were annotated, of which 438 and 472 contain long introns (>10 kbp) in gene models previously (v2.0) identified as multiple smaller genes. These and other improvements have increased gene annotation completeness levels from 93.8 to 99.4% in the v4.0 assembly. We found that 6,493 and 6,346 genes are within tandem duplicate arrays (HAP1 and HAP2, respectively, 18.4 and 17.8% of the total) and >43.8% of the haplotype assemblies consists of repeat elements. Analysis of synteny between the haplotypes and the E. grandis v2.0 reference genome revealed extensive regions of collinearity, but also some major rearrangements, and provided a preview of population and pangenome variation in the species. 
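The contig N50 values quoted in this abstract are a length-weighted summary of contiguity: the N50 is the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the assembly size. The following is a minimal sketch of that calculation, not the authors' pipeline; the function name `n50` and the toy lengths are hypothetical and chosen only for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the cumulative sum first reaches half the total.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (not taken from any assembly above).
toy_lengths = [40, 30, 20, 10]
print(n50(toy_lengths))  # total is 100; 40 + 30 = 70 >= 50, so N50 is 30
```

Because the statistic depends on the total assembly size, N50 values are most comparable between assemblies of similar total length, such as the two haplotypes reported here.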
  2. McIntyre, L (Ed.)
    Abstract Genome sequencing for agriculturally important Rosaceous crops has made rapid progress both in completeness and annotation quality. Whole genome sequence and annotation give breeders, researchers, and growers information about cultivar-specific traits such as fruit quality and disease resistance, and inform strategies to enhance postharvest storage. Here we present a haplotype-phased, chromosomal-level genome of Malus domestica, ‘WA 38’, a new apple cultivar released to market in 2017 as Cosmic Crisp®. Using both short and long-read sequencing data with a k-mer-based approach, chromosomes originating from each parent were assembled and segregated. This is the first pome fruit genome fully phased into parental haplotypes in which chromosomes from each parent are identified and separated into their unique, respective haplomes. The two haplome assemblies, ‘Honeycrisp’ originated HapA and ‘Enterprise’ originated HapB, are about 650 Megabases each, and both have a BUSCO score of 98.7% complete. A total of 53,028 and 54,235 genes were annotated from HapA and HapB, respectively. Additionally, we provide genome-scale comparisons to ‘Gala’, ‘Honeycrisp’, and other relevant cultivars highlighting major differences in genome structure and gene family circumscription. This assembly and annotation was done in collaboration with the American Campus Tree Genomes project that includes ‘WA 38’ (Washington State University), ‘d’Anjou’ pear (Auburn University), and many more. To ensure transparency, reproducibility, and applicability for any genome project, our genome assembly and annotation workflow is recorded in detail and shared under a public GitLab repository. All software is containerized, offering a simple implementation of the workflow. 
  3. Ingvarsson, Pär (Ed.)
    Abstract Cultivated pear consists of several Pyrus species with P. communis (European pear) representing a large fraction of worldwide production. As a relatively recently domesticated crop and perennial tree, pear can benefit from genome-assisted breeding. Additionally, comparative genomics within Rosaceae promises greater understanding of evolution within this economically important family. Here, we generate a fully-phased chromosome-scale genome assembly of P. communis ‘d’Anjou’. Using PacBio HiFi and Dovetail Omni-C reads, the genome is resolved into the expected 17 chromosomes, with each haplotype totalling nearly 540 Megabases and a contig N50 of nearly 14 Mb. Both haplotypes are highly syntenic to each other, and to the Malus domestica ‘Honeycrisp’ apple genome. Nearly 45,000 genes were annotated in each haplotype, over 90% of which have direct RNA-seq expression evidence. We detect signatures of the known whole-genome duplication shared between apple and pear, and we estimate 57% of d’Anjou genes are retained in duplicate derived from this event. This genome highlights the value of generating phased diploid assemblies for recovering the full allelic complement in highly heterozygous crop species. 
  4. Citation: Snead, A.A., Meng, F., Largotta, N. et al. Diploid chromosome-level genome assembly and annotation for Lycorma delicatula. Sci Data 12, 579 (2025). https://doi.org/10.1038/s41597-025-04854-8
    Abstract The spotted lanternfly (Lycorma delicatula) is a planthopper species (Hemiptera: Fulgoridae) native to China but invasive in South Korea, Japan, and the United States, where it is a significant threat to agriculture. Hence, genomic resources are critical both to management and to understanding the genomic characteristics of successful invaders. Here, we report a haplotype-phased genome assembly and annotation using PacBio long-read sequencing, Hi-C technology, and RNA-seq data. The 2.2 Gbp genome comprises 13 chromosomes, and our whole genome sequencing of eighty-two adults indicated chromosome four as the sex chromosome and an XO sex-determination system. We identified over 12,000 protein-coding genes and performed functional annotation, facilitating identification of several candidate genes which may hold importance for spotted lanternfly control. Both the assemblies and annotations were highly complete, with over 96% of BUSCO genes complete regardless of the database employed (i.e., Eukaryota, Arthropoda, Insecta). This reference-quality genome will serve as an important resource for both development and optimization of management practices for the spotted lanternfly and invasive genomics as a whole.
    Description of the data and file structure
    This dataset contains the haplotype-phased chromosome-level genome assembly of the spotted lanternfly (Lycorma delicatula) described and published in Snead & Meng et al. (in review). The genome assembly combined long-read data and Hi-C data (SRA31402152-SRA31402153) to assemble and scaffold each haplotype. The annotation uses RNA-seq data from 12 adults (SRA31411873-SRA31411894) to structurally annotate both haplotypes. Finally, whole-genome sequencing of 82 adult spotted lanternflies (bioproject PRJNA1136004), described in the metadata csv provided, was used to identify putative sex chromosomes. The dataset also includes GO results for each chromosome that are not explicitly described in the results of the manuscript.
    Files and variables
    File: SLF_Hap1.fasta
      Description: A fasta file of the assembled genome for the cleaned 13 chromosome haplotype 1 assembly.
    File: SLF_Hap2.fasta
      Description: A fasta file of the assembled genome for the cleaned 13 chromosome haplotype 2 assembly.
    File: SLF_Hap1_Repeats.gff
      Description: A gff file of the repeats annotated in the cleaned 13 chromosome haplotype 1 assembly.
    File: SLF_Hap2_Repeats.gff
      Description: A gff file of the repeats annotated in the cleaned 13 chromosome haplotype 2 assembly.
    File: SLF_Hap1.gff
      Description: A structural annotation of the 13 chromosome haplotype 1 assembly with functional annotations.
    File: SLF_Hap2.gff
      Description: A structural annotation of the 13 chromosome haplotype 2 assembly with functional annotations.
    File: GO_plot_chr_1.png
      Description: An image of the top 20 GO term results for chromosome 1.
    File: GO_plot_chr_2.png
      Description: An image of the top 20 GO term results for chromosome 2.
    File: GO_plot_chr_3.png
      Description: An image of the top 20 GO term results for chromosome 3.
    File: GO_plot_chr_8.png
      Description: An image of the top 20 GO term results for chromosome 8.
    File: GO_plot_chr_5.png
      Description: An image of the top 20 GO term results for chromosome 5.
    File: GO_plot_chr_4.png
      Description: An image of the top 20 GO term results for chromosome 4.
    File: GO_plot_chr_6.png
      Description: An image of the top 20 GO term results for chromosome 6.
    File: GO_plot_chr_7.png
      Description: An image of the top 20 GO term results for chromosome 7.
    File: GO_plot_chr_11.png
      Description: An image of the top 20 GO term results for chromosome 11.
    File: GO_plot_chr_9.png
      Description: An image of the top 20 GO term results for chromosome 9.
    File: GO_plot_chr_10.png
      Description: An image of the top 20 GO term results for chromosome 10.
    File: GO_plot_chr_12.png
      Description: An image of the top 20 GO term results for chromosome 12.
    File: GO_plot_chr_13.png
      Description: An image of the top 20 GO term results for chromosome 13.
    File: SLF_Samples_SRA.csv
      Description: A csv with the sequencing information, SRA numbers, and sexes of the adults used to identify the putative sex chromosome.
    File: SLF_RNAseq_Metadata.csv
      Description: A csv with the sequencing information, SRA numbers, and other metadata for the RNA-seq data used to annotate the genomes.
    Variables
      accession: The SRA accession number.
      study: The study.
      object_status: Whether the NCBI submission was new or not.
      bioproject_accession: The bioproject accession number.
      biosample_accession: The Biosample accession number.
      library_ID: The ID used to identify that genomic library.
      title: The title of the study (the bioproject).
      library_strategy: Specific sequencing technique used to prepare the library.
      library_source: The biological material used to create the sequencing library.
      library_selection: The library preparation method.
      library_layout: The arrangement of reads within the sequencing library.
      platform: The sequencing platform.
      instrument_model: The model of the sequencer.
      design_description: Description of the study design.
      filetype: Type of file.
      filename: First file.
      filename2: Second file.
      sex: The biological sex of the adult.
    Code/software
    The initial haplotype-phased scaffolded genome was assembled by Dovetail Genomics (Cantata Bio) with standard software outlined in the methods with default settings. Scripts for the remaining work, including annotation, gene ontology enrichment, and other analyses, are located in the GitHub repository (https://github.com/anthonysnead/SLF-Genome-Assembly).
    Access information
    Other publicly accessible locations of the data: The raw sequencing data and the annotated haplotype-phased genome assembly of Lycorma delicatula have been deposited at the National Center for Biotechnology Information (NCBI). The Hi-C and HiFi data can be found under SRA31402152 and SRA31402153. The RNA-seq data can be found under SRA31411873-SRA31411894, while the DNA-seq data can be found under bioproject PRJNA1136004.
  5. Abstract Phased genomes and pangenomes are enhancing our understanding of genetic variation. Accurate phasing and assembly in repetitive regions of the genome remain challenging, however. Addressing this obstacle is crucial for studying structural genomic variation, such as copy number variations (CNVs) common to repetitive regions. Polar fishes, for example, evolved repetitive tandem arrays of antifreeze protein (AFP) genes that facilitate adaptation to freezing and expanded in copy number in colder environments. AFP CNVs remain poorly characterized in polar fishes and may be illuminated by haplotype-aware approaches. We performed long-read sequencing for two polar fishes in the suborder Zoarcoidei and leveraged additional published long-read data to assemble phased genomes. We developed a workflow to measure haplotype diversity in CNV while controlling for misassembly and switch errors (a switch error is a change from one parental haplotype to another in a contiguous assembly). We present gfa_parser, which computes and extracts all possible contiguous sequences for phased or primary assemblies from graphical fragment assembly (GFA) files, and switch_error_screen, which flags potential switch errors. gfa_parser revealed that assembly uncertainty was ubiquitous across AFP array haplotypes and that standard processing of graphical fragment assemblies can bias measurement of haplotype CNVs. We detected no switch errors in AFP arrays. After controlling for misassembly and switch error, we detected haplotype diversity of AFP CNVs in all studied polar Zoarcoidei species and in 60% of AFP arrays. Intraindividual haplotype diversity spanned differences of 3–16 copies. Our workflow revealed intraspecific genomic diversity in zoarcoids that likely fueled the evolution of AFP copy number across temperature.
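The idea of extracting all possible contiguous sequences from a GFA file can be illustrated with a small sketch. This is not the gfa_parser implementation, whose internals are not described here. It assumes GFA 1 text with tab-separated `S` (segment) and `L` (link) records, keeps only forward-to-forward links, and ignores overlap trimming. The function names and the toy graph are hypothetical.

```python
from collections import defaultdict


def parse_gfa(text):
    """Read segments and forward-to-forward links from GFA 1 text."""
    segments = {}
    links = defaultdict(list)
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            # Simplifying assumption: reverse-strand links are skipped.
            if fields[2] == "+" and fields[4] == "+":
                links[fields[1]].append(fields[3])
    return segments, links


def enumerate_paths(segments, links):
    """List every walk from a segment with no incoming link to a dead end."""
    has_incoming = {target for targets in links.values() for target in targets}
    starts = [name for name in segments if name not in has_incoming]
    paths = []

    def walk(path):
        extended = False
        for nxt in links.get(path[-1], []):
            if nxt not in path:  # do not revisit a segment (guards cycles)
                extended = True
                walk(path + [nxt])
        if not extended:
            paths.append(path)

    for start in starts:
        walk([start])
    return paths


def path_sequence(path, segments):
    """Concatenate segment sequences along a path (overlaps not trimmed)."""
    return "".join(segments[name] for name in path)


# Hypothetical toy graph: one bubble (a -> b -> d and a -> c -> d).
TOY_GFA = "\n".join([
    "S\ta\tACG",
    "S\tb\tT",
    "S\tc\tG",
    "S\td\tCCA",
    "L\ta\t+\tb\t+\t0M",
    "L\ta\t+\tc\t+\t0M",
    "L\tb\t+\td\t+\t0M",
    "L\tc\t+\td\t+\t0M",
])

segments, links = parse_gfa(TOY_GFA)
for path in enumerate_paths(segments, links):
    print("->".join(path), path_sequence(path, segments))
```

In the toy graph the single bubble yields two alternative contiguous sequences. Reporting only one of them is the kind of choice that could hide assembly uncertainty in a repetitive array.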