Title: A haplotype-resolved, chromosome-scale genome for Malus domestica Borkh. ‘WA 38’
Abstract Genome sequencing for agriculturally important Rosaceous crops has made rapid progress both in completeness and annotation quality. Whole genome sequence and annotation give breeders, researchers, and growers information about cultivar-specific traits such as fruit quality and disease resistance, and inform strategies to enhance postharvest storage. Here we present a haplotype-phased, chromosomal-level genome of Malus domestica ‘WA 38’, a new apple cultivar released to market in 2017 as Cosmic Crisp®. Using both short and long-read sequencing data with a k-mer-based approach, chromosomes originating from each parent were assembled and segregated. This is the first pome fruit genome fully phased into parental haplotypes in which chromosomes from each parent are identified and separated into their unique, respective haplomes. The two haplome assemblies, the ‘Honeycrisp’-derived HapA and the ‘Enterprise’-derived HapB, are about 650 Megabases each, and both have a BUSCO score of 98.7% complete. A total of 53,028 and 54,235 genes were annotated from HapA and HapB, respectively. Additionally, we provide genome-scale comparisons to ‘Gala’, ‘Honeycrisp’, and other relevant cultivars, highlighting major differences in genome structure and gene family circumscription. This assembly and annotation were done in collaboration with the American Campus Tree Genomes project, which includes ‘WA 38’ (Washington State University), ‘d’Anjou’ pear (Auburn University), and many more. To ensure transparency, reproducibility, and applicability for any genome project, our genome assembly and annotation workflow is recorded in detail and shared in a public GitLab repository. All software is containerized, offering a simple implementation of the workflow.
Award ID(s):
2239530
PAR ID:
10569397
Author(s) / Creator(s):
Editor(s):
McIntyre, L
Publisher / Repository:
Oxford
Date Published:
Journal Name:
G3: Genes, Genomes, Genetics
ISSN:
2160-1836
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Suncus etruscus is one of the world’s smallest mammals, with an average body mass of about 2 grams. The Etruscan shrew’s small body is accompanied by a very high energy demand and numerous metabolic adaptations. Here we report a chromosome-level genome assembly using PacBio long read sequencing, 10X Genomics linked short reads, optical mapping, and Hi-C linked reads. The assembly is partially phased, with the 2.472 Gbp primary pseudohaplotype and 1.515 Gbp alternate. We manually curated the primary assembly and identified 22 chromosomes, including X and Y sex chromosomes. The NCBI genome annotation pipeline identified 39,091 genes, 19,819 of them protein-coding. We also identified segmental duplications, inferred GO term annotations, and computed orthologs of human and mouse genes. This reference-quality genome will be an important resource for research on mammalian development, metabolism, and body size control.
  2. Citation: Snead, A.A., Meng, F., Largotta, N. et al. Diploid chromosome-level genome assembly and annotation for Lycorma delicatula. Sci Data 12, 579 (2025). https://doi.org/10.1038/s41597-025-04854-8
    Abstract: The spotted lanternfly (Lycorma delicatula) is a planthopper species (Hemiptera: Fulgoridae) native to China but invasive in South Korea, Japan, and the United States, where it is a significant threat to agriculture. Hence, genomic resources are critical both to management and to understanding the genomic characteristics of successful invaders. Here, we report a haplotype-phased genome assembly and annotation using PacBio long-read sequencing, Hi-C technology, and RNA-seq data. The 2.2 Gbp genome comprises 13 chromosomes, and our whole genome sequencing of eighty-two adults indicated chromosome four as the sex chromosome and an XO sex-determination system. We identified over 12,000 protein coding genes and performed functional annotation, facilitating identification of several candidate genes which may hold importance for spotted lanternfly control. Both the assemblies and annotations were highly complete, with over 96% of BUSCO genes complete regardless of the database employed (i.e., Eukaryota, Arthropoda, Insecta). This reference-quality genome will serve as an important resource for both development and optimization of management practices for the spotted lanternfly and invasive genomics as a whole.
    Description of the data and file structure: This dataset contains the haplotype-phased chromosome-level genome assembly of the spotted lanternfly (Lycorma delicatula) described and published in Snead & Meng et al. (in review). The assembly combined long-read data and Hi-C data (SRA31402152-SRA31402153) to assemble and scaffold each haplotype. The annotation uses RNA-seq data from 12 adults (SRA31411873-SRA31411894) to structurally annotate both haplotypes. Finally, whole-genome sequencing of 82 adult spotted lanternflies (bioproject PRJNA1136004), described in the metadata csv provided, was used to identify putative sex chromosomes. The dataset also includes GO results for each chromosome that are not explicitly described in the results of the manuscript.
    Files and variables
    File: SLF_Hap1.fasta. Description: A fasta file of the assembled genome for the cleaned 13 chromosome haplotype 1 assembly.
    File: SLF_Hap2.fasta. Description: A fasta file of the assembled genome for the cleaned 13 chromosome haplotype 2 assembly.
    File: SLF_Hap1_Repeats.gff. Description: A gff file of the repeats annotated in the cleaned 13 chromosome haplotype 1 assembly.
    File: SLF_Hap2_Repeats.gff. Description: A gff file of the repeats annotated in the cleaned 13 chromosome haplotype 2 assembly.
    File: SLF_Hap1.gff. Description: A structural annotation of the 13 chromosome haplotype 1 assembly with functional annotations.
    File: SLF_Hap2.gff. Description: A structural annotation of the 13 chromosome haplotype 2 assembly with functional annotations.
    File: GO_plot_chr_1.png. Description: An image of the top 20 GO term results for chromosome 1.
    File: GO_plot_chr_2.png. Description: An image of the top 20 GO term results for chromosome 2.
    File: GO_plot_chr_3.png. Description: An image of the top 20 GO term results for chromosome 3.
    File: GO_plot_chr_8.png. Description: An image of the top 20 GO term results for chromosome 8.
    File: GO_plot_chr_5.png. Description: An image of the top 20 GO term results for chromosome 5.
    File: GO_plot_chr_4.png. Description: An image of the top 20 GO term results for chromosome 4.
    File: GO_plot_chr_6.png. Description: An image of the top 20 GO term results for chromosome 6.
    File: GO_plot_chr_7.png. Description: An image of the top 20 GO term results for chromosome 7.
    File: GO_plot_chr_11.png. Description: An image of the top 20 GO term results for chromosome 11.
    File: GO_plot_chr_9.png. Description: An image of the top 20 GO term results for chromosome 9.
    File: GO_plot_chr_10.png. Description: An image of the top 20 GO term results for chromosome 10.
    File: GO_plot_chr_12.png. Description: An image of the top 20 GO term results for chromosome 12.
    File: GO_plot_chr_13.png. Description: An image of the top 20 GO term results for chromosome 13.
    File: SLF_Samples_SRA.csv. Description: A csv with the sequencing information, SRA numbers, and sexes of the adults used to identify the putative sex chromosome.
    File: SLF_RNAseq_Metadata.csv. Description: A csv with the sequencing information, SRA numbers, and other metadata for the RNA-seq data used to annotate the genomes.
    Variables
    accession: The SRA accession number.
    study: The study.
    object_status: Whether the NCBI submission was new or not.
    bioproject_accession: The bioproject accession number.
    biosample_accession: The Biosample accession number.
    library_ID: The ID used to identify that genomic library.
    title: The title of the study (the bioproject).
    library_strategy: Specific sequencing technique used to prepare the library.
    library_source: The biological material used to create the sequencing library.
    library_selection: The library preparation method.
    library_layout: The arrangement of reads within the sequencing library.
    platform: The sequencing platform.
    instrument_model: The model of the sequencer.
    design_description: Description of the study design.
    filetype: Type of file.
    filename: First file.
    filename2: Second file.
    sex: The biological sex of the adult.
    Code/software: The initial haplotype-phased scaffolded genome was assembled by Dovetail Genomics (Cantata Bio) with standard software outlined in the methods with default settings. Scripts for the remaining work, including annotation, gene ontology enrichment, and other analyses, are located in the GitHub repository (https://github.com/anthonysnead/SLF-Genome-Assembly).
    Access information: Other publicly accessible locations of the data: The raw sequencing data and the annotated haplotype-phased genome assembly of Lycorma delicatula have been deposited at the National Center for Biotechnology Information (NCBI). The Hi-C and HiFi data can be found under SRA31402152 and SRA31402153. The RNA-seq data can be found under SRA31411873-SRA31411894, while the DNA-seq data can be found under bioproject PRJNA1136004.
  3. de los Campos, G (Ed.)
    Abstract De novo genome assembly is essential for genomic research. High-quality genomes assembled into phased pseudomolecules are challenging to produce and often contain assembly errors because of repeats, heterozygosity, or the chosen assembly strategy. Although algorithms that produce partially phased assemblies exist, haploid draft assemblies that may lack biological information remain favored because they are easier to generate and use. We developed HaploSync, a suite of tools that produces fully phased, chromosome-scale diploid genome assemblies, and performs extensive quality control to limit assembly artifacts. HaploSync scaffolds sequences from a draft diploid assembly into phased pseudomolecules guided by a genetic map and/or the genome of a closely related species. HaploSync generates a report that visualizes the relationships between current and legacy sequences, for both haplotypes, and displays their gene and marker content. This quality control helps the user identify misassemblies and guides HaploSync’s correction of scaffolding errors. Finally, HaploSync fills assembly gaps with unplaced sequences and resolves collapsed homozygous regions. In a series of plant, fungal, and animal kingdom case studies, we demonstrate that HaploSync efficiently increases the assembly contiguity of phased chromosomes, improves completeness by filling gaps, corrects scaffolding, and correctly phases highly heterozygous, complex regions.
  4. The combination of ultra-long (UL) Oxford Nanopore Technologies (ONT) sequencing reads with long, accurate Pacific Biosciences (PacBio) High Fidelity (HiFi) reads has enabled the completion of a human genome and spurred similar efforts to complete the genomes of many other species. However, this approach for complete, “telomere-to-telomere” genome assembly relies on multiple sequencing platforms, limiting its accessibility. ONT “Duplex” sequencing reads, where both strands of the DNA are read to improve quality, promise high per-base accuracy. To evaluate this new data type, we generated ONT Duplex data for three widely studied genomes: human HG002, Solanum lycopersicum Heinz 1706 (tomato), and Zea mays B73 (maize). For the diploid, heterozygous HG002 genome, we also used “Pore-C” chromatin contact mapping to completely phase the haplotypes. We found the accuracy of Duplex data to be similar to HiFi sequencing, but with read lengths tens of kilobases longer, and the Pore-C data to be compatible with existing diploid assembly algorithms. This combination of read length and accuracy enables the construction of a high-quality initial assembly, which can then be further resolved using the UL reads, and finally phased into chromosome-scale haplotypes with Pore-C. The resulting assemblies have a base accuracy exceeding 99.999% (Q50) and near-perfect continuity, with most chromosomes assembled as single contigs. We conclude that ONT sequencing is a viable alternative to HiFi sequencing for de novo genome assembly, and provides a multirun single-instrument solution for the reconstruction of complete genomes.
  5. Abstract Genomic resources across squamate reptiles (lizards and snakes) have lagged behind other vertebrate systems and high-quality reference genomes remain scarce. The 23 chromosome-scale reference genomes across the order represent only 12 of the ~60 squamate families. Within geckos (infraorder Gekkota), a species-rich clade of lizards, chromosome-level genomes are exceptionally sparse, representing only two of the seven extant families. Using the latest advances in genome sequencing and assembly methods, we generated one of the highest-quality squamate genomes to date for the leopard gecko, Eublepharis macularius (Eublepharidae). We compared this assembly to the previous, short-read-only E. macularius reference genome published in 2016 and examined potential factors within the assembly influencing contiguity of genome assemblies using PacBio HiFi data. Briefly, the read N50 of the PacBio HiFi reads generated for this study was equal to the contig N50 of the previous E. macularius reference genome at 20.4 kilobases. The HiFi reads were assembled into a total of 132 contigs, which were further scaffolded using HiC data into 75 total sequences representing all 19 chromosomes. We found that 9 of the 19 chromosomal scaffolds were assembled as a near-single contig, whereas the other 10 chromosomes were each scaffolded together from multiple contigs. We qualitatively identified that the percent repeat content within a chromosome broadly affects its assembly contiguity prior to scaffolding. This genome assembly signifies a new age for squamate genomics where high-quality reference genomes rivaling some of the best vertebrate genome assemblies can be generated for a fraction of previous cost estimates. This new E. macularius reference assembly is available on NCBI at JAOPLA010000000.
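Note on the N50 statistic: the leopard gecko abstract (item 5) compares a read N50 with a contig N50. As a rough illustration of what that number means, the sketch below computes N50 under the common definition: the length L such that sequences of length L or longer contain at least half of the total bases. The function name `n50` and the toy lengths are hypothetical and are not taken from any of the studies listed here; the authors' own tools may compute the statistic differently.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up contig lengths (not real data).
toy_contigs = [2, 3, 4, 5, 6, 10]
print(n50(toy_contigs))  # total is 30; 10 + 6 = 16 reaches half (15), so N50 is 6
```

The same function applies to read lengths or contig lengths, which is why a read N50 and a contig N50 can be compared directly, as that abstract does.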