Title: Inference of a genome-wide protein-coding gene set of the inshore hagfish Eptatretus burgeri
The hagfishes (Myxiniformes) arose from agnathan (jawless vertebrate) lineages and, together with lampreys (Petromyzontiformes), form one of only two extant cyclostome taxa. Even though whole-genome sequencing has been achieved for diverse vertebrate taxa, genome-wide sequence information has remained highly limited for cyclostomes. Here we sequenced the genome of the inshore hagfish Eptatretus burgeri on a short-read sequencing platform, using DNA extracted from the testis, with the aim of reconstructing a high-coverage protein-coding gene catalogue. The resulting genome assembly, scaffolded with mate-pair reads and paired RNA-seq reads, exhibited an N50 scaffold length of 293 Kbp, which allowed genome-wide prediction of coding genes. This prediction yielded gene models whose completeness, assessed against evolutionarily conserved single-copy orthologs, was estimated at more than 83% for complete coverage and more than 93% for partial coverage. The high contiguity of the assembly and the completeness of the gene models promise high utility in various comparative analyses, including phylogenomics and phylome exploration.
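The abstract reports contiguity as an N50 scaffold length. As a minimal sketch of how that statistic is conventionally computed (not the authors' pipeline), the Python below takes a list of scaffold lengths and returns the length of the scaffold at which the cumulative sum, counted from the longest scaffold downward, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence to the shortest.
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare against half the total without floating-point division.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths in bp (not data from the study).
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
```

Because N50 depends only on the length distribution, it measures contiguity rather than correctness, which is why the abstract reports it alongside a separate ortholog-based completeness estimate.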
Award ID(s):
1818012
NSF-PAR ID:
10432052
Journal Name:
F1000Research
Volume:
11
ISSN:
2046-1402
Page Range / eLocation ID:
1270
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Holland, J. (Ed.)
    Abstract

    Douglas-fir (Pseudotsuga menziesii) is native to western North America. It grows in a wide range of environmental conditions and is an important timber tree. Although there are several studies on the gene expression responses of Douglas-fir to abiotic cues, the absence of high-quality transcriptome and genome data is a barrier to further investigation. As for most conifers, the available transcriptome and genome reference dataset for Douglas-fir remains fragmented and requires refinement. We aimed to generate a highly accurate and complete reference transcriptome and genome annotation. We deep-sequenced the transcriptome of Douglas-fir needles from seedlings that were grown under nonstress control conditions or a combination of heat and drought stress conditions using long-read (LR) and short-read (SR) sequencing platforms. We used 2 computational approaches, namely de novo and genome-guided LR transcriptome assembly. Using the LR de novo assembly, we identified 1.3X more high-quality transcripts, 1.85X more “complete” genes, and 2.7X more functionally annotated genes compared to the genome-guided assembly approach. We predicted 666 long noncoding RNAs and 12,778 unique protein-coding transcripts including 2,016 putative transcription factors. We leveraged the LR de novo assembled transcriptome with paired-end SR and a published single-end SR transcriptome to generate an improved genome annotation. This was conducted with BRAKER2 and refined based on functional annotation, repetitive content, and transcriptome alignment. This high-quality genome annotation has 51,419 unique gene models derived from 322,631 initial predictions. Overall, our informatics approach provides a new reference Douglas-fir transcriptome assembly and genome annotation with considerably improved completeness and functional annotation.

     
  2. Abstract Background The pond snail, Lymnaea stagnalis (L. stagnalis), has served as a valuable model organism for neurobiology studies due to its simple and easily accessible central nervous system (CNS). L. stagnalis has been widely used to study neuronal networks and has recently gained popularity for the study of aging and neurodegenerative diseases. However, previous transcriptome studies of L. stagnalis CNS have been carried out exclusively on adult animals. As part of our ongoing effort studying L. stagnalis neuronal growth and connectivity at various developmental stages, we provide the first age-specific transcriptome analysis and gene annotation of young (3 months), adult (6 months), and old (18 months) L. stagnalis CNS. Results Using the above three age cohorts, our study generated 55–69 million 150 bp paired-end RNA sequencing reads using the Illumina NovaSeq 6000 platform. Of these reads, ~74% were successfully mapped to the reference genome of L. stagnalis. Our reference-based transcriptome assembly predicted 42,478 gene loci, of which 37,661 genes encode coding sequences (CDS) of at least 100 codons. In addition, we provide gene annotations using Blast2GO and functional annotations using Pfam for ~95% of these sequences, contributing to the largest number of annotated genes in L. stagnalis CNS so far. Moreover, among 242 previously cloned L. stagnalis genes, we were able to match ~87% of them in our transcriptome assembly, indicating a high percentage of gene coverage. The expressional differences for innexins, FMRFamide, and molluscan insulin peptide genes were validated by real-time qPCR. Lastly, our transcriptomic analyses revealed distinct, age-specific gene clusters, differentially expressed genes, and enriched pathways in young, adult, and old CNS. More specifically, our data show significant changes in expression of critical genes involved in transcription factors, metabolism (e.g. cytochrome P450), extracellular matrix constituent, and signaling receptor and transduction (e.g. receptors for acetylcholine, N-Methyl-D-aspartic acid, and serotonin), as well as stress- and disease-related genes in young compared to either adult or old snails. Conclusions Together, these datasets are the largest and most updated L. stagnalis CNS transcriptomes, which will serve as a resource for future molecular studies and functional annotation of transcripts and genes in L. stagnalis.
  3. Abstract Background Modern sequencing technologies should make the assembly of the relatively small mitochondrial genomes an easy undertaking. However, few tools exist that address mitochondrial assembly directly. Results As part of the Vertebrate Genomes Project (VGP) we develop mitoVGP, a fully automated pipeline for similarity-based identification of mitochondrial reads and de novo assembly of mitochondrial genomes that incorporates both long (> 10 kbp, PacBio or Nanopore) and short (100–300 bp, Illumina) reads. Our pipeline leads to successful complete mitogenome assemblies of 100 vertebrate species of the VGP. We observe that tissue type and library size selection have considerable impact on mitogenome sequencing and assembly. Comparing our assemblies to purportedly complete reference mitogenomes based on short-read sequencing, we identify errors, missing sequences, and incomplete genes in those references, particularly in repetitive regions. Our assemblies also identify novel gene region duplications. The presence of repeats and duplications in over half of the species herein assembled indicates that their occurrence is a principle of mitochondrial structure rather than an exception, shedding new light on mitochondrial genome evolution and organization. Conclusions Our results indicate that even in the “simple” case of vertebrate mitogenomes the completeness of many currently available reference sequences can be further improved, and caution should be exercised before claiming the complete assembly of a mitogenome, particularly from short reads alone. 
  4. Abstract Motivation

    De novo genome assembly is a challenging computational problem due to the high repetitive content of eukaryotic genomes and the imperfections of sequencing technologies (i.e. sequencing errors, uneven sequencing coverage and chimeric reads). Several assembly tools are currently available, each of which has strengths and weaknesses in dealing with the trade-off between maximizing contiguity and minimizing assembly errors (e.g. mis-joins). To obtain the best possible assembly, it is common practice to generate multiple assemblies from several assemblers and/or parameter settings and try to identify the highest quality assembly. Unfortunately, often there is no assembly that both maximizes contiguity and minimizes assembly errors, so one has to compromise one for the other.

    Results

    The concept of assembly reconciliation has been proposed as a way to obtain a higher quality assembly by merging or reconciling all the available assemblies. While several reconciliation methods have been introduced in the literature, we have shown in one of our recent papers that none of them can consistently produce assemblies that are better than the assemblies provided as input. Here we introduce Novo&Stitch, a novel method that takes advantage of optical maps to accurately carry out assembly reconciliation (assuming that the assembled contigs are sufficiently long to be reliably aligned to the optical maps, e.g. 50 Kbp or longer). Experimental results demonstrate that Novo&Stitch can double the contiguity (N50) of the input assemblies without introducing mis-joins or reducing genome completeness.

    Availability and implementation

    Novo&Stitch can be obtained from https://github.com/ucrbioinfo/Novo_Stitch.

     
  5. Abstract Background

    The Aldabra giant tortoise (Aldabrachelys gigantea) is one of only two giant tortoise species left in the world. The species is endemic to Aldabra Atoll in Seychelles and is listed as Vulnerable on the International Union for Conservation of Nature Red List (v2.3) due to its limited distribution and threats posed by climate change. Genomic resources for A. gigantea are lacking, hampering conservation efforts for both wild and ex situ populations. A high-quality genome would also open avenues to investigate the genetic basis of the species’ exceptionally long life span.

    Findings

    We produced the first chromosome-level de novo genome assembly of A. gigantea using PacBio High-Fidelity sequencing and high-throughput chromosome conformation capture. We produced a 2.37-Gbp assembly with a scaffold N50 of 148.6 Mbp and a resolution into 26 chromosomes. RNA sequencing–assisted gene model prediction identified 23,953 protein-coding genes and 1.1 Gbp of repetitive sequences. Synteny analyses among turtle genomes revealed high levels of chromosomal collinearity even among distantly related taxa. To assess the utility of the high-quality assembly for species conservation, we performed a low-coverage resequencing of 30 individuals from wild populations and two zoo individuals. Our genome-wide population structure analyses detected genetic population structure in the wild and identified the most likely origin of the zoo-housed individuals. We further identified putatively deleterious mutations to be monitored.

    Conclusions

    We establish a high-quality chromosome-level reference genome for A. gigantea and one of the most complete turtle genomes available. We show that low-coverage whole-genome resequencing, for which alignment to the reference genome is a necessity, is a powerful tool to assess the population structure of the wild population and reveal the geographic origins of ex situ individuals relevant for genetic diversity management and rewilding efforts.

     