Title: UnigeneFinder: An Automated Pipeline for Gene Calling From Transcriptome Assemblies Without a Reference Genome
ABSTRACT For most species, transcriptome data are much more readily available than genome data. Without a reference genome, gene calling is cumbersome and inaccurate because of the high degree of redundancy in de novo transcriptome assemblies. To simplify and increase the accuracy of de novo transcriptome assembly in the absence of a reference genome, we developed UnigeneFinder. Combining several clustering methods, UnigeneFinder substantially reduces the redundancy typical of raw transcriptome assemblies. This pipeline offers an effective solution to the problem of inflated transcript numbers, achieving a closer representation of the actual underlying genome. UnigeneFinder performs comparably to or better than existing tools on plant species with varying genome complexities. UnigeneFinder is the only available transcriptome redundancy solution that fully automates the generation of primary transcript, coding region, and protein sequences, analogous to those available for high‐quality reference genomes. These features, coupled with the pipeline’s cross‐platform implementation, focus on automation, and an accessible, user‐friendly interface, make UnigeneFinder a useful tool for many downstream sequence‐based analyses in nonmodel organisms lacking a reference genome, including differential gene expression analysis, accurate ortholog identification, functional enrichments, and evolutionary analyses. UnigeneFinder also runs efficiently both on high‐performance computing (HPC) systems and personal computers, further reducing barriers to use.
Award ID(s):
2434687
PAR ID:
10585942
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Plant Direct
Volume:
9
Issue:
4
ISSN:
2475-4455
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Holland, J. (Ed.)
    Abstract Douglas-fir (Pseudotsuga menziesii) is native to western North America. It grows in a wide range of environmental conditions and is an important timber tree. Although there are several studies on the gene expression responses of Douglas-fir to abiotic cues, the absence of high-quality transcriptome and genome data is a barrier to further investigation. As with most conifers, the available transcriptome and genome reference dataset for Douglas-fir remains fragmented and requires refinement. We aimed to generate a highly accurate and complete reference transcriptome and genome annotation. We deep-sequenced the transcriptome of Douglas-fir needles from seedlings that were grown under nonstress control conditions or a combination of heat and drought stress conditions using long-read (LR) and short-read (SR) sequencing platforms. We used 2 computational approaches, namely de novo and genome-guided LR transcriptome assembly. Using the LR de novo assembly, we identified 1.3X more high-quality transcripts, 1.85X more “complete” genes, and 2.7X more functionally annotated genes compared to the genome-guided assembly approach. We predicted 666 long noncoding RNAs and 12,778 unique protein-coding transcripts including 2,016 putative transcription factors. We leveraged the LR de novo assembled transcriptome with paired-end SR and a published single-end SR transcriptome to generate an improved genome annotation. This was conducted with BRAKER2 and refined based on functional annotation, repetitive content, and transcriptome alignment. This high-quality genome annotation has 51,419 unique gene models derived from 322,631 initial predictions. Overall, our informatics approach provides a new reference Douglas-fir transcriptome assembly and genome annotation with considerably improved completeness and functional annotation.
  2. Although the reference genome of the Solanum tuberosum Group Phureja double-monoploid (DM) clone is available, the genetic diversity of the highly heterozygous tetraploid Group Tuberosum, representing most cultivated varieties, remains largely unexplored. This lack of knowledge hinders further progress in potato research. In this investigation, we first merged and manually curated the two existing partially-overlapping DM genome-based gene models, creating a union of genes in the Phureja scaffold. Next, we compiled available and newly generated RNA-Seq datasets (ca. 1.5 billion reads) for three tetraploid potato genotypes (cultivar Désirée, cultivar Rywal, and breeding clone PW363) with diverse breeding pedigrees. Short-read transcriptomes were assembled using several de novo assemblers under different settings to test for the optimal outcome. For cultivar Rywal, PacBio Iso-Seq full-length transcriptome sequencing was also performed. The EvidentialGene redundancy-reducing pipeline, complemented with in-house scripts, was employed to produce accurate and complete cultivar-specific transcriptomes, as well as to attain the pan-transcriptome. The generated transcriptomes and pan-transcriptome represent a valuable resource for potato gene variability exploration, high-throughput omics analyses, and breeding programmes.
  3. Blaimer, Bonnie (Ed.)
    Abstract A rapid proliferation in the availability of whole genome sequences (WGS), often with relatively low read depth, offers an unprecedented opportunity for phylogenomic advances using publicly available data, but there are several key challenges in applying these data. Using low‐coverage WGS data for the ant species of Formica, we conducted detailed comparisons of two different analytical pipelines (reference‐based vs. de novo genome assembly), four types of datasets (5‐kbp‐window, ultra‐conserved element [UCE], single‐copy ortholog [BUSCO] and mitogenome), and a series of analytical procedures (e.g. concatenation vs. coalescent analyses) to identify which are robust to typical WGS data. The results show that, at a shallow scale of phylogenetic relationships of closely related species, 5‐kbp‐windows from the reference‐based pipeline and UCEs from the de novo assemblies are more successful than the BUSCOs in recovering informative markers for phylogenetic inference. Compared with concatenation analyses, coalescent analyses often resulted in disparate deeper relationships in the phylogeny. This study also uncovers evident mito‐nuclear discordance and demonstrates genome‐wide gene conflicts in phylogenetic signals, both pointing to possible incomplete lineage sorting and/or hybridization during the early, rapid radiation of Formica ants. Divergence dating analyses show that different types of data and analytical methods could result in inconsistent time estimates, highlighting the potential need for multiple approaches to better understand species divergence. The strengths and weaknesses of different analytical pipelines and strategies are discussed. Findings from this study provide valuable insights for large‐scale phylogenomic projects using WGS data.
  4. Abstract Exserohilum turcicum causes northern corn leaf blight and sorghum leaf blight. While the same species causes disease in both crops, the strains are host-specific. Here, we report the sequence and de novo annotated assemblies of one sorghum- and one maize-specific E. turcicum strain. The strains were sequenced using the PacBio Sequel II system. The total genome length for both assemblies was between 44 and 45 Mb with N50 of ∼2.5 Mb. Ninety-eight percent of the Benchmarking Universal Single-Copy Orthologs (BUSCO) for both assemblies had complete status. The estimated number of genes was 11,762 and 12,029 in the sorghum- and maize-specific isolates, respectively. Funannotate, EffectorP, SignalP, and transcriptome data were used to create functional annotation of each genome. The whole-genome comparison identified ten large-scale inversions and three translocations between the maize- and sorghum-specific strains, along with homologous genes and gene duplications. RNA was sequenced from the maize- and sorghum-specific isolates 10 days post-inoculation in maize and sorghum and from axenic cultures. Gene expression data from in planta and axenic growth experiments were compared for each strain. Candidate host-specificity genes were identified by combining results from whole-genome comparison, synteny analysis, gene annotations, and transcriptome data. Overall, this study identified several candidate host-specificity genes that provide insights into E. turcicum interaction with its hosts.
  5. Background: The family Batrachoididae is a group of ecologically important teleost fishes whose unique life histories, behavior, and physiology have made them popular model organisms. Batrachoididae remain understudied in the realm of genomics, with only four reference genome assemblies available for the family, three of which are highly fragmented and not up to current assembly standards. Among these is the Gulf toadfish, Opsanus beta, a model organism for serotonin physiology which has recently been bred in captivity. Results: Here we present new de novo genome and transcriptome assemblies for the Gulf toadfish using PacBio long read technology. The genome size of the final assembly is 2.1 gigabases, which is among the largest teleost genomes. This new assembly improves significantly upon the currently available reference for Opsanus beta with a final scaffold count of 62, of which 23 are chromosome scale, an N50 of 98,402,768, and a BUSCO completeness score of 97.3%. Annotation with ab initio and transcriptome-based methods generated 41,076 gene models. The genome is highly repetitive, with ~70% of the genome composed of simple repeats and transposable elements. Satellite DNA analysis identified potential telomeric and centromeric regions. Conclusions: This improved assembly represents a valuable resource for future research using this important model organism and for teleost genomics more broadly.