The sequence and assembly of human genomes using long‐read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high‐fidelity (HiFi) or continuous long‐read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5‐fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes.
- Award ID(s):
- 1927159
- NSF-PAR ID:
- 10351089
- Editor(s):
- Katz, Laura A.; Capone, Douglas G.
- Date Published:
- Journal Name:
- mBio
- Volume:
- 12
- Issue:
- 1
- ISSN:
- 2161-2129
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract -
Abstract Background The eukaryotic genome is capable of producing multiple isoforms from a gene by alternative polyadenylation (APA) during pre-mRNA processing. APA in the 3′-untranslated region (3′-UTR) of mRNA produces transcripts with shorter or longer 3′-UTR. Often, 3′-UTR serves as a binding platform for microRNAs and RNA-binding proteins, which affect the fate of the mRNA transcript. Thus, 3′-UTR APA is known to modulate translation and provides a mean to regulate gene expression at the post-transcriptional level. Current bioinformatics pipelines have limited capability in profiling 3′-UTR APA events due to incomplete annotations and a low-resolution analyzing power: widely available bioinformatics pipelines do not reference actionable polyadenylation (cleavage) sites but simulate 3′-UTR APA only using RNA-seq read coverage, causing false positive identifications. To overcome these limitations, we developed APA-Scan, a robust program that identifies 3′-UTR APA events and visualizes the RNA-seq short-read coverage with gene annotations.
Methods APA-Scan utilizes either predicted or experimentally validated actionable polyadenylation signals as a reference for polyadenylation sites and calculates the quantity of long and short 3′-UTR transcripts in the RNA-seq data. APA-Scan works in three major steps: (i) calculate the read coverage of the 3′-UTR regions of genes; (ii) identify the potential APA sites and evaluate the significance of the events among two biological conditions; (iii) graphical representation of user specific event with 3′-UTR annotation and read coverage on the 3′-UTR regions. APA-Scan is implemented in Python3. Source code and a comprehensive user’s manual are freely available at
https://github.com/compbiolabucf/APA-Scan .Result APA-Scan was applied to both simulated and real RNA-seq datasets and compared with two widely used baselines DaPars and APAtrap. In simulation APA-Scan significantly improved the accuracy of 3′-UTR APA identification compared to the other baselines. The performance of APA-Scan was also validated by 3′-end-seq data and qPCR on mouse embryonic fibroblast cells. The experiments confirm that APA-Scan can detect unannotated 3′-UTR APA events and improve genome annotation.
Conclusion APA-Scan is a comprehensive computational pipeline to detect transcriptome-wide 3′-UTR APA events. The pipeline integrates both RNA-seq and 3′-end-seq data information and can efficiently identify the significant events with a high-resolution short reads coverage plots.
-
Parrish, Colin R. (Ed.)ABSTRACT Chloroviruses (family Phycodnaviridae ) are large double-stranded DNA (dsDNA) viruses that infect unicellular green algae present in inland waters. These viruses have been isolated using three main chlorella-like green algal host cells, traditionally called NC64A, SAG, and Pbi, revealing extensive genetic diversity. In this study, we performed a functional genomic analysis on 36 chloroviruses that infected the three different hosts. Phylogenetic reconstruction based on the DNA polymerase B family gene clustered the chloroviruses into three distinct clades. The viral pan-genome consists of 1,345 clusters of orthologous groups of genes (COGs), with 126 COGs conserved in all viruses. Totals of 368, 268, and 265 COGs are found exclusively in viruses that infect NC64A, SAG, and Pbi algal hosts, respectively. Two-thirds of the COGs have no known function, constituting the “dark pan-genome” of chloroviruses, and further studies focusing on these genes may identify important novelties. The proportions of functionally characterized COGs composing the pan-genome and the core-genome are similar, but those related to transcription and RNA processing, protein metabolism, and virion morphogenesis are at least 4-fold more represented in the core genome. Bipartite network construction evidencing the COG sharing among host-specific viruses identified 270 COGs shared by at least one virus from each of the different host groups. Finally, our results reveal an open pan-genome for chloroviruses and a well-established core genome, indicating that the isolation of new chloroviruses can be a valuable source of genetic discovery. IMPORTANCE Chloroviruses are large dsDNA viruses that infect unicellular green algae distributed worldwide in freshwater environments. They comprise a genetically diverse group of viruses; however, a comprehensive investigation of the genomic evolution of these viruses is still missing. Here, we performed a functional pan-genome analysis comprising 36 chloroviruses associated with three different algal hosts in the family Chlorellaceae , referred to as zoochlorellae because of their endosymbiotic lifestyle. We identified a set of 126 highly conserved genes, most of which are related to essential functions in the viral replicative cycle. Several genes are unique to distinct isolates, resulting in an open pan-genome for chloroviruses. This profile is associated with generalist organisms, and new insights into the evolution and ecology of chloroviruses are presented. Ultimately, our results highlight the potential for genetic diversity in new isolates.more » « less
-
Eyre-Walker, Adam (Ed.)Abstract Our knowledge of the Major Histocompatibility Complex (MHC) in birds is limited because it often consists of numerous duplicated genes within individuals that are difficult to assemble with short read sequencing technology. Long-read sequencing provides an opportunity to overcome this limitation because it allows the assembly of long regions with repetitive elements. In this study, we used genomes based on long-read sequencing to predict the number and location of MHC loci in a broad range of bird taxa. From the long-read-based genomes of 34 species, we found that there was extremely large variation in the number of MHC loci between species. Overall, there were greater numbers of both class I and II loci in passerines than nonpasserines. The highest numbers of loci (up to 193 class II loci) were found in manakins (Pipridae), which had previously not been studied at the MHC. Our results provide the first direct evidence from passerine genomes of this high level of duplication. We also found different duplication patterns between species. In some species, both MHC class I and II genes were duplicated together, whereas in most species they were duplicated independently. Our study shows that the analysis of long-read-based genomes can dramatically improve our knowledge of MHC structure, although further improvements in chromosome level assembly are needed to understand the evolutionary mechanisms producing the extraordinary interspecific variation in the architecture of the MHC region.more » « less
-
Introduction Nosema is a diverse genus of unicellular microsporidian parasites of insects and other arthropods.Nosema muscidifuracis infects parasitoid wasp species ofMuscidifurax zaraptor andM. raptor (Hymenoptera: Pteromalidae), causing ~50% reduction in longevity and ~90% reduction in fecundity.Methods and Results Here, we report the first assembly of the
N. muscidifuracis genome (14,397,169 bp in 28 contigs) of high continuity (contig N50 544.3 Kb) and completeness (BUSCO score 97.0%). A total of 2,782 protein-coding genes were annotated, with 66.2% of the genes having two copies and 24.0% of genes having three copies. These duplicated genes are highly similar, with a sequence identity of 99.3%. The complex pattern suggests extensive gene duplications and rearrangements across the genome. We annotated 57 rDNA loci, which are highly GC-rich (37%) in a GC-poor genome (25% genome average).Nosema -specific qPCR primer sets were designed based on 18S rDNA annotation as a diagnostic tool to determine its titer in host samples. We discovered highNosema titers inNosema -curedM. raptor andM. zaraptor using heat treatment in 2017 and 2019, suggesting that the remedy did not completely eliminate theNosema infection. Cytogenetic analyses revealed heavy infections ofN. muscidifuracis within the ovaries ofM. raptor andM. zaraptor , consistent with the titer determined by qPCR and suggesting a heritable component of infection and per ovum vertical transmission.Discussion The parasitoids-
Nosema system is laboratory tractable and, therefore, can serve as a model to inform future genome manipulations ofNosema -host system for investigations of Nosemosis.