The high sequencing error rate has impeded the application of long noisy reads for diploid genome assembly. Most existing assemblers failed to generate high-quality phased assemblies using long noisy reads. Here, we present PECAT, a
-
Abstract Phased Error Correction and Assembly Tool, for reconstructing diploid genomes from long noisy reads. We design a haplotype-aware error correction method that can retain heterozygote alleles while correcting sequencing errors. We combine a corrected read SNP caller and a raw read SNP caller to further improve the identification of inconsistent overlaps in the string graph. We use a grouping method to assign reads to different haplotype groups. PECAT efficiently assembles diploid genomes using Nanopore R9, PacBio CLR or Nanopore R10 reads only. PECAT generates more contiguous haplotype-specific contigs compared to other assemblers. Especially, PECAT achieves nearly haplotype-resolved assembly on B. taurus (Bison×Simmental) using Nanopore R9 reads and phase block NG50 with 59.4/58.0 Mb for HG002 using Nanopore R10 reads.
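The phase block NG50 quoted above is a contiguity statistic. As a rough illustration only, the sketch below computes NG50 in the standard sense: the length L such that blocks of length at least L together cover half of an assumed genome size. The function name `ng50` and the toy block lengths are hypothetical and are not taken from PECAT or its evaluation scripts.

```python
def ng50(block_lengths, genome_size):
    """Return NG50 for a list of block (or contig) lengths.

    NG50 is the length of the block at which the running total of
    lengths, taken from longest to shortest, first reaches half of
    the (assumed) genome size. Returns 0 if the blocks never cover
    half of the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(block_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0


# Toy example with made-up lengths (not real assembly data).
toy_blocks = [50, 40, 30, 20, 10]
print(ng50(toy_blocks, genome_size=200))
```

Unlike N50, which uses the total assembled length, NG50 uses a fixed genome size, so values are comparable across assemblies of the same genome.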
Abstract Long single-molecular sequencing technologies, such as PacBio circular consensus sequencing (CCS) and nanopore sequencing, are advantageous in detecting DNA 5-methylcytosine in CpGs (5mCpGs), especially in repetitive genomic regions. However, existing methods for detecting 5mCpGs using PacBio CCS are less accurate and robust. Here, we present ccsmeth, a deep-learning method to detect DNA 5mCpGs using CCS reads. We sequence polymerase-chain-reaction treated and M.SssI-methyltransferase treated DNA of one human sample using PacBio CCS for training ccsmeth. Using long (≥10 Kb) CCS reads, ccsmeth achieves 0.90 accuracy and 0.97 Area Under the Curve on 5mCpG detection at single-molecule resolution. At the genome-wide site level, ccsmeth achieves >0.90 correlations with bisulfite sequencing and nanopore sequencing using only 10× reads. Furthermore, we develop a Nextflow pipeline, ccsmethphase, to detect haplotype-aware methylation using CCS reads, and then sequence a Chinese family trio to validate it. ccsmeth and ccsmethphase can be robust and accurate tools for detecting DNA 5-methylcytosines.
-
Abstract Although long-read single-cell RNA isoform sequencing (scISO-Seq) can reveal alternative RNA splicing in individual cells, it suffers from a low read throughput. Here, we introduce HIT-scISOseq, a method that removes most artifact cDNAs and concatenates multiple cDNAs for PacBio circular consensus sequencing (CCS) to achieve high-throughput and high-accuracy single-cell RNA isoform sequencing. HIT-scISOseq can yield >10 million high-accuracy long-reads in a single PacBio Sequel II SMRT Cell 8M. We also report the development of scISA-Tools that demultiplex HIT-scISOseq concatenated reads into single-cell cDNA reads with >99.99% accuracy and specificity. We apply HIT-scISOseq to characterize the transcriptomes of 3375 corneal limbus cells and reveal cell-type-specific isoform expression in them. HIT-scISOseq is a high-throughput, high-accuracy, technically accessible method and it can accelerate the burgeoning field of long-read single-cell transcriptomics.
-
Abstract Sweet orange originated from the introgressive hybridizations of pummelo and mandarin resulting in a highly heterozygous genome. How alleles from the two species cooperate in shaping sweet orange phenotypes under distinct circumstances is unknown. Here, we assembled a chromosome-level phased diploid Valencia sweet orange (DVS) genome with over 99.999% base accuracy and 99.2% gene annotation BUSCO completeness. DVS enables allele-level studies for sweet orange and other hybrids between pummelo and mandarin. We first configured an allele-aware transcriptomic profiling pipeline and applied it to 740 sweet orange transcriptomes. On average, 32.5% of genes have a significantly biased allelic expression in the transcriptomes. Different cultivars, transgenic lineages, tissues, development stages, and disease status all impacted allelic expressions and resulted in diversified allelic expression patterns in sweet orange, but particularly citrus Huanglongbing (HLB) shifted the allelic expression of hundreds of genes in leaves and calyx abscission zones. In addition, we detected allelic structural mutations in an HLB-tolerant mutant (T19) and a more sensitive mutant (T78) through long-read sequencing. The irradiation-induced structural mutations mostly involved double-strand breaks, while most spontaneous structural mutations were transposon insertions. In the mutants, most genes with significant allelic expression ratio alterations (≥1.5-fold) were directly affected by those structural mutations. In T19, alleles located at a translocated segment terminal were upregulated, including CsDnaJ, CsHSP17.4B, and CsCEBPZ. Their upregulation is inferred to keep phloem protein homeostasis under the stress from HLB and enable subsequent stress responses observed in T19. DVS will advance allelic level studies in citrus.
-
Abstract Fractionally doped perovskite oxides (FDPOs) have found wide-ranging applications in energy conversion, storage and harvesting, catalysis, sensing, superconductivity, ferroelectrics, piezoelectrics, magnetism, and luminescence. Hence, an accurate, cost-effective, and easy-to-use methodology to discover new compositions is much needed. Here, we developed a function-confined machine learning methodology to discover new FDPOs with high prediction accuracy from limited experimental data. By focusing on a specific application, namely solar thermochemical hydrogen production, we collected 632 training data points and defined 21 desirable features. Our gradient boosting classifier model achieved a high prediction accuracy of 95.4% and a high F1 score of 0.921. Furthermore, when verified on an additional 36 experimental data points from existing literature, the model showed a prediction accuracy of 94.4%. With the help of this machine learning approach, we identified and synthesized 11 new FDPO compositions, 7 of which are relevant for solar thermochemical hydrogen production. We believe this confined machine learning methodology can be used to discover, from limited data, FDPOs with other specific application purposes.
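For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how accuracy and F1 score are derived from binary confusion-matrix counts. The helper `classification_metrics` and the toy counts are hypothetical; the abstract does not report the model's confusion matrix. The final line is an inference rather than a stated result: 34 correct predictions out of 36 would be consistent with the reported 94.4%.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, f1) from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives. Ratios with a zero denominator are reported
    as 0.0 to avoid division errors.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Toy counts, made up for illustration (not the paper's data).
accuracy, f1 = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")

# Assumption: 34 of 36 correct would round to the reported 94.4%.
print(round(34 / 36 * 100, 1))
```

Accuracy counts all correct predictions, while F1 balances precision and recall on the positive class, which is why the two reported values can differ.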