skip to main content

Title: High resolution copy number inference in cancer using short-molecule nanopore sequencing
Abstract Genome copy number is an important source of genetic variation in health and disease. In cancer, Copy Number Alterations (CNAs) can be inferred from short-read sequencing data, enabling genomics-based precision oncology. Emerging Nanopore sequencing technologies offer the potential for broader clinical utility, for example in smaller hospitals, due to lower instrument cost, higher portability, and ease of use. Nonetheless, Nanopore sequencing devices are limited in the number of retrievable sequencing reads/molecules compared to short-read sequencing platforms, limiting CNA inference accuracy. To address this limitation, we targeted the sequencing of short-length DNA molecules loaded at optimized concentration in an effort to increase sequence read/molecule yield from a single nanopore run. We show that short-molecule nanopore sequencing reproducibly returns high read counts and allows high quality CNA inference. We demonstrate the clinical relevance of this approach by accurately inferring CNAs in acute myeloid leukemia samples. The data shows that, compared to traditional approaches such as chromosome analysis/cytogenetics, short molecule nanopore sequencing returns more sensitive, accurate copy number information in a cost effective and expeditious manner, including for multiplex samples. Our results provide a framework for short-molecule nanopore sequencing with applications in research and medicine, which includes but is not limited to, more » CNAs. « less
; ; ; ; ; ; ; ;
Award ID(s):
1627442 1920103
Publication Date:
Journal Name:
Nucleic Acids Research
Page Range or eLocation-ID:
e124 to e124
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background Every tumor is composed of heterogeneous clones, each corresponding to a distinct subpopulation of cells that accumulated different types of somatic mutations, ranging from single-nucleotide variants (SNVs) to copy-number aberrations (CNAs). As the analysis of this intra-tumor heterogeneity has important clinical applications, several computational methods have been introduced to identify clones from DNA sequencing data. However, due to technological and methodological limitations, current analyses are restricted to identifying tumor clones only based on either SNVs or CNAs, preventing a comprehensive characterization of a tumor’s clonal composition. Results To overcome these challenges, we formulate the identification of clones in terms of both SNVs and CNAs as a integration problem while accounting for uncertainty in the input SNV and CNA proportions. We thus characterize the computational complexity of this problem and we introduce PACTION (PArsimonious Clone Tree integratION), an algorithm that solves the problem using a mixed integer linear programming formulation. On simulated data, we show that tumor clones can be identified reliably, especially when further taking into account the ancestral relationships that can be inferred from the input SNVs and CNAs. On 49 tumor samples from 10 prostate cancer patients, our integration approach provides a higher resolution view ofmore »tumor evolution than previous studies. Conclusion PACTION is an accurate and fast method that reconstructs clonal architecture of cancer tumors by integrating SNV and CNA clones inferred using existing methods.« less
  2. INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implementedmore »a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations. We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat– Alu ; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.« less
  3. Abstract Background

    Alterations in genes encoding chromatin regulatory proteins are prevalent in cancers and may confer oncogenic properties and molecular changes linked to therapy resistance. However, the impact of copy number alterations (CNAs) of the SWItch/Sucrose NonFermentable (SWI/SNF) complex on the oncogenic and immunologic properties has not been systematically explored across human cancer types.


    We comprehensively analyzed the genomic, transcriptomic and clinical data of The Cancer Genome Atlas (TCGA) dataset across 33 solid cancers.


    CNAs of the SWI/SNF components were identified in more than 25% of all queried cancers, and tumors harboring SWI/SNF CNAs demonstrated a worse overall survival (OS) than others in several cancer types. Mechanistically, the SCNA events in the SWI/SNF complex are correlated with dysregulated genomic features and oncogenic pathways, including the cell cycle, DNA damage and repair. Notably, the SWI/SNF CNAs were associated with homologous recombination deficiency (HRD) and improved clinical outcomes of platinum-treated ovarian cancer. Furthermore, we observed distinct immune infiltrating patterns and immunophenotypes associated with SWI/SNF CNAs in different cancer types.


    The CNA events of the SWI/SNF components are a key process linked to oncogenesis, immune infiltration and therapeutic responsiveness across human cancers.

  4. Microbes (bacteria, yeasts, molds), in addition to plants and animals, were domesticated for their roles in food preservation, nutrition and flavor. Aspergillus oryzae is a domesticated filamentous fungal species traditionally used during fermentation of Asian foods and beverage, such as sake, soy sauce, and miso. To date, little is known about the extent of genome and phenotypic variation of A. oryzae isolates from different clades. Here, we used long-read Oxford Nanopore and short-read Illumina sequencing to produce a highly accurate and contiguous genome assemble of A. oryzae 14160, an industrial strain from China. To understand the relationship of this isolate, we performed phylogenetic analysis with 90 A. oryzae isolates and 1 isolate of the A. oryzae progenitor, Aspergillus flavus . This analysis showed that A. oryzae 14160 is a member of clade A, in comparison to the RIB 40 type strain, which is a member of clade F. To explore genome variation between isolates from distinct A. oryzae clades, we compared the A. oryzae 14160 genome with the complete RIB 40 genome. Our results provide evidence of independent evolution of the alpha-amylase gene duplication, which is one of the major adaptive mutations resulting from domestication. Synteny analysis revealed that bothmore »genomes have three copies of the alpha-amylase gene, but only one copy on chromosome 2 was conserved. While the RIB 40 genome had additional copies of the alpha-amylase gene on chromosomes III, and V, 14160 had a second copy on chromosome II and an third copy on chromosome VI. Additionally, we identified hundreds of lineage specific genes, and putative high impact mutations in genes involved in secondary metabolism, including several of the core biosynthetic genes. Finally, to examine the functional effects of genome variation between strains, we measured amylase activity, proteolytic activity, and growth rate on several different substrates. RIB 40 produced significantly higher levels of amylase compared to 14160 when grown on rice and starch. Accordingly, RIB 40 grew faster on rice, while 14160 grew faster on soy. Taken together, our analyses reveal substantial genome and phenotypic variation within A. oryzae .« less
  5. Nanopore genome sequencing is the key to enabling personalized medicine, global food security, and virus surveillance. The state-of-the-art base-callers adopt deep neural networks (DNNs) to translate electrical signals generated by nanopore sequencers to digital DNA symbols. A DNN-based base-caller consumes 44.5% of total execution time of a nanopore sequencing pipeline. However, it is difficult to quantize a base-caller and build a power-efficient processing-in-memory (PIM) to run the quantized base-caller. Although conventional network quantization techniques reduce the computing overhead of a base-caller by replacing floating-point multiply-accumulations by cheaper fixed-point operations, it significantly increases the number of systematic errors that cannot be corrected by read votes. The power density of prior nonvolatile memory (NVM)-based PIMs has already exceeded memory thermal tolerance even with active heat sinks, because their power efficiency is severely limited by analog-to-digital converters (ADC). Finally, Connectionist Temporal Classification (CTC) decoding and read voting cost 53.7% of total execution time in a quantized base-caller, and thus became its new bottleneck. In this paper, we propose a novel algorithm/architecture co-designed PIM, Helix, to power-efficiently and accurately accelerate nanopore base-calling. From algorithm perspective, we present systematic error aware training to minimize the number of systematic errors in a quantized base-caller. From architecturemore »perspective, we propose a low-power SOT-MRAM-based ADC array to process analog-to-digital conversion operations and improve power efficiency of prior DNN PIMs. Moreover, we revised a traditional NVM-based dot-product engine to accelerate CTC decoding operations, and create a SOT-MRAM binary comparator array to process read voting. Compared to state-of-the-art PIMs, Helix improves base-calling throughput by 6x, throughput per Watt by 11.9x and per mm2 by 7.5x without degrading base-calling accuracy.« less