
Title: HolistIC: leveraging Hi–C and whole genome shotgun sequencing for double minute chromosome discovery
Abstract Motivation

Double minute (DM) chromosomes are acentric extrachromosomal DNA artifacts that are frequently observed in the cells of numerous cancers. They are highly amplified and contain oncogenes and drug-resistance genes, making their presence a challenge for effective cancer treatment. Algorithmic discovery of DMs can potentially improve bench-derived therapies for cancer treatment. A hindrance to this task is that DMs evolve, yielding circular chromatin that shares segments from progenitor DMs. This creates DMs with overlapping amplicon coordinates. Existing DM discovery algorithms use whole genome shotgun sequencing (WGS) in isolation, which can lead them to misclassify DMs that share overlapping coordinates.
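To make the notion of overlapping amplicon coordinates concrete, the following minimal Python sketch lists the genomic segments shared by two DMs. It is an illustration only and not part of HolistIC: the function name `shared_segments`, the interval representation and all coordinates are hypothetical.

```python
from typing import List, Tuple

# (chromosome, start, end) with half-open coordinates; a hypothetical representation
Interval = Tuple[str, int, int]


def shared_segments(dm_a: List[Interval], dm_b: List[Interval]) -> List[Interval]:
    """Return the segments whose coordinates appear in both amplicon lists."""
    shared = []
    for chrom_a, start_a, end_a in dm_a:
        for chrom_b, start_b, end_b in dm_b:
            if chrom_a != chrom_b:
                continue  # segments on different chromosomes cannot overlap
            lo, hi = max(start_a, start_b), min(end_a, end_b)
            if lo < hi:  # non-empty intersection
                shared.append((chrom_a, lo, hi))
    return shared


# Made-up coordinates for two DMs that share part of one chr8 segment.
dm1 = [("chr8", 127_000_000, 128_500_000), ("chr8", 129_000_000, 129_400_000)]
dm2 = [("chr8", 128_000_000, 128_800_000), ("chr12", 57_000_000, 57_600_000)]

overlap = shared_segments(dm1, dm2)
print(overlap)
```

A shared segment like the one reported here is amplified in both DMs, so read depth over that segment alone does not indicate which DM it belongs to.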

Results

In this study, we describe an algorithm called ‘HolistIC’ that can predict DMs in tumor genomes by integrating WGS and Hi–C sequencing data. The consolidation of these sources of information resolves the ambiguity in DM amplicon prediction that exists when WGS data are used in isolation. We implemented and tested our algorithm on the tandem Hi–C and WGS data of three cancer datasets and a simulated dataset. Results on the cancer datasets demonstrated HolistIC’s ability to predict DMs from Hi–C and WGS data in tandem. The results on the simulated data showed that HolistIC can accurately distinguish DMs that have overlapping amplicon coordinates, an advance over methods that predict extrachromosomal amplification using WGS data in isolation.

Availability and implementation

Our software, named ‘HolistIC’, is available at http://www.github.com/mhayes20/HolistIC.

Supplementary information

Supplementary data are available at Bioinformatics online.

Award ID(s):
1901258
NSF-PAR ID:
10362472
Journal Name:
Bioinformatics
Volume:
38
Issue:
5
Page Range or eLocation-ID:
p. 1208-1215
ISSN:
1367-4803
Publisher:
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    High throughput chromosome conformation capture (Hi-C) contact matrices are used to predict 3D chromatin structures in eukaryotic cells. High-resolution Hi-C data are less available than low-resolution Hi-C data due to sequencing costs but provide greater insight into the intricate details of 3D chromatin structures such as enhancer–promoter interactions and sub-domains. To provide a cost-effective solution to high-resolution Hi-C data collection, deep learning models are used to predict high-resolution Hi-C matrices from existing low-resolution matrices across multiple cell types.

    Results

    Here, we present two Cascading Residual Networks called HiCARN-1 and HiCARN-2, a convolutional neural network and a generative adversarial network, that use a novel framework of cascading connections throughout the network for Hi-C contact matrix prediction from low-resolution data. Shown by image evaluation and Hi-C reproducibility metrics, both HiCARN models, overall, outperform state-of-the-art Hi-C resolution enhancement algorithms in predictive accuracy for both human and mouse 1/16, 1/32, 1/64 and 1/100 downsampled high-resolution Hi-C data. Also, validation by extracting topologically associating domains, chromosome 3D structure and chromatin loop predictions from the enhanced data shows that HiCARN can proficiently reconstruct biologically significant regions.

    Availability and implementation

    HiCARN is available as open-source software at https://github.com/OluwadareLab/HiCARN and as a containerized application that can be run on any platform.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  2. Abstract Background

    Three genes clustered together on chromosome 12 comprise a group of hydroxycarboxylic acid receptors (HCARs): HCAR1, HCAR2, and HCAR3. These paralogous genes encode different G-protein coupled receptors responsible for detecting glycolytic metabolites and controlling fatty acid oxidation. Though better known for regulating lipid metabolism in adipocytes, more recently, HCARs have been functionally associated with breast cancer proliferation/survival; HCAR2 has been described as a tumor suppressor and HCAR1 and HCAR3 as oncogenes. Thus, we sought to identify germline variants in HCAR1, HCAR2, and HCAR3 that could potentially be associated with breast cancer risk.

    Methods

    Two different cohorts of breast cancer cases were investigated, the Alabama Hereditary Cancer Cohort and The Cancer Genome Atlas, which were analyzed through nested PCRs/Sanger sequencing and whole-exome sequencing, respectively. All datasets were screened for rare, non-synonymous coding variants.

    Results

    Variants were identified in both breast cancer cohorts, some of which appeared to be associated with breast cancer (BC) risk, including HCAR1 c.58C > G (p.P20A), HCAR2 c.424C > T (p.R142W), HCAR2 c.517_518delinsAC (p.G173T), HCAR2 c.1036A > G (p.M346V), HCAR2 c.1086_1090del (p.P363Nfs*26), HCAR3 c.560G > A (p.R187Q), and HCAR3 c.1117delC (p.Q373Kfs*82). Additionally, HCAR2 c.515C > T (p.S172L), a previously identified loss-of-function variant, was identified.

    Conclusions

    Due to the important role of HCARs in breast cancer, it is vital to understand how these genetic variants play a role in breast cancer risk and proliferation and their consequences on treatment strategies. Additional studies will be needed to validate these findings. Nevertheless, the identification of these potentially pathogenic variants supports the need to investigate their functional consequences.

  3. Abstract Motivation

    As RNA viruses mutate and adapt to environmental changes, often developing resistance to anti-viral vaccines and drugs, they form an ensemble of viral strains, known as a viral quasispecies. While high-throughput sequencing (HTS) has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small.

    Results

    This paper presents TenSQR, an algorithm that utilizes a tensor factorization framework to analyze HTS data and reconstruct viral quasispecies characterized by highly uneven frequencies of their components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities 1–10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains.

    Availability and implementation

    TenSQR is available at https://github.com/SoYeonA/TenSQR.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  4. Abstract Motivation

    Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired when too few samples are available to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species using composition information alone. As an alternative binning approach, Hi-C-based binning employs the metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts, inevitably generated by incorrect ligations of DNA fragments between species, link contigs from different genomes and weaken the purity of the final draft genomic bins. Therefore, it is imperative to develop a binning pipeline that overcomes the shortcomings of both types of binning methods on a single sample.

    Results

    We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs.

    Availability and implementation

    HiFine is available at https://github.com/dyxstat/HiFine.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  5. Abstract Motivation

    Recent experiments have provided Hi-C data at resolution as high as 1 kbp. However, 3D structural inference from high-resolution Hi-C datasets is often computationally unfeasible using existing methods.

    Results

    We have developed miniMDS, an approximation of multidimensional scaling (MDS) that partitions a Hi-C dataset, performs high-resolution MDS separately on each partition, and then reassembles the partitions using low-resolution MDS. miniMDS is faster, more accurate, and uses less memory than existing methods for inferring the human genome at high resolution (10 kbp).

    Availability and implementation

    A Python implementation of miniMDS is available on GitHub: https://github.com/seqcode/miniMDS.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
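
The multidimensional scaling (MDS) step that the miniMDS abstract refers to can be illustrated with a minimal sketch of classical (Torgerson) MDS in Python with NumPy. This is not the miniMDS code: the function name `classical_mds` is hypothetical, the partitioning and reassembly steps are omitted, and the input is assumed to be a matrix of pairwise distances already derived from Hi-C contacts by some conversion that is not shown here.

```python
import numpy as np


def classical_mds(dist: np.ndarray, dim: int = 3) -> np.ndarray:
    """Embed points in `dim` dimensions from a symmetric pairwise distance matrix."""
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram (inner-product) matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the `dim` largest.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:dim]
    # Clip small negative eigenvalues that arise from noise or rounding.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy check on synthetic 3D points (made-up data, not Hi-C).
rng = np.random.default_rng(0)
points = rng.normal(size=(20, 3))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = classical_mds(dist, dim=3)
print(coords.shape)
```

The recovered coordinates match the original points only up to rotation, reflection and translation, so comparisons are best made on pairwise distances rather than on raw coordinates.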