skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, June 13 until 2:00 AM ET on Friday, June 14 due to maintenance. We apologize for the inconvenience.


Search for: All records

Creators/Authors contains: "Raphael, Benjamin J."

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. null (Ed.)
  2. Single-cell RNA-sequencing (scRNA-seq) enables high throughput measurement of RNA expression in individual cells. Due to technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard analysis methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc combines network-regularized non-negative matrix factorization with a procedure for handling zero inflation in transcript count matrices. The matrix factorization results in a low-dimensional representation of the transcript count matrix, which imputes gene abundance for both zero and non-zero entries and can be used to cluster cells. The network regularization leverages prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be close in the low-dimensional representation. We show that netNMF-sc outperforms existing methods on simulated and real scRNA-seq data, with increasing advantage at higher dropout rates (e.g. above 60%). Furthermore, we show that the results from netNMF-sc -- including estimation of gene-gene covariance -- are robust to choice of network, with more representative networks leading to greater performance gains. 
    more » « less
  3. Abstract Motivation

    Current technologies for single-cell DNA sequencing require whole-genome amplification (WGA), as a single cell contains too little DNA for direct sequencing. Unfortunately, WGA introduces biases in the resulting sequencing data, including non-uniformity in genome coverage and high rates of allele dropout. These biases complicate many downstream analyses, including the detection of genomic variants.

    Results

    We show that amplification biases have a potential upside: long-range correlations in rates of allele dropout provide a signal for phasing haplotypes at the lengths of amplicons from WGA, lengths which are generally longer than than individual sequence reads. We describe a statistical test to measure concurrent allele dropout between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. We use results of this test to perform haplotype assembly across a collection of single cells. We demonstrate that the algorithm predicts phasing between pairs of SNPs with higher accuracy than phasing from reads alone. Using whole-genome sequencing data from only seven neural cells, we obtain haplotype blocks that are orders of magnitude longer than with sequence reads alone (median length 10.2 kb versus 312 bp), with error rates <2%. We demonstrate similar advantages on whole-exome data from 16 cells, where we obtain haplotype blocks with median length 9.2 kb—comparable to typical gene lengths—compared with median lengths of 41 bp with sequence reads alone, with error rates <4%. Our algorithm will be useful for haplotyping of rare alleles and studies of allele-specific somatic aberrations.

    Availability and implementation

    Source code is available at https://www.github.com/raphael-group.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Abstract Motivation

    The analysis of high-dimensional ‘omics data is often informed by the use of biological interaction networks. For example, protein–protein interaction networks have been used to analyze gene expression data, to prioritize germline variants, and to identify somatic driver mutations in cancer. In these and other applications, the underlying computational problem is to identify altered subnetworks containing genes that are both highly altered in an ‘omics dataset and are topologically close (e.g. connected) on an interaction network.

    Results

    We introduce Hierarchical HotNet, an algorithm that finds a hierarchy of altered subnetworks. Hierarchical HotNet assesses the statistical significance of the resulting subnetworks over a range of biological scales and explicitly controls for ascertainment bias in the network. We evaluate the performance of Hierarchical HotNet and several other algorithms that identify altered subnetworks on the problem of predicting cancer genes and significantly mutated subnetworks. On somatic mutation data from The Cancer Genome Atlas, Hierarchical HotNet outperforms other methods and identifies significantly mutated subnetworks containing both well-known cancer genes and candidate cancer genes that are rarely mutated in the cohort. Hierarchical HotNet is a robust algorithm for identifying altered subnetworks across different ‘omics datasets.

    Availability and implementation

    http://github.com/raphael-group/hierarchical-hotnet.

    Supplementary information

    Supplementary material are available at Bioinformatics online.

     
    more » « less
  5. Abstract Motivation

    The traditional view of cancer evolution states that a cancer genome accumulates a sequential ordering of mutations over a long period of time. However, in recent years it has been suggested that a cancer genome may instead undergo a one-time catastrophic event, such as chromothripsis, where a large number of mutations instead occur simultaneously. A number of potential signatures of chromothripsis have been proposed. In this work, we provide a rigorous formulation and analysis of the ‘ability to walk the derivative chromosome’ signature originally proposed by Korbel and Campbell. In particular, we show that this signature, as originally envisioned, may not always be present in a chromothripsis genome and we provide a precise quantification of under what circumstances it would be present. We also propose a variation on this signature, the H/T alternating fraction, which allows us to overcome some of the limitations of the original signature.

    Results

    We apply our measure to both simulated data and a previously analyzed real cancer dataset and find that the H/T alternating fraction may provide useful signal for distinguishing genomes having acquired mutations simultaneously from those acquired in a sequential fashion.

    Availability and implementation

    An implementation of the H/T alternating fraction is available at https://bitbucket.org/oesperlab/ht-altfrac.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  6. Abstract Motivation

    The somatic mutations in the pathways that drive cancer development tend to be mutually exclusive across tumors, providing a signal for distinguishing driver mutations from a larger number of random passenger mutations. This mutual exclusivity signal can be confounded by high and highly variable mutation rates across a cohort of samples. Current statistical tests for exclusivity that incorporate both per-gene and per-sample mutational frequencies are computationally expensive and have limited precision.

    Results

    We formulate a weighted exact test for assessing the significance of mutual exclusivity in an arbitrary number of mutational events. Our test conditions on the number of samples with a mutation as well as per-event, per-sample mutation probabilities. We provide a recursive formula to compute P-values for the weighted test exactly as well as a highly accurate and efficient saddlepoint approximation of the test. We use our test to approximate a commonly used permutation test for exclusivity that conditions on per-event, per-sample mutation frequencies. However, our test is more efficient and it recovers more significant results than the permutation test. We use our Weighted Exclusivity Test (WExT) software to analyze hundreds of colorectal and endometrial samples from The Cancer Genome Atlas, which are two cancer types that often have extremely high mutation rates. On both cancer types, the weighted test identifies sets of mutually exclusive mutations in cancer genes with fewer false positives than earlier approaches.

    Availability and Implementation

    See http://compbio.cs.brown.edu/projects/wext for software.

    Contact

    braphael@cs.brown.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less