Title: Detecting genomic deletions from high-throughput sequence data with unsupervised learning
Abstract

Background

Structural variation (SV), which ranges from 50 bp to $$\sim$$3 Mb in size, is an important type of genetic variation. Deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. Three types of signals, namely discordant read-pairs, read depth and split reads, are commonly used for SV detection from high-throughput sequence data. Many tools have been developed for detecting SVs by using one or more of these signals.

Results

In this paper, we develop a new method called EigenDel for detecting germline submicroscopic genomic deletions. EigenDel first uses discordant read-pairs and clipped reads to obtain initial deletion candidates, and then clusters similar candidates by using unsupervised learning methods. After that, EigenDel uses a carefully designed approach for calling true deletions from each cluster. We conduct various experiments to evaluate the performance of EigenDel on low-coverage sequence data.
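
As a rough illustration of the first step described above, the following Python sketch flags discordant read-pairs by insert size and groups the ones that support the same region. It is not EigenDel's implementation: the `ReadPair` record, the `lib_mean`, `lib_sd` and `n_sd` parameters, and the simple interval-intersection grouping are all assumptions made for this example, and the grouping is only a stand-in for the unsupervised clustering that the abstract mentions without detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical minimal read-pair record (not a real BAM/SAM API)."""
    left_end: int     # reference position where the left mate ends
    right_start: int  # reference position where the right mate starts
    insert_size: int  # observed outer distance between the two mates


def discordant_pairs(pairs, lib_mean, lib_sd, n_sd=3.0):
    """Keep pairs whose insert size exceeds lib_mean + n_sd * lib_sd.

    An unusually large insert size suggests that reference sequence
    between the two mates is missing from the sample (a deletion).
    """
    cutoff = lib_mean + n_sd * lib_sd
    return [p for p in pairs if p.insert_size > cutoff]


def merge_candidates(pairs):
    """Group discordant pairs whose implied deletion intervals overlap.

    Each group is reported as (start, end, support), where start/end is
    the intersection of the member intervals and support is the number
    of pairs in the group. Assumed stand-in for the clustering step.
    """
    merged = []  # each entry: [start, end, support]
    for p in sorted(pairs, key=lambda q: q.left_end):
        if merged and p.left_end < merged[-1][1]:
            cand = merged[-1]
            cand[0] = max(cand[0], p.left_end)
            cand[1] = min(cand[1], p.right_start)
            cand[2] += 1
        else:
            merged.append([p.left_end, p.right_start, 1])
    return [tuple(c) for c in merged]


# Made-up example data: one concordant pair and four discordant ones.
example_pairs = [
    ReadPair(1000, 1100, 300),
    ReadPair(5000, 6200, 1400),
    ReadPair(5020, 6230, 1410),
    ReadPair(5050, 6210, 1360),
    ReadPair(20000, 22000, 2200),
]

if __name__ == "__main__":
    found = discordant_pairs(example_pairs, lib_mean=300, lib_sd=30)
    for start, end, support in merge_candidates(found):
        print(f"candidate deletion {start}-{end}, supported by {support} pair(s)")
```

In this toy setting, a candidate supported by several pairs is more plausible than one supported by a single pair; a real caller would also use clipped reads and read depth, as described in the Background.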

Conclusions

Our results show that EigenDel outperforms other major methods in balancing accuracy and sensitivity as well as in reducing bias. EigenDel can be downloaded from https://github.com/lxwgcool/EigenDel.

 
NSF-PAR ID: 10393318
Author(s) / Creator(s):
Publisher / Repository: Springer Science + Business Media
Date Published:
Journal Name: BMC Bioinformatics
Volume: 23
Issue: S8
ISSN: 1471-2105
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35–0.65 bits per base, which is 3–6$$\times$$ lower than general purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (>4$$\times$$ faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.

     
  2. Abstract

    Database peptide search is the primary computational technique for identifying peptides from mass spectrometry (MS) data. Graphics processing unit (GPU) computing is now ubiquitous in the current generation of high-performance computing (HPC) systems, yet its application in the database peptide search domain remains limited. Part of the reason is the use of sub-optimal algorithms in the existing GPU-accelerated methods, resulting in significantly inefficient hardware utilization. In this paper, we design and implement a new-age CPU-GPU HPC framework, called GiCOPS, for efficient and complete GPU-acceleration of the modern database peptide search algorithms on supercomputers. Our experimentation shows that GiCOPS exhibits between 1.2 and 5$$\times$$ speed improvement over its CPU-only predecessor, HiCOPS, and over 10$$\times$$ improvement over several existing GPU-based database search algorithms for sufficiently large experiment sizes. We further assess and optimize the performance of our framework using the Roofline Model and report near-optimal results for several metrics including computations per second, occupancy rate, memory workload, branch efficiency and shared memory performance. Finally, the CPU-GPU methods and optimizations proposed in our work for complex integer- and memory-bounded algorithmic pipelines can also be extended to accelerate the existing and future peptide identification algorithms. GiCOPS is now integrated with our umbrella HPC framework HiCOPS and is available at: https://github.com/pcdslab/gicops.

     
  3. Abstract

    In experiments with significant perturbations to transcription, nascent RNA sequencing protocols are dependent on external spike-ins for reliable normalization. Unlike in RNA-seq, these spike-ins are not standardized and, in many cases, depend on a run-on reaction that is assumed to have constant efficiency across samples. To assess the validity of this assumption, we analyze a large number of published nascent RNA spike-ins to quantify their variability across existing normalization methods. Furthermore, we develop a new biologically-informed Bayesian model to estimate the error in spike-in based normalization estimates, which we term Virtual Spike-In (VSI). We apply this method both to published external spike-ins and to reads at the $$3^\prime$$ end of long genes, building on prior work from Mahat (Mol Cell 62(1):63–78, 2016. https://doi.org/10.1016/j.molcel.2016.02.025) and Vihervaara (Nat Commun 8(1):255, 2017. https://doi.org/10.1038/s41467-017-00151-0). We find that spike-ins in existing nascent RNA experiments are typically under-sequenced, with high variability between samples. Furthermore, we show that these high-variability estimates can have significant downstream effects on analysis, complicating biological interpretations of results.

     
  4. Abstract

    Given a suitable solution $$V(t,x)$$ to the Korteweg–de Vries equation on the real line, we prove global well-posedness for initial data $$u(0,x) \in V(0,x) + H^{-1}(\mathbb{R})$$. Our conditions on $$V$$ do include regularity but do not impose any assumptions on spatial asymptotics. We show that periodic profiles $$V(0,x)\in H^5(\mathbb{R}/\mathbb{Z})$$ satisfy our hypotheses. In particular, we can treat localized perturbations of the much-studied periodic traveling wave solutions (cnoidal waves) of KdV. In the companion paper Laurens (Nonlinearity. 35(1):343–387, 2022. https://doi.org/10.1088/1361-6544/ac37f5) we show that smooth step-like initial data also satisfy our hypotheses. We employ the method of commuting flows introduced in Killip and Vişan (Ann. Math. (2) 190(1):249–305, 2019. https://doi.org/10.4007/annals.2019.190.1.4) where $$V\equiv 0$$. In that setting, it is known that $$H^{-1}(\mathbb{R})$$ is sharp in the class of $$H^s(\mathbb{R})$$ spaces.

     
  5. Abstract

    In this paper we disprove part of a conjecture of Lieb and Thirring concerning the best constant in their eponymous inequality. We prove that the best Lieb–Thirring constant when the eigenvalues of a Schrödinger operator $$-\Delta +V(x)$$ are raised to the power $$\kappa$$ is never given by the one-bound state case when $$\kappa >\max (0,2-d/2)$$ in space dimension $$d\ge 1$$. When in addition $$\kappa \ge 1$$ we prove that this best constant is never attained for a potential having finitely many eigenvalues. The method to obtain the first result is to carefully compute the exponentially small interaction between two Gagliardo–Nirenberg optimisers placed far away. For the second result, we study the dual version of the Lieb–Thirring inequality, in the same spirit as in Part I of this work Gontier et al. (The nonlinear Schrödinger equation for orthonormal functions I. Existence of ground states. Arch. Rat. Mech. Anal, 2021. https://doi.org/10.1007/s00205-021-01634-7). In a different but related direction, we also show that the cubic nonlinear Schrödinger equation admits no orthonormal ground state in 1D, for more than one function.

     