Search for: All records

Award ID contains: 1705077


  1. Alkan, Can (Ed.)
    Abstract

    Motivation

    Detection of structural variants (SVs) from the alignment of sample DNA reads to the reference genome is an important problem in understanding human diseases. Long reads that can span repeat regions, together with accurate alignment of these reads, play an important role in identifying novel SVs. Long-read sequencing technologies, such as nanopore sequencing, can address this problem by providing very long reads, but with high error rates that make accurate alignment challenging. Many errors induced by nanopore sequencing are biased because of the physics of the sequencing process, and proper utilization of these error characteristics can play an important role in designing a robust aligner for SV detection. In this article, we design and evaluate HQAlign, an aligner for SV detection using nanopore sequenced reads. The key ideas of HQAlign include (i) using base-called nanopore reads along with the nanopore physics to improve alignments for SVs, (ii) incorporating SV-specific changes to the alignment pipeline, and (iii) adapting these into the existing state-of-the-art long-read aligner pipeline, minimap2 (v2.24), for efficient alignments.

    Results

    We show that HQAlign captures about 4%–6% complementary SVs across different datasets that are missed by minimap2 alignments, while its standalone performance is on par with minimap2 on real nanopore read data. For the SV calls common to HQAlign and minimap2, HQAlign improves the start and end breakpoint accuracy by about 10%–50% across different datasets. Moreover, HQAlign improves the alignment rate from minimap2's 85.64% to 89.35% for nanopore read alignment to the recent telomere-to-telomere CHM13 assembly, and from 83.48% to 86.65% for nanopore read alignment to the GRCh37 human genome.

    Availability and implementation

    https://github.com/joshidhaivat/HQAlign.git.

     
    Free, publicly-accessible full text available October 1, 2024
  2. Group testing is a technique that can reduce the number of tests needed to identify infected members in a population by pooling together multiple diagnostic samples. Despite the variety and importance of prior results, traditional work on group testing has typically assumed independent infections. However, contagious diseases among humans, like SARS-CoV-2, have an important characteristic: infections are governed by community spread and are therefore correlated. In this paper, we explore this observation and argue that taking the community structure into account when testing can lead to significant savings in the number of tests required to guarantee a given identification accuracy. To show this, we start with a simplistic (yet practical) infection model, where the entire population is organized in (possibly overlapping) communities and the infection probability of an individual depends on the communities (s)he participates in. Given this model, we compute new lower bounds on the number of tests for zero-error identification and design community-aware group testing algorithms that can be optimal under certain assumptions. Finally, we demonstrate significant benefits over traditional, community-agnostic group testing via simulations using both noiseless and noisy tests.
    Free, publicly-accessible full text available July 1, 2024
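The saving from community-aware pooling can be illustrated with a minimal sketch. It is not the algorithm from the paper: it assumes simple two-stage pooling (test the pool once, then retest every member individually if the pool is positive), noiseless tests, and infections that are independent given each individual's community-dependent probability. The community sizes and infection rates are made-up illustrative values.

```python
from math import prod


def expected_tests(pool):
    """Expected number of tests spent on one pool under two-stage pooling.

    `pool` is a list of per-individual infection probabilities, assumed
    independent (an assumption of this sketch, not of the paper).
    """
    if len(pool) == 1:
        return 1.0  # a single individual is simply tested once
    p_positive = 1.0 - prod(1.0 - p for p in pool)
    # one pooled test, plus len(pool) individual retests if the pool is positive
    return 1.0 + len(pool) * p_positive


def total_expected_tests(pools):
    """Sum of expected tests over a partition of the population into pools."""
    return sum(expected_tests(pool) for pool in pools)


# Hypothetical population: 50 people in a low-risk community and
# 50 people in a high-risk community.
LOW, HIGH = 0.01, 0.30

# Community-agnostic design: 10 pools of 10, each mixing both communities.
mixed_pools = [[LOW] * 5 + [HIGH] * 5 for _ in range(10)]

# Community-aware design: pool the low-risk community in groups of 10,
# and test high-risk individuals one by one.
aware_pools = [[LOW] * 10 for _ in range(5)] + [[HIGH] for _ in range(50)]

mixed = total_expected_tests(mixed_pools)
aware = total_expected_tests(aware_pools)
print(f"community-agnostic: {mixed:.1f} expected tests")
print(f"community-aware:    {aware:.1f} expected tests")
```

In this toy setting the mixed pools are almost always positive because of their high-risk members, so pooling saves very little; separating the communities lets the low-risk pools come back negative most of the time.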
  3. In this paper, we propose algorithms that leverage a known community structure to make group testing more efficient. We consider a population organized in connected communities: each individual participates in one or more communities, and the infection probability of each individual depends on the communities (s)he participates in. Use cases include students who participate in several classes, and workers who share common spaces. Group testing reduces the number of tests needed to identify the infected individuals by pooling diagnostic samples and testing them together. We show that making testing algorithms aware of the community structure can significantly reduce the number of tests needed for both adaptive and non-adaptive group testing.
  4. SIR (Susceptible, Infected or Recovered) stochastic network models are commonly used to describe the progression of epidemics inside a network. A task of interest in epidemiology is to use these models to estimate the state evolution, both at the individual and at the population level. In this paper, we propose using continual testing to improve the state estimation at the individual level. Our testing is inspired by entropy reduction principles and requires only a small number of tests.
  5. Recent advances in DNA sequencing technology and DNA storage systems have rekindled interest in deletion channels. Multiple recent works have looked at variants of sequence reconstruction over a single and over multiple deletion channels, a notoriously difficult problem due to its highly combinatorial nature. Although works in theoretical computer science have provided algorithms that guarantee perfect reconstruction with multiple independent observations from the deletion channel, they are only applicable in the large blocklength regime and, more restrictively, when the number of observations is also large. Indeed, with only a few observations, perfect reconstruction of the input sequence may not even be possible in most cases. In such situations, computing the maximum likelihood (ML) and maximum a posteriori (MAP) estimates for the deletion channels are natural questions, and to the best of our knowledge these have remained open. In this work, we take steps to answer these two questions. Specifically: (i) we show that solving for the ML estimate over the single deletion channel (which can be cast as a discrete optimization problem) is equivalent to solving its relaxation, a continuous optimization problem; (ii) we exactly compute the symbolwise posterior distributions (under some assumptions on the priors) for both the single and the multiple deletion channels. As part of our contributions, we also introduce tools to visualize and analyze error events, which we believe could be useful in other related problems concerning deletion channels.
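To make the discrete ML problem for the single deletion channel concrete, here is a minimal brute-force sketch. It is not the method of the paper. It assumes an i.i.d. deletion channel with deletion probability `d`, for which the likelihood of observing `y` from an input `x` of length n is the number of ways `y` embeds in `x` as a subsequence, multiplied by d^(n-m) (1-d)^m with m the length of `y`. For a fixed input length, the ML estimate therefore maximizes the embedding count. The function names are hypothetical, and the exhaustive search is only feasible for very short sequences.

```python
from itertools import product


def count_embeddings(x: str, y: str) -> int:
    """Number of distinct index sets at which y occurs as a subsequence of x."""
    # ways[j] = number of ways to embed y[:j] in the prefix of x seen so far
    ways = [1] + [0] * len(y)
    for xc in x:
        # iterate j downwards so each symbol of x is used at most once per embedding
        for j in range(len(y), 0, -1):
            if y[j - 1] == xc:
                ways[j] += ways[j - 1]
    return ways[len(y)]


def likelihood(x: str, y: str, d: float) -> float:
    """P(y | x) for an i.i.d. deletion channel with deletion probability d."""
    n, m = len(x), len(y)
    return count_embeddings(x, y) * d ** (n - m) * (1 - d) ** m


def ml_estimates(y: str, n: int, alphabet: str = "01") -> list:
    """All length-n inputs maximizing the likelihood of y (exhaustive search)."""
    best_count, best_inputs = -1, []
    for candidate in product(alphabet, repeat=n):
        x = "".join(candidate)
        c = count_embeddings(x, y)
        if c > best_count:
            best_count, best_inputs = c, [x]
        elif c == best_count:
            best_inputs.append(x)
    return best_inputs


print(ml_estimates("01", 4))
```

The search space grows as |alphabet|^n, which is one way to see why the discrete formulation is hard and why a continuous relaxation is of interest.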
  6. In this paper, we propose algorithms that leverage a known community structure to make group testing more efficient. We consider a population organized in disjoint communities: each individual participates in a community, and the individual's infection probability depends on the community (s)he participates in. Use cases include families, students who participate in several classes, and workers who share common spaces. Group testing reduces the number of tests needed to identify the infected individuals by pooling diagnostic samples and testing them together. We show that if we design the testing strategy taking into account the community structure, we can significantly reduce the number of tests needed for adaptive and non-adaptive group testing, and can improve the reliability in cases where tests are noisy.
  7. Efficient and accurate alignment of DNA/RNA sequence reads to each other or to a reference genome/transcriptome is an important problem in genomic analysis. Nanopore sequencing has emerged as a major sequencing technology and many long-read aligners have been designed for aligning nanopore reads. However, the high error rate makes accurate and efficient alignment difficult. Utilizing the noise and error characteristics inherent in the sequencing process properly can play a vital role in constructing a robust aligner. In this article, we design QAlign, a pre-processor that can be used with any long-read aligner for aligning long reads to a genome/transcriptome or to other long reads. The key idea in QAlign is to convert the nucleotide reads into discretized current levels that capture the error modes of the nanopore sequencer before running them through a sequence aligner. We show that QAlign is able to improve alignment rates from around 80% up to 90% with nanopore reads when aligning to the genome. We also show that QAlign improves the average overlap quality by 9.2%, 2.5% and 10.8% in three real datasets for read-to-read alignment. Read-to-transcriptome alignment rates are improved from 51.6% to 75.4% and from 82.6% to 90% in two real datasets. Availability and implementation: https://github.com/joshidhaivat/QAlign.git. Supplementary data are available at Bioinformatics online.