Classification of DNA Sequences: Performance Evaluation of Multiple Machine Learning Methods
Polymerase chain reaction (PCR) has long been the mainstay in genetic sequencing and identification. Whether short-read or long-read technologies are adopted, PCR methods are generally time-consuming and expensive. Recently, an all-electronic approach, the so-called Single Molecule Break Junction (SMBJ) method, has been proposed as a possible alternative to PCR. In this article, we evaluate the performance of four different classifier models on the current signatures of ten short strand sequences, including a pair that differs by a single mismatch. We find that a gradient-boosted tree classifier model achieves accuracies ranging from approximately 96% for molecules differing by a single mismatch to 99.5% otherwise.
- Award ID(s): 1807391
- PAR ID: 10467293
- Publisher / Repository: IEEE
- Date Published:
- ISSN: 1944-9380
- ISBN: 978-1-6654-5225-0
- Page Range / eLocation ID: 333 to 336
- Format(s): Medium: X
- Location: Palma de Mallorca, Spain
- Sponsoring Org: National Science Foundation
More Like this
- Background: The all-electronic Single Molecule Break Junction (SMBJ) method is an emerging alternative to traditional polymerase chain reaction (PCR) techniques for genetic sequencing and identification. Existing work indicates that the current spectra recorded in SMBJ experiments contain unique signatures that can identify known sequences in a dataset. However, the spectra are typically extremely noisy due to the stochastic and complex interactions between the substrate, sample, environment, and the measuring system, necessitating hundreds or thousands of experiments to obtain reliable and accurate results. Results: This article presents a DNA sequence identification system based on the current spectra of ten short strand sequences, including a pair that differs by a single mismatch. By employing a gradient-boosted tree classifier model trained on conductance histograms, we demonstrate that extremely high accuracy, ranging from approximately 96% for molecules differing by a single mismatch to 99.5% otherwise, is possible. Further, such accuracy is achievable in near real time with just twenty or thirty SMBJ measurements instead of hundreds or thousands. We also demonstrate that a tandem classifier architecture, where the first stage is a multiclass classifier and the second stage is a binary classifier, can be employed to boost the identification accuracy of the single-mismatch pair to 99.5%. Conclusions: A monolithic classifier, or more generally a multistage classifier with model-specific parameters that depend on experimental current spectra, can be used to successfully identify DNA strands.
- For any genome-based research, a robust genome assembly is required. De novo assembly strategies have evolved with changes in DNA sequencing technologies and have been through at least three phases: i) short-read only, ii) short- and long-read hybrid, and iii) long-read only assemblies. Each of the phases has its own error model. We hypothesized that hidden scaffolding errors in short-read assembly and erroneous long-read contigs degrade the quality of short- and long-read hybrid assemblies. We assembled the genome of T. borchgrevinki from data generated during each of the three phases and assessed the quality problems we encountered. We developed strategies such as k-mer-assembled region replacement, parameter optimization, and long-read sampling to address the error models. We demonstrated that a k-mer-based strategy improved short-read assemblies as measured by BUSCO, while mate-pair libraries introduced hidden scaffolding errors and perturbed BUSCO scores. Further, we found that although hybrid assemblies can generate higher contiguity, they tend to suffer from lower quality. In addition, we found that long-read only assemblies can be optimized for contiguity by sub-sampling length-restricted raw reads. Our results indicate that long-read contig assembly is the current best choice and that assemblies from phase I and phase II were of lower quality.
- Structural variations are the greatest source of genetic variation, but they remain poorly understood because of technological limitations. Single-molecule long-read sequencing has the potential to dramatically advance the field, although high error rates are a challenge with existing methods. Addressing this need, we introduce open-source methods for long-read alignment (NGMLR; https://github.com/philres/ngmlr) and structural variant identification (Sniffles; https://github.com/fritzsedlazeck/Sniffles) that provide unprecedented sensitivity and precision for variant detection, even in repeat-rich regions and for complex nested events that can have substantial effects on human health. In several long-read datasets, including healthy and cancerous human genomes, we discovered thousands of novel variants and categorized systematic errors in short-read approaches. NGMLR and Sniffles can automatically filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the application of long reads in clinical and research settings.
- Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.

